Mobile-AI-Assistant
On-device AI assistant with flexible model selection. Pick models that fit your phone's resources and switch anytime for different tasks. 100% offline & private.
Inspiration
We were frustrated by the current state of mobile AI apps. Most either require constant internet connectivity, send your private conversations to cloud servers, or lock you into a single model that may not work well on your device. We asked ourselves: Why can't users choose their own AI model based on their phone's capabilities?
The Arm AI Developer Challenge gave us the perfect opportunity to build what we envisioned — a truly flexible, privacy-first AI assistant that puts users in control. We wanted to democratize on-device AI so that someone with a budget phone could run a lightweight 0.5B model, while someone with a flagship could leverage a powerful 7B+ model — all within the same app.
What it does
Mobile-AI-Assistant is a fully offline AI chatbot that runs large language models locally on Arm-powered Android devices. Key capabilities include:
- Flexible Model Selection: Load ANY compatible GGUF model — users choose models that match their device's RAM and CPU capabilities
- Task-Specific Switching: Hot-swap models without reinstalling — use a coding model for programming, a creative model for writing
- System-Wide AI Access: Floating assistant bubble accessible from any app, plus text selection integration
- Real-Time Resource Monitoring: Live CPU/memory tracking so users can see exactly how their device handles the model (a minimal sketch follows this list)
- Complete Privacy: Zero internet required, zero data transmission — conversations never leave your device
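To give a flavour of the resource-monitoring feature, here is a minimal sketch (illustrative, not the app's exact code) that polls Android's `ActivityManager` for a memory snapshot the UI could display next to the loaded model:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Minimal sketch of the memory side of live resource tracking (illustrative,
// not the app's exact code): poll ActivityManager and surface how much
// headroom the currently loaded model leaves on the device.
data class MemorySnapshot(val availMb: Long, val totalMb: Long, val lowMemory: Boolean)

fun readMemorySnapshot(context: Context): MemorySnapshot {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    return MemorySnapshot(
        availMb = info.availMem / (1024 * 1024),
        totalMb = info.totalMem / (1024 * 1024),
        lowMemory = info.lowMemory
    )
}
```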
How we built it
Architecture Stack:
- UI Layer: Jetpack Compose with Material 3 design system
- Business Logic: Kotlin with MVVM architecture, Coroutines for async operations
- Inference Layer: Custom multi-engine abstraction supporting GGUF (llama.cpp) and ExecuTorch (sketched after this list)
- Native Layer: C++ with JNI bindings to llama.cpp inference engine
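As a rough sketch of that abstraction layer (interface and method names are illustrative, not the project's actual API), the UI depends only on a small engine contract that the llama.cpp-backed GGUF engine implements today and an ExecuTorch engine can implement later:

```kotlin
import kotlinx.coroutines.flow.Flow

// Illustrative engine contract (names hypothetical): ViewModels depend only on
// this interface, so GGUF (llama.cpp) now and ExecuTorch later are swappable.
interface InferenceEngine {
    suspend fun loadModel(path: String): Boolean   // returns false for incompatible models
    fun generate(prompt: String): Flow<String>     // emits tokens as they are decoded
    fun unload()                                   // frees native resources for hot-swapping
}
```

Hot-swapping a model is then just `unload()` on the old engine followed by `loadModel()` on the new one.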
Efficient Autoregressive Inference
To enable responsive long-form conversations on mobile devices, the assistant leverages key–value (KV) cache reuse during autoregressive decoding. Instead of recomputing attention for all previous tokens at every step, intermediate attention states are cached and reused, ensuring decoding cost grows linearly rather than quadratically with conversation length.
This design allows the assistant to maintain consistent latency even as context grows, which is critical for on-device AI under mobile memory and thermal constraints.
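A toy, single-head illustration of the idea (real engines such as llama.cpp keep per-layer, multi-head caches in native memory; this sketch only shows where the linear cost comes from):

```kotlin
import kotlin.math.exp
import kotlin.math.sqrt

// Toy single-head attention cache showing where the linear decode cost comes
// from. Real engines (e.g. llama.cpp) keep per-layer, multi-head caches in
// native memory; this is only a conceptual sketch.
class KvCache(private val dim: Int) {
    private val keys = mutableListOf<FloatArray>()
    private val values = mutableListOf<FloatArray>()

    // Called once per generated token: cache its key/value projections instead
    // of recomputing every earlier token's projections on the next step.
    fun append(key: FloatArray, value: FloatArray) {
        keys.add(key)
        values.add(value)
    }

    // Attend the newest query against all cached keys: a single pass over the
    // history, so each step costs O(context) rather than O(context^2) overall.
    fun attend(query: FloatArray): FloatArray {
        if (keys.isEmpty()) return FloatArray(dim)
        val scores = keys.map { k -> dot(query, k) / sqrt(dim.toDouble()) }
        val maxScore = scores.maxOrNull() ?: 0.0
        val weights = scores.map { exp(it - maxScore) }
        val norm = weights.sum()
        val out = FloatArray(dim)
        values.forEachIndexed { i, v ->
            val w = (weights[i] / norm).toFloat()
            for (d in 0 until dim) out[d] += w * v[d]
        }
        return out
    }

    private fun dot(a: FloatArray, b: FloatArray): Double =
        a.indices.sumOf { i -> (a[i] * b[i]).toDouble() }
}
```

Each decode step appends exactly one key/value pair and attends once over the history, so per-step cost grows with context length rather than with its square.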
Arm Optimizations:
- Enabled ARM NEON SIMD instructions for 4-8x faster matrix operations
- Compiled with `-O3 -ffast-math -march=armv8-a+simd` flags (see the build fragment after this list)
- Disabled unnecessary features (CURL, OpenMP) to reduce binary size and battery drain
- Implemented greedy sampling for fastest token generation
- Optimized context window (2048 tokens) for mobile memory constraints
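These build settings could be wired up roughly like this in the Gradle Kotlin DSL. The llama.cpp CMake option names differ between versions, so treat them as placeholders rather than the project's exact configuration:

```kotlin
// Illustrative build.gradle.kts fragment, not the project's exact file.
android {
    defaultConfig {
        ndk {
            abiFilters += "arm64-v8a"      // arm64 only: every code path can assume NEON/ASIMD
        }
        externalNativeBuild {
            cmake {
                cppFlags += listOf("-O3", "-ffast-math", "-march=armv8-a+simd")
                arguments += listOf(
                    "-DLLAMA_CURL=OFF",    // placeholder: no network code in the native library
                    "-DGGML_OPENMP=OFF"    // placeholder: rely on the engine's own thread pool
                )
            }
        }
    }
    externalNativeBuild {
        cmake { path = file("src/main/cpp/CMakeLists.txt") }
    }
}
```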
Key Technical Decisions:
- Built exclusively for `arm64-v8a` to maximize Arm-specific optimizations
- Used streaming token generation for responsive UX even on slower devices (sketched after this list)
- Implemented overlay service for system-wide floating assistant
- Created abstraction layer to support multiple inference engines (GGUF now, ExecuTorch ready)
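The streaming decision can be exposed to Compose as a cold `Flow` fed by a native callback. The JNI entry points below (`nativeGenerate`, `nativeStop`) and the listener interface are hypothetical stand-ins for the real bindings:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.flowOn

// Hypothetical callback the native layer drives once per decoded token.
interface TokenListener {
    fun onToken(token: String)
    fun onDone()
}

// Hypothetical JNI bindings; assumed to start decoding on a native thread
// and return immediately. The real project's entry points may differ.
private external fun nativeGenerate(prompt: String, listener: TokenListener)
private external fun nativeStop()

// Streaming generation as a cold Flow: the UI collects it and appends tokens
// to the visible message as they arrive, so the app feels responsive even
// when the device only manages a few tokens per second.
fun generateStream(prompt: String): Flow<String> = callbackFlow {
    nativeGenerate(prompt, object : TokenListener {
        override fun onToken(token: String) { trySend(token) }
        override fun onDone() { close() }
    })
    awaitClose { nativeStop() }   // collector finished or cancelled: stop decoding
}.flowOn(Dispatchers.Default)
```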
Challenges we ran into
Memory Management: LLMs are memory-hungry. We had to carefully manage native heap allocation and implement proper cleanup to prevent OOM crashes on devices with limited RAM.
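One way to make that cleanup discipline explicit is to keep the native context behind a single handle in an `AutoCloseable` wrapper (class, function, and library names here are hypothetical):

```kotlin
// Hypothetical wrapper illustrating the cleanup discipline described above:
// the native llama.cpp context lives behind one handle and is always freed,
// either on model switch or when the session is closed.
class NativeModelSession private constructor(private var handle: Long) : AutoCloseable {

    companion object {
        init { System.loadLibrary("mobile_ai_assistant") }  // hypothetical .so name

        fun open(modelPath: String, contextTokens: Int = 2048): NativeModelSession {
            val handle = nativeLoadModel(modelPath, contextTokens)
            require(handle != 0L) { "Failed to load model: $modelPath" }
            return NativeModelSession(handle)
        }

        @JvmStatic private external fun nativeLoadModel(path: String, ctx: Int): Long
        @JvmStatic private external fun nativeFree(handle: Long)
    }

    override fun close() {
        if (handle != 0L) {
            nativeFree(handle)   // release the native heap before the handle is lost
            handle = 0L
        }
    }
}
```

Wrapping generation in `use { }` then guarantees the native memory is released even if something throws, and switching models is a `close()` followed by a fresh `open()`.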
JNI Complexity: Bridging Kotlin and C++ through JNI was tricky. Debugging crashes in native code required careful logging and understanding of both memory models.
Build System: Getting CMake, NDK, and Gradle to play nicely together with llama.cpp as a subproject took significant configuration effort.
Model Compatibility: Different GGUF models have different requirements. We had to handle various quantization formats and ensure graceful failures for incompatible models.
Overlay Permissions: Android's overlay permission system is complex. Making the floating assistant work reliably across different Android versions and OEM skins required extensive testing.
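The core of that permission flow is the standard `Settings.canDrawOverlays` check plus a redirect to the system settings screen; the part that needed per-OEM testing is everything around it:

```kotlin
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.provider.Settings

// The floating assistant can only be shown once the user grants the
// "draw over other apps" permission; this is the standard check/request flow.
fun ensureOverlayPermission(context: Context): Boolean {
    if (Settings.canDrawOverlays(context)) return true
    val intent = Intent(
        Settings.ACTION_MANAGE_OVERLAY_PERMISSION,
        Uri.parse("package:${context.packageName}")
    ).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    context.startActivity(intent)   // user toggles the permission in system settings
    return false
}
```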
Thermal Throttling: Sustained inference causes devices to heat up and throttle. We implemented resource monitoring to help users understand performance variations.
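On API 29+ the OS exposes its thermal status directly, which is a cheap signal to fold into the resource monitor (a sketch, not the app's exact code): when the status reaches MODERATE or worse, the UI can explain why tokens/second just dropped.

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Sketch: report the OS thermal status (API 29+) so the UI can explain
// performance dips during sustained inference.
fun watchThermalStatus(context: Context, onStatus: (Int) -> Unit) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    onStatus(pm.currentThermalStatus)                       // current status, e.g. THERMAL_STATUS_MODERATE
    pm.addThermalStatusListener { status -> onStatus(status) }
}
```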
Accomplishments that we're proud of
True Model Flexibility: We achieved our core vision — users can load ANY compatible GGUF model, giving them unprecedented control over their mobile AI experience
Production-Quality UX: The app feels polished with streaming responses, smooth animations, and intuitive controls — not just a tech demo
System-Wide Integration: The floating assistant and text selection features bring AI to every app on the phone, not just our own
Performance on Arm: Achieved 8-15 tokens/second on modern Arm devices with proper NEON optimizations — fast enough for real conversations
Zero Cloud Dependencies: The app works completely offline with no analytics, tracking, or data collection — true privacy
Clean Architecture: The multi-engine abstraction means we can easily add new inference backends (ExecuTorch, ONNX) in the future
What we learned
Arm NEON is powerful: Properly leveraging SIMD instructions makes a massive difference in inference speed — it's not just a checkbox feature
Mobile constraints require creativity: Limited RAM, thermal throttling, and battery concerns forced us to think differently about LLM deployment
User control matters: Giving users the ability to choose their own models based on their device creates a much better experience than one-size-fits-all
Native code is worth it: The performance gains from C++/JNI integration far outweigh the development complexity for compute-intensive tasks
Privacy can be a feature: In an era of cloud AI, running everything locally is a genuine differentiator that users appreciate
What's next for Mobile-AI-Assistant
Voice Input/Output: On-device speech recognition and text-to-speech for hands-free conversations
In-App Model Browser: Download and manage models directly from HuggingFace without leaving the app
RAG Support: Load documents and PDFs for context-aware Q&A — your personal knowledge base
Conversation History: Multiple chat threads with persistent history and search
Prompt Templates: Pre-built prompts for common tasks (summarization, translation, code review)
Home Screen Widget: Quick access to AI without opening the full app
Wear OS Companion: Voice-activated AI assistant on your smartwatch
More Inference Engines: Full ExecuTorch integration, ONNX Runtime support for even more model options
Built with ❤️ for the Arm AI Developer Challenge 2025