Mobile-AI-Assistant
On-device AI assistant with flexible model selection. Pick models that fit your phone's resources and switch anytime for different tasks. 100% offline & private.
Inspiration
We were frustrated by the current state of mobile AI apps. Most either require constant internet connectivity, send your private conversations to cloud servers, or lock you into a single model that may not work well on your device. We asked ourselves: Why can't users choose their own AI model based on their phone's capabilities?
The Arm AI Developer Challenge gave us the perfect opportunity to build what we envisioned — a truly flexible, privacy-first AI assistant that puts users in control. We wanted to democratize on-device AI so that someone with a budget phone could run a lightweight 0.5B model, while someone with a flagship could leverage a powerful 7B+ model — all within the same app.
What it does
Mobile-AI-Assistant is a fully offline AI chatbot that runs large language models locally on Arm-powered Android devices. Key capabilities include:
- Flexible Model Selection: Load ANY compatible GGUF model — users choose models that match their device's RAM and CPU capabilities
- Task-Specific Switching: Hot-swap models without reinstalling — use a coding model for programming, a creative model for writing
- System-Wide AI Access: Floating assistant bubble accessible from any app, plus text selection integration
- Real-Time Resource Monitoring: Live CPU/memory tracking so users can see exactly how their device handles the model (a minimal sketch follows this list)
- Complete Privacy: Zero internet required, zero data transmission — conversations never leave your device
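To give a flavour of the resource-monitoring feature, here is a minimal sketch (illustrative, not the app's exact code) that polls Android's `ActivityManager` for a memory snapshot the UI could display next to the loaded model:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Minimal sketch of the memory side of live resource tracking (illustrative,
// not the app's exact code): poll ActivityManager and surface how much
// headroom the currently loaded model leaves on the device.
data class MemorySnapshot(val availMb: Long, val totalMb: Long, val lowMemory: Boolean)

fun readMemorySnapshot(context: Context): MemorySnapshot {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    return MemorySnapshot(
        availMb = info.availMem / (1024 * 1024),
        totalMb = info.totalMem / (1024 * 1024),
        lowMemory = info.lowMemory
    )
}
```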
How we built it
Architecture Stack:
- UI Layer: Jetpack Compose with Material 3 design system
- Business Logic: Kotlin with MVVM architecture, Coroutines for async operations
- Inference Layer: Custom multi-engine abstraction supporting GGUF (llama.cpp) and ExecuTorch (sketched after this list)
- Native Layer: C++ with JNI bindings to llama.cpp inference engine
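As a rough sketch of that abstraction layer (interface and method names are illustrative, not the project's actual API), the UI depends only on a small engine contract that the llama.cpp-backed GGUF engine implements today and an ExecuTorch engine can implement later:

```kotlin
import kotlinx.coroutines.flow.Flow

// Illustrative engine contract (names hypothetical): ViewModels depend only on
// this interface, so GGUF (llama.cpp) now and ExecuTorch later are swappable.
interface InferenceEngine {
    suspend fun loadModel(path: String): Boolean   // returns false for incompatible models
    fun generate(prompt: String): Flow<String>     // emits tokens as they are decoded
    fun unload()                                   // frees native resources for hot-swapping
}
```

Hot-swapping a model is then just `unload()` on the old engine followed by `loadModel()` on the new one.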
Efficient Autoregressive Inference
To enable responsive long-form conversations on mobile devices, the assistant leverages key–value (KV) cache reuse during autoregressive decoding. Instead of recomputing attention for all previous tokens at every step, intermediate attention states are cached and reused, ensuring decoding cost grows linearly rather than quadratically with conversation length.
This design allows the assistant to maintain consistent latency even as context grows, which is critical for on-device AI under mobile memory and thermal constraints.
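A toy, single-head illustration of the idea (real engines such as llama.cpp keep per-layer, multi-head caches in native memory; this sketch only shows where the linear cost comes from):

```kotlin
import kotlin.math.exp
import kotlin.math.sqrt

// Toy single-head attention cache showing where the linear decode cost comes
// from. Real engines (e.g. llama.cpp) keep per-layer, multi-head caches in
// native memory; this is only a conceptual sketch.
class KvCache(private val dim: Int) {
    private val keys = mutableListOf<FloatArray>()
    private val values = mutableListOf<FloatArray>()

    // Called once per generated token: cache its key/value projections instead
    // of recomputing every earlier token's projections on the next step.
    fun append(key: FloatArray, value: FloatArray) {
        keys.add(key)
        values.add(value)
    }

    // Attend the newest query against all cached keys: a single pass over the
    // history, so each step costs O(context) rather than O(context^2) overall.
    fun attend(query: FloatArray): FloatArray {
        if (keys.isEmpty()) return FloatArray(dim)
        val scores = keys.map { k -> dot(query, k) / sqrt(dim.toDouble()) }
        val maxScore = scores.maxOrNull() ?: 0.0
        val weights = scores.map { exp(it - maxScore) }
        val norm = weights.sum()
        val out = FloatArray(dim)
        values.forEachIndexed { i, v ->
            val w = (weights[i] / norm).toFloat()
            for (d in 0 until dim) out[d] += w * v[d]
        }
        return out
    }

    private fun dot(a: FloatArray, b: FloatArray): Double =
        a.indices.sumOf { i -> (a[i] * b[i]).toDouble() }
}
```

Each decode step appends exactly one key/value pair and attends once over the history, so per-step cost grows with context length rather than with its square.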
Arm Optimizations:
- Enabled ARM NEON SIMD instructions for 4-8x faster matrix operations
- Compiled with `-O3 -ffast-math -march=armv8-a+simd` flags (see the build fragment after this list)
- Disabled unnecessary features (CURL, OpenMP) to reduce binary size and battery drain
- Implemented greedy sampling for fastest token generation
- Optimized context window (2048 tokens) for mobile memory constraints
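These build settings could be wired up roughly like this in the Gradle Kotlin DSL. The llama.cpp CMake option names differ between versions, so treat them as placeholders rather than the project's exact configuration:

```kotlin
// Illustrative build.gradle.kts fragment, not the project's exact file.
android {
    defaultConfig {
        ndk {
            abiFilters += "arm64-v8a"      // arm64 only: every code path can assume NEON/ASIMD
        }
        externalNativeBuild {
            cmake {
                cppFlags += listOf("-O3", "-ffast-math", "-march=armv8-a+simd")
                arguments += listOf(
                    "-DLLAMA_CURL=OFF",    // placeholder: no network code in the native library
                    "-DGGML_OPENMP=OFF"    // placeholder: rely on the engine's own thread pool
                )
            }
        }
    }
    externalNativeBuild {
        cmake { path = file("src/main/cpp/CMakeLists.txt") }
    }
}
```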
Key Technical Decisions:
- Built exclusively for `arm64-v8a` to maximize Arm-specific optimizations
- Used streaming token generation for responsive UX even on slower devices (sketched after this list)
- Implemented overlay service for system-wide floating assistant
- Created abstraction layer to support multiple inference engines (GGUF now, ExecuTorch ready)
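The streaming decision can be exposed to Compose as a cold `Flow` fed by a native callback. The JNI entry points below (`nativeGenerate`, `nativeStop`) and the listener interface are hypothetical stand-ins for the real bindings:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.flowOn

// Hypothetical callback the native layer drives once per decoded token.
interface TokenListener {
    fun onToken(token: String)
    fun onDone()
}

// Hypothetical JNI bindings; assumed to start decoding on a native thread
// and return immediately. The real project's entry points may differ.
private external fun nativeGenerate(prompt: String, listener: TokenListener)
private external fun nativeStop()

// Streaming generation as a cold Flow: the UI collects it and appends tokens
// to the visible message as they arrive, so the app feels responsive even
// when the device only manages a few tokens per second.
fun generateStream(prompt: String): Flow<String> = callbackFlow {
    nativeGenerate(prompt, object : TokenListener {
        override fun onToken(token: String) { trySend(token) }
        override fun onDone() { close() }
    })
    awaitClose { nativeStop() }   // collector finished or cancelled: stop decoding
}.flowOn(Dispatchers.Default)
```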
Challenges we ran into
Memory Management: LLMs are memory-hungry. We had to carefully manage native heap allocation and implement proper cleanup to prevent OOM crashes on devices with limited RAM.
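One way to make that cleanup discipline explicit is to keep the native context behind a single handle in an `AutoCloseable` wrapper (class, function, and library names here are hypothetical):

```kotlin
// Hypothetical wrapper illustrating the cleanup discipline described above:
// the native llama.cpp context lives behind one handle and is always freed,
// either on model switch or when the session is closed.
class NativeModelSession private constructor(private var handle: Long) : AutoCloseable {

    companion object {
        init { System.loadLibrary("mobile_ai_assistant") }  // hypothetical .so name

        fun open(modelPath: String, contextTokens: Int = 2048): NativeModelSession {
            val handle = nativeLoadModel(modelPath, contextTokens)
            require(handle != 0L) { "Failed to load model: $modelPath" }
            return NativeModelSession(handle)
        }

        @JvmStatic private external fun nativeLoadModel(path: String, ctx: Int): Long
        @JvmStatic private external fun nativeFree(handle: Long)
    }

    override fun close() {
        if (handle != 0L) {
            nativeFree(handle)   // release the native heap before the handle is lost
            handle = 0L
        }
    }
}
```

Wrapping generation in `use { }` then guarantees the native memory is released even if something throws, and switching models is a `close()` followed by a fresh `open()`.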
JNI Complexity: Bridging Kotlin and C++ through JNI was tricky. Debugging crashes in native code required careful logging and understanding of both memory models.
Build System: Getting CMake, NDK, and Gradle to play nicely together with llama.cpp as a subproject took significant configuration effort.
Model Compatibility: Different GGUF models have different requirements. We had to handle various quantization formats and ensure graceful failures for incompatible models.
Overlay Permissions: Android's overlay permission system is complex. Making the floating assistant work reliably across different Android versions and OEM skins required extensive testing.
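The core of that permission flow is the standard `Settings.canDrawOverlays` check plus a redirect to the system settings screen; the part that needed per-OEM testing is everything around it:

```kotlin
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.provider.Settings

// The floating assistant can only be shown once the user grants the
// "draw over other apps" permission; this is the standard check/request flow.
fun ensureOverlayPermission(context: Context): Boolean {
    if (Settings.canDrawOverlays(context)) return true
    val intent = Intent(
        Settings.ACTION_MANAGE_OVERLAY_PERMISSION,
        Uri.parse("package:${context.packageName}")
    ).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    context.startActivity(intent)   // user toggles the permission in system settings
    return false
}
```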
Thermal Throttling: Sustained inference causes devices to heat up and throttle. We implemented resource monitoring to help users understand performance variations.
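On API 29+ the OS exposes its thermal status directly, which is a cheap signal to fold into the resource monitor (a sketch, not the app's exact code): when the status reaches MODERATE or worse, the UI can explain why tokens/second just dropped.

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Sketch: report the OS thermal status (API 29+) so the UI can explain
// performance dips during sustained inference.
fun watchThermalStatus(context: Context, onStatus: (Int) -> Unit) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    onStatus(pm.currentThermalStatus)                       // current status, e.g. THERMAL_STATUS_MODERATE
    pm.addThermalStatusListener { status -> onStatus(status) }
}
```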
Accomplishments that we're proud of
True Model Flexibility: We achieved our core vision — users can load ANY compatible GGUF model, giving them unprecedented control over their mobile AI experience
Production-Quality UX: The app feels polished with streaming responses, smooth animations, and intuitive controls — not just a tech demo
System-Wide Integration: The floating assistant and text selection features bring AI to every app on the phone, not just our own
Performance on Arm: Achieved 8-15 tokens/second on modern Arm devices with proper NEON optimizations — fast enough for real conversations
Zero Cloud Dependencies: The app works completely offline with no analytics, tracking, or data collection — true privacy
Clean Architecture: The multi-engine abstraction means we can easily add new inference backends (ExecuTorch, ONNX) in the future
What we learned
Arm NEON is powerful: Properly leveraging SIMD instructions makes a massive difference in inference speed — it's not just a checkbox feature
Mobile constraints require creativity: Limited RAM, thermal throttling, and battery concerns forced us to think differently about LLM deployment
User control matters: Giving users the ability to choose their own models based on their device creates a much better experience than one-size-fits-all
Native code is worth it: The performance gains from C++/JNI integration far outweigh the development complexity for compute-intensive tasks
Privacy can be a feature: In an era of cloud AI, running everything locally is a genuine differentiator that users appreciate
What's next for Mobile-AI-Assistant
Voice Input/Output: On-device speech recognition and text-to-speech for hands-free conversations
In-App Model Browser: Download and manage models directly from HuggingFace without leaving the app
RAG Support: Load documents and PDFs for context-aware Q&A — your personal knowledge base
Conversation History: Multiple chat threads with persistent history and search
Prompt Templates: Pre-built prompts for common tasks (summarization, translation, code review)
Home Screen Widget: Quick access to AI without opening the full app
Wear OS Companion: Voice-activated AI assistant on your smartwatch
More Inference Engines: Full ExecuTorch integration, ONNX Runtime support for even more model options
Built with ❤️ for the Arm AI Developer Challenge 2025