Inspiration

As AI becomes more common, two issues still hold people back: privacy and reliance on the cloud. Most assistants send personal data to remote servers and stop working without a connection. We wanted to change that. Modern phones are incredibly powerful, so why shouldn’t they run their own intelligent assistant?

PrometheusAI started with one goal: give every user a private, offline “Second Brain” that stays on their device, works anywhere, and responds instantly.

What It Does

PrometheusAI is a fully offline Android assistant powered by modern Large Language Models running on-device.

Key features:

Offline Chat: Natural, multi-turn conversations without any connection. Thinking Mode: A reasoning workflow that breaks down complex prompts step by step before answering. Privacy Vault: All inference happens on the phone, keeping sensitive queries secure. Low Latency: No network delay, delivering fast token generation.

How We Built It

The app is written in Kotlin with Jetpack Compose for a smooth, reactive interface. The intelligence layer uses ONNX Runtime GenAI, optimized for Arm devices.

Technical highlights:

Arm NEON Acceleration: Uses SIMD instructions to speed up the transformer’s matrix operations. Int8 Quantization: Reduces model size and memory use while preserving quality, allowing multi-billion-parameter models to run on a phone. Heterogeneous Compute: Heavy inference tasks run on the high-performance cores, while UI tasks stay responsive on the efficiency cores.

Stack:

Language: Kotlin UI: Jetpack Compose Inference: ONNX Runtime GenAI Models: Qwen 2.5 and Qwen 3 (converted to ONNX) Architecture: MVVM with Coroutines for streaming inference

Challenges

Memory Constraints: Large models pushed devices into OOM errors. We tuned Gradle heap settings and adjusted ONNX session options to keep memory stable. Thermal Throttling: Long inference runs generated heat, so we balanced speed with device thermal limits. Model Conversion: Getting Qwen models into ONNX with proper tokenization required manual adjustments and careful handling of special tokens.

Accomplishments

Running a modern LLM fully offline on a phone. Implementing a functioning reasoning mode that elevates responses beyond simple queries. Delivering a smooth UI where users can scroll and type while the model streams tokens in real time.

What We Learned

Modern Arm chips are far more capable than expected when paired with NEON optimizations and quantization. Int8 quantization is a powerful tool for edge AI, offering strong performance with minimal resource use. On-device AI will play a huge role in the future of personal assistants.

What’s Next

NPU Support: Offload inference to dedicated neural units for efficiency and speed. Multimodal Input: Add on-device vision so users can chat about images stored locally. RAG: Enable the assistant to read local documents and notes for more personalized responses without leaving the device.

Built With

Share this project:

Updates