Voice Notes

Inspiration

Privacy concerns with cloud-based voice assistants and note-taking apps motivated me to explore on-device AI. I wondered: could I build a fully functional voice notes app that never sends data to the cloud? Starting with the whisper.cpp Android example, I set out to prove that powerful AI features like transcription, summarization, and intelligent Q&A could run entirely offline on a smartphone.

What it does

Voice Notes is a privacy-first Android app that transforms spoken words into intelligent, searchable notes using 100% on-device AI:

  • Transcribes speech to text with timestamps using whisper.cpp
  • Generates summaries of your recordings using a local LLM (Gemma 3 1B)
  • Answers questions about your transcriptions using RAG (Retrieval-Augmented Generation)
  • Searches all notes semantically using text embeddings
  • Plays back audio with seekable controls and timestamp navigation

Everything runs offline after the initial model download: no internet required, no cloud processing, complete privacy.

How we built it

Tech Stack

  • Kotlin + Jetpack Compose for modern Android UI with Material 3 design
  • whisper.cpp compiled as a native library (via JNI) for efficient speech recognition
  • Google AI Edge (MediaPipe) for on-device LLM inference (Gemma 3 1B INT4 quantized)
  • ONNX Runtime for text embeddings (all-MiniLM-L6-v2, 384 dimensions)
  • Room Database for local storage with Flow-based reactive updates (see the sketch after this list)
  • Kotlin Coroutines for async operations without blocking the UI
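
As a concrete example of the Room + Flow piece, here is a minimal sketch; the `Note` entity and `NoteDao` below are illustrative stand-ins, not the app's actual schema:

```kotlin
import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.PrimaryKey
import androidx.room.Query
import kotlinx.coroutines.flow.Flow

// Hypothetical entity: one row per recording, holding its transcription and summary.
@Entity(tableName = "notes")
data class Note(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val audioPath: String,
    val transcription: String,
    val summary: String? = null,
    val createdAt: Long = System.currentTimeMillis()
)

@Dao
interface NoteDao {
    // Returning Flow gives Compose screens reactive updates on every change.
    @Query("SELECT * FROM notes ORDER BY createdAt DESC")
    fun observeAll(): Flow<List<Note>>

    @Insert
    suspend fun insert(note: Note): Long
}
```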

Key Implementation

The RAG system splits long transcriptions into 1500-character chunks, generates an embedding for each chunk, and retrieves the four most relevant chunks by cosine similarity before passing them to the LLM as context. Capping the assembled context at 8000 characters (~2000-2500 tokens) keeps the prompt small enough to prevent OOM crashes on mobile devices.
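
A minimal sketch of that retrieval step is below; the function names and the `embed` callback (standing in for the ONNX Runtime all-MiniLM-L6-v2 call) are illustrative assumptions, not the app's exact code:

```kotlin
import kotlin.math.sqrt

const val CHUNK_SIZE = 1500        // characters per chunk
const val TOP_K = 4                // chunks retrieved per question
const val MAX_CONTEXT_CHARS = 8000 // hard cap to avoid OOM on-device

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Chunk the transcription, embed each chunk, rank chunks by similarity
// to the question, and concatenate the top 4 as LLM context.
fun buildContext(
    transcription: String,
    question: String,
    embed: (String) -> FloatArray // 384-dimension embedding function
): String {
    val queryVec = embed(question)
    return transcription.chunked(CHUNK_SIZE)
        .map { chunk -> chunk to cosineSimilarity(embed(chunk), queryVec) }
        .sortedByDescending { it.second }
        .take(TOP_K)
        .joinToString("\n\n") { it.first }
        .take(MAX_CONTEXT_CHARS)
}
```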

Challenges we ran into

Finding an LLM that works on edge devices: The biggest challenge was finding an LLM that could run efficiently on older devices like the Galaxy S20 while still producing useful output. I tested multiple models and configurations, comparing inference speed, memory consumption, and output quality, and balancing model capability against device constraints took extensive experimentation. Gemma 3 1B INT4 quantized proved to be the sweet spot: small enough to fit in memory with aggressive chunking, yet capable enough to generate meaningful summaries and answer questions about transcriptions on resource-constrained hardware.
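
For reference, loading a model through the Google AI Edge (MediaPipe) LLM Inference API looks roughly like the sketch below; the model filename, storage location, and token limit are assumptions for illustration:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: run one summarization pass against the on-device model.
// The .task bundle path is hypothetical; the app would place the
// quantized Gemma model in app-private storage after download.
fun summarize(context: Context, transcription: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath(context.filesDir.resolve("gemma3-1b-it-int4.task").absolutePath)
        .setMaxTokens(1024) // keep the context window modest on older devices
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    val summary = llm.generateResponse("Summarize this voice note:\n$transcription")
    llm.close() // release native resources promptly
    return summary
}
```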

Accomplishments that we're proud of

  • Fully functional RAG on mobile — Semantic retrieval with embeddings for intelligent Q&A
  • Zero network dependencies — Everything runs offline after the initial model download
  • Clean UI/UX — Material 3 design with waveform visualization and smooth animations
  • Production-ready — Edge-case coverage, proper error handling, persistent storage
  • Extended whisper.cpp — Turned a simple example into a complete app with a database, an LLM, and RAG

What we learned

  • Quantization is essential for mobile — INT4 models make LLMs practical on phones
  • RAG solves context limits — Semantic retrieval is more effective than truncation for long text
  • On-device AI is viable — Modern Android devices can run surprisingly capable AI models
  • JNI memory limits are real — Mobile apps need aggressive memory optimization for large models
  • Jetpack Compose is powerful — Building complex, stateful UIs is much cleaner than with XML layouts

What's next for Voice Notes

  • Speaker diarization — Identify different speakers in conversations
  • Continuous recording mode — For meetings and lectures with automatic chunking
  • Export/backup — Share notes as text, audio, or combined PDF
  • Smaller models — Experiment with distilled versions for faster inference
  • Multi-language support — Leverage Whisper's multilingual capabilities
  • Voice commands — "Summarize last recording" without touching the screen
