Project Story: DevVoice
Inspiration
I'm someone who talks to their home. Lights, plugs, music: I want it all controlled by voice, naturally, the way I actually speak.
The problem? Every mainstream voice assistant (Alexa, Google Home, Siri) is built for one language, one accent, one culture. When I say "plug band karo" (Hindi for "turn off the plug"), they stare blankly. When I switch mid-sentence between Hindi and English the way most Indians do daily, they give up entirely.
I also didn't want a giant cloud box listening to everything in my home, phoning data back to a corporation. I wanted something mine: running on my own machine, talking to my own devices, in my own words.
That's what sparked Dev, a personal voice assistant that understands how I speak, not how Silicon Valley thinks I should.
How I Built It
The Pipeline
Every voice command travels through a carefully orchestrated pipeline:
Microphone → VAD → STT → Wake Detection → LLM Intent → Action → TTS → Speaker
Each stage was built and integrated independently:
1. Voice Activity Detection (VAD)
I used WebRTC VAD, the voice activity detector from Google's open-source WebRTC project, to detect when the user is actually speaking. It classifies short audio frames (10, 20, or 30 ms) as speech or non-speech based on the energy in several frequency sub-bands.
The energy-based silence gate:
$$E = \frac{1}{N} \sum_{i=1}^{N} x_i^2$$
where $x_i$ are the audio samples and $N$ is the frame size (typically 480 samples, i.e. 30 ms at 16 kHz). Frames with $E$ below a tuned threshold are discarded before anything is sent to the STT API, saving both latency and API costs.
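A minimal sketch of this gate, assuming 16-bit mono PCM frames in native byte order with samples scaled to [-1, 1]. The function names and the threshold value are illustrative, not taken from the project:

```python
import array
import math

FRAME_SAMPLES = 480  # 30 ms at 16 kHz, as in the text


def frame_energy(pcm_bytes: bytes) -> float:
    """Mean squared amplitude E of one 16-bit mono PCM frame."""
    samples = array.array("h")  # signed 16-bit, native byte order (assumption)
    samples.frombytes(pcm_bytes)
    if not samples:
        return 0.0
    return sum((s / 32768.0) ** 2 for s in samples) / len(samples)


def is_loud_enough(pcm_bytes: bytes, threshold: float = 1e-4) -> bool:
    # threshold is a hypothetical value; it has to be tuned per microphone
    return frame_energy(pcm_bytes) >= threshold


# Two synthetic frames: pure silence and a 440 Hz tone
silence = array.array("h", [0] * FRAME_SAMPLES).tobytes()
tone = array.array(
    "h",
    [int(8000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(FRAME_SAMPLES)],
).tobytes()
```

Only frames for which `is_loud_enough` returns true would go on to the VAD and STT stages.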
2. Speech-to-Text (Sarvam AI)
I chose Sarvam AI's saarika:v2.5 model specifically because it's trained on Indian languages and accents. It handles Hindi, transliterated Hindi (Hinglish), and Indian-accented English far better than generic STT APIs.
3. Wake Word Detection
A lightweight regex and string-matching layer checks for phrases like "hi dev", "hey there", and "hi there" before invoking the expensive LLM. This keeps the system efficient: the LLM only runs when the user is actually addressing Dev.
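One way such a layer could look, as a sketch only. It assumes the wake phrase must open the transcript and that the rest of the sentence is the command; `split_wake` and the exact pattern are hypothetical:

```python
import re

WAKE_PHRASES = ["hi dev", "hey there", "hi there"]  # phrases quoted in the text


def _phrase_pattern(phrase: str) -> str:
    # Allow any amount of whitespace between the words of a phrase
    return r"\s+".join(re.escape(word) for word in phrase.split())


_WAKE_RE = re.compile(
    r"^\W*(?:" + "|".join(_phrase_pattern(p) for p in WAKE_PHRASES) + r")\b[\s,.!?]*",
    re.IGNORECASE,
)


def split_wake(transcript: str):
    """Return (woke, command): whether a wake phrase opened the transcript, and the rest."""
    text = transcript.strip()
    match = _WAKE_RE.match(text)
    if match is None:
        return False, text
    return True, text[match.end():].strip()
```

With this sketch, `split_wake("Hi Dev, plug band karo")` yields `(True, "plug band karo")`, while a transcript without a wake phrase never reaches the LLM.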
4. Intent Classification (Google Gemini API)
The heart of the system. I engineered a detailed system prompt that classifies transcribed speech into one of 9 structured intent types, returning strict JSON:
$$\text{intent} = \arg\max_{t \in \mathcal{T}} \, P(t \mid \text{transcript}, \text{context})$$
where $\mathcal{T} = \{\text{control},\ \text{reminder},\ \text{note\_add},\ \text{note\_read},\ \text{calendar\_list},\ \text{query},\ \text{music},\ \text{chat},\ \text{unknown}\}$.
Using gemini-2.0-flash gave me fast, cheap, and highly accurate intent parsing: no GPU, no local 17 GB model, just an API call.
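Even with a strict prompt, it helps to validate the model's reply before acting on it. The sketch below assumes the reply is a JSON object with an `intent` field (the field name and `parse_intent` are assumptions) and falls back to `unknown` for anything outside the nine types:

```python
import json

INTENTS = {
    "control", "reminder", "note_add", "note_read", "calendar_list",
    "query", "music", "chat", "unknown",
}
FENCE = "`" * 3  # markdown code fence the model might wrap around its JSON


def parse_intent(raw: str) -> dict:
    """Parse an LLM reply into an intent dict, tolerating a stray markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[4:]
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"intent": "unknown"}
    if not isinstance(data, dict):
        return {"intent": "unknown"}
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in INTENTS:
        return {"intent": "unknown"}
    return data
```

Routing everything unparseable to `unknown` keeps a malformed reply from ever triggering a device action.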
5. Action Routing
A clean router in server.py dispatches each intent to the right service: the Tuya local API for smart plugs, yt-dlp for music, the Google Calendar API for reminders, wttr.in for weather, and DuckDuckGo for web search.
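A dispatch table is one common way to build such a router. This is a sketch with stub handlers, not the actual contents of server.py; every handler name and intent field here is hypothetical:

```python
from typing import Callable, Dict


def control_plug(intent: dict) -> str:
    # A real handler would call the Tuya local API here
    return f"plug {intent.get('state', 'on')}"


def play_music(intent: dict) -> str:
    # A real handler would start yt-dlp here
    return f"playing {intent.get('query', 'something')}"


def fallback(intent: dict) -> str:
    return "Sorry, I didn't catch that."


ROUTES: Dict[str, Callable[[dict], str]] = {
    "control": control_plug,
    "music": play_music,
}


def dispatch(intent: dict) -> str:
    """Look up the handler for the intent type; unknown types hit the fallback."""
    handler = ROUTES.get(intent.get("intent", "unknown"), fallback)
    return handler(intent)
```

Adding a new capability then means writing one handler and registering it in `ROUTES`, without touching the dispatch logic.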
6. Text-to-Speech (Sarvam AI)
Responses are synthesized using Sarvam AI's bulbul:v2, which produces natural Indian-accented speech and supports Hindi text natively.
7. Real-time Frontend
A React + Vite dashboard shows the live state (idle → listening → thinking → speaking) over a WebSocket. The animated voice orb gives the assistant a visual presence.
What I Learned
Prompt Engineering is Everything
Getting Gemini to return only valid JSON (no markdown fences, no explanations, no apologies) required iterative prompt refinement. The key insight: show, don't tell. Filling the prompt with 20+ concrete examples was far more effective than writing rules in prose.
API Design Matters for Latency
Running STT, LLM, and TTS sequentially means latency compounds. I used asyncio with run_in_executor to move the blocking API calls off the event loop so the FastAPI server stays non-blocking, and WebSocket state broadcasts to keep the frontend feeling responsive even when the pipeline takes 2 to 3 seconds end-to-end.
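The pattern can be sketched as follows. The three `blocking_*` functions are placeholders that just sleep in place of real HTTP calls, and `broadcast` stands in for the WebSocket state push; none of these names come from the project:

```python
import asyncio
import time


def blocking_stt(audio: bytes) -> str:
    time.sleep(0.05)  # stands in for a blocking HTTP request
    return "plug band karo"


def blocking_llm(transcript: str) -> dict:
    time.sleep(0.05)
    return {"intent": "control", "reply": "Plug off"}


def blocking_tts(text: str) -> bytes:
    time.sleep(0.05)
    return text.encode("utf-8")  # placeholder for synthesized audio


async def handle_utterance(audio: bytes, broadcast) -> bytes:
    """Run the three blocking stages in the default thread pool, one after another."""
    loop = asyncio.get_running_loop()
    await broadcast("thinking")
    transcript = await loop.run_in_executor(None, blocking_stt, audio)
    intent = await loop.run_in_executor(None, blocking_llm, transcript)
    await broadcast("speaking")
    speech = await loop.run_in_executor(None, blocking_tts, intent["reply"])
    await broadcast("idle")
    return speech


async def demo():
    states, ticks = [], 0

    async def broadcast(state: str) -> None:
        states.append(state)  # a real server would send this over the WebSocket

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1  # only advances if the event loop is not blocked

    beat = asyncio.create_task(heartbeat())
    speech = await handle_utterance(b"", broadcast)
    beat.cancel()
    return states, ticks, speech
```

Running `asyncio.run(demo())` shows the heartbeat still ticking while the stages sleep, which is the property that keeps the server able to serve other requests. The stages themselves remain sequential, so total latency is unchanged.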
Bilingual NLP is Hard
Hindi text from Sarvam sometimes arrives in Devanagari (गाना बंद करो), sometimes in transliterated Latin (gana band karo), and sometimes mixed. Gemini handles this remarkably well, but only if your prompt explicitly shows it both forms with real examples.
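To see which form a given transcript takes (for logging, or for choosing prompt examples), a small script check is enough. This helper is an illustration and not part of the project, which relies on the prompt examples alone:

```python
def detect_script(text: str) -> str:
    """Classify text as 'devanagari', 'latin', 'mixed', or 'none'."""
    # U+0900 to U+097F is the Unicode Devanagari block
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "none"
```

Counting how often each label occurs in real transcripts is a cheap way to check that the prompt's examples cover all three cases.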
Local IoT Integration
TinyTuya communicates with smart plugs over the LAN using an encrypted protocol, with no Tuya cloud required. Getting the device's local_key involved reverse-engineering the Tuya mobile app pairing flow, a rabbit hole worth going down for the privacy win.
Challenges
1. Wake Word False Positives
Sarvam AI mishears English words in surprising ways. "Hey there" was getting transcribed as "high there" or "hi dear". Solution: maintain a list of phonetically similar mishearing variants in config.py and match against all of them.
2. VAD Tuning
WebRTC VAD's aggressiveness (0 to 3) needed careful tuning. Set too high, it cuts off the end of commands; set too low, it lets silence frames through to STT, wasting API quota. I landed on aggressiveness=2 with a trailing silence buffer of 0.8 seconds.
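The trailing silence buffer can be sketched independently of the VAD itself. The function below takes one boolean per 30 ms frame (in the real pipeline these would come from something like `webrtcvad.Vad(2).is_speech(frame, 16000)`) and reports where the utterance ends; `utterance_end` is a hypothetical name:

```python
FRAME_MS = 30              # frame length used with the VAD
TRAILING_SILENCE_MS = 800  # the 0.8 s buffer from the text


def utterance_end(speech_flags, frame_ms=FRAME_MS, trailing_ms=TRAILING_SILENCE_MS):
    """Return the index of the frame that completes the trailing silence, or None."""
    needed = -(-trailing_ms // frame_ms)  # ceiling division: 27 frames for 800 ms
    started = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            started = True
            silent_run = 0  # any speech resets the silence counter
        elif started:
            silent_run += 1
            if silent_run >= needed:
                return index
    return None
```

Because the counter resets on every speech frame, short pauses inside a command do not end the recording; only 27 consecutive non-speech frames do.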
3. Switching from Ollama to Gemini API
The original architecture used a local Ollama + Gemma 4 26B setup (~17 GB model, requiring a powerful Mac). While capable, it was inaccessible for most users. Migrating to the Google Gemini API made the assistant:
- Accessible on any machine (no GPU needed)
- Faster (an API round trip took less time than local model inference)
- Cheaper (the Gemini Flash free tier is generous)
The refactor required abstracting the LLM interface cleanly: a single _gemini_chat() function replaced all httpx calls to Ollama's local REST API.
4. Music Playback State
Knowing whether music is currently playing, without blocking the async server, required spawning yt-dlp + afplay as a subprocess and polling proc.poll(). Keeping wake-word detection alive during music (so users can say "stop" without the wake phrase) needed a music-aware override in the wake state logic.
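Both ideas fit in a few lines. In this sketch `DEMO_CMD` is a stand-in for the real yt-dlp + afplay command so the code runs anywhere, and `MusicPlayer` and `should_handle` are hypothetical names:

```python
import subprocess
import sys
from typing import List, Optional

# Stand-in for the real playback command: a child process that idles for 30 s
DEMO_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]


class MusicPlayer:
    def __init__(self) -> None:
        self._proc: Optional[subprocess.Popen] = None

    def start(self, cmd: List[str]) -> None:
        self.stop()
        self._proc = subprocess.Popen(cmd)  # returns immediately, does not block

    def is_playing(self) -> bool:
        # poll() returns None while the child process is still running
        return self._proc is not None and self._proc.poll() is None

    def stop(self) -> None:
        if self._proc is not None and self._proc.poll() is None:
            self._proc.terminate()
            self._proc.wait()
        self._proc = None


def should_handle(transcript: str, woke: bool, music_playing: bool) -> bool:
    """Accept a bare "stop" while music plays; otherwise require the wake phrase."""
    return woke or (music_playing and "stop" in transcript.lower().split())
```

Since `poll()` never waits, `is_playing()` is safe to call from the event loop on every incoming transcript.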
What's Next
- Streaming TTS: start speaking before the full audio is synthesized
- On-device wake word: a tiny TensorFlow Lite model to detect "Hey Dev" purely offline
- Multi-device support: control more than one Tuya device
- Memory: let Gemini remember context across conversation turns
- Mobile companion app: trigger Dev from your phone
Dev is proof that a personal AI assistant doesn't have to be a black box from a tech giant. It can be yours: built by you, running for you, speaking your language.
-- Basudev
Built With
- duckduckgo-search-api
- fastapi
- google-calendar-api
- google-gemini-api
- google-generativeai
- httpx
- javascript
- macos
- numpy
- python
- react
- sarvam-ai
- scipy
- sounddevice
- tinytuya
- uvicorn
- vite
- webrtc-vad
- websocket
- wttr.in
- yt-dlp