Inspiration

We’ve all dreamed of having our own “JARVIS” ever since watching Iron Man. Today, with Whisper’s state-of-the-art speech recognition and Gemini’s powerful reasoning API, that vision is within reach.

What It Does

SONA awakens at the sound of “Hey SONA,” instantly transcribes your voice with Whisper, and routes your query through our MCP framework to a conversational AI (Gemini or Sesame). Replies stream back in real time via a familiar chat-style interface—what you say, think, and do, all in one seamless voice loop.

# SONA in three steps:
# 1) listen ("Hey SONA")
# 2) think (Whisper → Gemini)
# 3) do (respond via chat UI)

How We Built It

In our React/Next.js frontend, an AudioWorklet captures 16 kHz PCM while Picovoice Porcupine runs in-browser (WASM) for sub-10 ms wake-word detection. Raw audio buffers then stream over WebSockets to our Flask-SocketIO service, where whisper.cpp provides live transcription. From there, our FastMCP JSON-RPC framework orchestrates tool-enabled reasoning with Google Gemini or Sesame, and the AI's replies are pushed back to the client for display in a chat-style UI, complete with seamless on-the-fly model switching.
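
The server side of that path can be sketched roughly as follows. This is a minimal stand-in, not our production code: the names (`AudioChunker`, `on_audio_packet`, `CHUNK_SECONDS`) are illustrative, and the transcriber is left as an injected callable.

```python
# Sketch of the server-side audio path: buffer incoming 16 kHz,
# 16-bit mono PCM packets from the WebSocket until enough samples
# accumulate for one transcription chunk. Names are illustrative.

SAMPLE_RATE = 16_000          # samples per second
BYTES_PER_SAMPLE = 2          # 16-bit PCM
CHUNK_SECONDS = 1.0           # flush roughly once per second

class AudioChunker:
    """Accumulates raw PCM bytes and yields fixed-size chunks."""

    def __init__(self, chunk_seconds: float = CHUNK_SECONDS):
        self._buf = bytearray()
        self._chunk_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_seconds)

    def feed(self, pcm: bytes):
        """Add one network packet; yield every complete chunk it fills."""
        self._buf.extend(pcm)
        while len(self._buf) >= self._chunk_bytes:
            chunk = bytes(self._buf[: self._chunk_bytes])
            del self._buf[: self._chunk_bytes]
            yield chunk

chunker = AudioChunker()

def on_audio_packet(packet: bytes, transcribe):
    """Socket.IO handler body: transcribe each completed chunk."""
    return [transcribe(chunk) for chunk in chunker.feed(packet)]
```

In the real service the handler would emit each transcript back over the socket instead of returning a list.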

Technology Stack

Frontend: React & Next.js (TypeScript), Web Audio API / AudioWorklet (16 kHz PCM capture), Picovoice Porcupine (WASM wake word), Socket.IO client, chat UI
Backend: Python 3, Flask & Flask-SocketIO (eventlet), whisper.cpp, FFmpeg, FastMCP JSON-RPC framework, Google Gemini & Sesame APIs, Kōkoro text-to-speech
Infrastructure & DevOps: Docker & Docker Compose, PostgreSQL, Nginx (SSL termination & proxy), environment-variable configuration

Challenges We Faced

Building SONA meant venturing into uncharted full-stack AI territory. Our biggest hurdles were:

  • Taming Whisper latency
    We tuned chunk sizes, FFmpeg flags, and parallel subprocesses to keep transcription under one second without overloading the CPU.

  • Aligning wake-word callbacks
    Coordinating Picovoice Porcupine’s WASM callbacks with our AudioWorklet capture pipeline took careful timestamp alignment to ensure “Hey SONA” fired exactly when expected.

  • Rolling out our MCP framework
    Designing a Model Context Protocol (MCP) server and its clients was entirely new to us: we defined JSON-RPC “tools,” managed asynchronous calls, and plugged them into the Gemini and Sesame endpoints.

  • Gluing front end and back end
    Stitching together AudioWorklet, WebSocket streaming, Flask routing, whisper.cpp, MCP calls, and React state in one cohesive codebase tested every layer of our stack.
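
The chunk-size and parallel-subprocess tuning from the first hurdle can be sketched like this. The whisper.cpp binary path and flags below are assumptions about a typical invocation, not our exact tuning, and the worker is injectable so the orchestration can be exercised without the binary.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of the "parallel subprocesses" idea: transcribe audio chunks
# concurrently so one slow chunk doesn't stall the live stream.

def transcribe_chunk(wav_path: str,
                     binary: str = "./whisper.cpp/main",
                     model: str = "models/ggml-base.en.bin") -> str:
    """Run whisper.cpp on one chunk file (paths/flags are assumptions)."""
    cmd = [binary, "-m", model, "-f", wav_path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()

def transcribe_all(paths, worker=transcribe_chunk, max_workers=2):
    """Transcribe several chunks in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

Capping `max_workers` is what keeps the CPU from being overloaded while still hiding per-chunk latency.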

Overcoming these challenges gave us deep, end-to-end insight into real-time audio processing, wake-word engineering, and multi-model AI orchestration.
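
The JSON-RPC “tools” at the heart of the MCP work can be illustrated with a stripped-down dispatcher. This is not FastMCP's actual API; the registry, decorator, and the `get_time` tool are hypothetical stand-ins for the pattern.

```python
import json

# Simplified illustration of MCP-style tool registration and dispatch.
TOOLS = {}

def tool(name):
    """Register a callable under a tool name (decorator)."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_time")
def get_time(city: str) -> str:
    # Stub; a real tool would query a timezone service.
    return f"Time lookup for {city} not implemented"

def handle_rpc(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request to a registered tool."""
    req = json.loads(raw)
    fn = TOOLS.get(req["method"])
    if fn is None:
        error = {"code": -32601, "message": "Method not found"}
        resp = {"jsonrpc": "2.0", "id": req.get("id"), "error": error}
    else:
        result = fn(**req.get("params", {}))
        resp = {"jsonrpc": "2.0", "id": req.get("id"), "result": result}
    return json.dumps(resp)
```

The model's side of the loop is the inverse: Gemini or Sesame emits a tool call, the server dispatches it, and the result is fed back into the conversation.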

Accomplishments

We built a working prototype of a voice-driven AI assistant that feels fluid and responsive. The real-time transcription, model-switching capability, and chat UI all came together smoothly, giving users a futuristic voice interface powered by state-of-the-art AI.

What We Learned

We gained hands-on experience with every layer of a real-time audio pipeline, from tuning wake-word detection and whisper.cpp transcription to orchestrating multiple AI models through our MCP framework. We also learned an architectural lesson: streaming raw audio over the network for backend reformatting introduced avoidable complexity.

What's Next for SONA

Next, we plan to add semantic search over long-term memory in PostgreSQL, integrate audio-based emotion analysis, and build tighter native tooling support on macOS and Windows. Our goal is to evolve SONA into the versatile, JARVIS-style assistant these platforms still lack.
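
The planned semantic search can be sketched in miniature. In production the vectors would live in PostgreSQL (likely via the pgvector extension); here we just rank stored memories by cosine similarity against a query embedding, with made-up 3-dimensional vectors standing in for real embeddings.

```python
import math

# Toy sketch of semantic search over long-term memory: rank stored
# (text, embedding) pairs by cosine similarity to a query embedding.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(memories, query_vec, top_k=1):
    """Return the top_k (text, score) pairs most similar to the query."""
    scored = [(text, cosine(vec, query_vec)) for text, vec in memories]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With pgvector the ranking step would become a SQL `ORDER BY` over a vector distance operator instead of an in-memory sort.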

Dataflow

Presentation Link

https://drive.google.com/file/d/16t8T9Q4PXQDQzQVevMKUq1KGUs_X5_By/view?usp=sharing
