Vox

Inspiration

1.5 billion people speak English as a second language. Millions of cross-border calls fail every day — not because of technology, but because of how people sound. Doctors mishear patients. Job candidates lose interviews. Customer service calls collapse. Families across borders struggle to connect.

Current solutions like Google Translate and Microsoft Translator are turn-based: you speak, you wait, the other person hears. It feels like a walkie-talkie — unnatural and broken. We wanted to build something that feels like a real phone call, where both people speak naturally, interrupt each other, and talk over each other — just like real conversation.

We also recognized that many people need more than translation — they need someone to talk to. An empathetic companion who listens, understands, and responds in their native language. That's why we built Voxa, an AI mentor and friend powered by Gemini Live.

What it does

Vox is a real-time multilingual voice communication platform with two modes:

1. Meeting Mode (Real-Time Translation): Two people join a call. Each speaks naturally in their own language and accent. Vox sits between them, detecting the language, translating in real time, and resynthesizing each speaker's voice so that the other person hears them clearly in their own language, in the original speaker's voice. Neither person changes anything; the agent does all the work invisibly.

2. Agent Mode (Voxa AI Companion): Users can talk to Voxa, an empathetic AI companion powered by Gemini Live. Voxa acts like a mentor and friend on a late-night phone call, offering advice on life issues, encouragement when motivation is lacking, and warm, human-like conversation in the user's native language (including Nigerian languages such as Yoruba, Igbo, and Hausa).

Key Features:

🗣️ Real-time translation with no turn-taking
🎯 Full interruption handling — speak over each other naturally
😊 Emotion detection and preservation in translations
🎤 Voice cloning — hear translations in the speaker's own voice
📝 Live transcripts showing original and translated text
🌍 Support for 10+ languages including Nigerian languages (Yoruba, Igbo, Hausa)

How we built it

Backend (Python/FastAPI):

Built on the Pipecat framework for real-time, multimodal AI pipelines
Gemini 2.5 Flash Native Audio for real-time voice understanding, translation, and AI conversation
Silero VAD for voice activity detection and interruption handling
FastAPI WebSocket transport for low-latency audio streaming
Redis for session state management
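
The interruption handling mentioned above can be illustrated with plain asyncio. This is a minimal sketch, not the project's Pipecat pipeline: the function names are hypothetical, the "audio" is a list of placeholder chunks, and the voice-activity event is simulated with a timer. It shows only the core idea of cancelling in-flight output as soon as the other person starts speaking.

```python
import asyncio


async def play_translated_audio(chunks, played):
    """Hypothetical stand-in for streaming synthesized audio to a listener."""
    for chunk in chunks:
        await asyncio.sleep(0.01)  # pretend each chunk takes time to send
        played.append(chunk)


async def run_with_barge_in(chunks, interrupt_after):
    """Play chunks, but stop once a simulated speech-start event fires."""
    played = []
    playback = asyncio.create_task(play_translated_audio(chunks, played))

    # Stand-in for waiting on a real VAD "user started speaking" event.
    await asyncio.sleep(interrupt_after)

    interrupted = not playback.done()
    if interrupted:
        playback.cancel()  # drop the remaining output immediately
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected when the playback task was cancelled
    return played, interrupted


if __name__ == "__main__":
    played, interrupted = asyncio.run(run_with_barge_in(list(range(50)), 0.05))
    print(f"interrupted={interrupted}, chunks played={len(played)} of 50")
```

Because playback runs as its own task, the code that listens for speech is never blocked by audio that is still being sent.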

Frontend (React/Vite):

Real-time audio capture with 100ms chunking
WebSocket connection for bidirectional audio streaming
Live audio visualization and transcript display
TailwindCSS for responsive UI
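
To make the 100 ms chunking concrete, here is a small arithmetic sketch. The audio format is an assumption made for illustration (16 kHz, 16-bit, mono PCM); the write-up does not state the actual capture format, and the helper names are hypothetical.

```python
def chunk_size_bytes(sample_rate_hz, chunk_ms, channels=1, bytes_per_sample=2):
    """Number of bytes in one chunk of raw PCM audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * channels * bytes_per_sample


def split_into_chunks(pcm, size):
    """Split a PCM byte string into fixed-size chunks (the last may be shorter)."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


if __name__ == "__main__":
    size = chunk_size_bytes(16000, 100)   # assumed 16 kHz, 16-bit mono
    one_second = bytes(16000 * 2)         # one second of silence
    chunks = split_into_chunks(one_second, size)
    print(size, len(chunks))
```

Under these assumptions each WebSocket message carries 3200 bytes, and one second of speech is sent as ten messages. Smaller chunks lower the delay before the first audio arrives but increase per-message overhead.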

Infrastructure:

Docker containerization
Google Cloud Run for serverless hosting
Scripted deployment with deploy.sh

The Pipeline:

Speaker speaks → Gemini Live (transcribe + translate) → Voice synthesis → Partner hears in their language
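
The flow above runs in both directions at once. The following sketch shows one way to structure that with two independent asyncio tasks, so neither speaker waits for the other. It is an illustration only: `fake_translate` is a hypothetical stand-in for the Gemini Live step that merely tags text, the language codes are examples, and strings stand in for audio.

```python
import asyncio


async def fake_translate(text, source, target):
    """Hypothetical stand-in for the translation step; it only tags the text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{source}->{target}] {text}"


async def relay(inbox, outbox, source, target):
    """Forward one speaker's utterances to the partner until a None sentinel."""
    while True:
        utterance = await inbox.get()
        if utterance is None:
            break
        await outbox.put(await fake_translate(utterance, source, target))


def drain(queue):
    """Collect everything currently in a queue into a list."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def run_call(a_says, b_says):
    a_in, b_in = asyncio.Queue(), asyncio.Queue()        # what each person says
    a_hears, b_hears = asyncio.Queue(), asyncio.Queue()  # what each person hears
    for text in a_says:
        a_in.put_nowait(text)
    for text in b_says:
        b_in.put_nowait(text)
    a_in.put_nowait(None)
    b_in.put_nowait(None)
    # Both directions run concurrently; there is no turn-taking lock.
    await asyncio.gather(
        relay(a_in, b_hears, "en", "yo"),
        relay(b_in, a_hears, "yo", "en"),
    )
    return drain(a_hears), drain(b_hears)


if __name__ == "__main__":
    print(asyncio.run(run_call(["Hello"], ["Bawo ni"])))
```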

For the AI companion mode, we use Gemini Live's native audio capabilities with a carefully crafted system prompt that makes Voxa feel like a warm, empathetic human friend rather than a robotic assistant.

Challenges we ran into

Interruption Handling: The hackathon explicitly required interruptible agents. Making two audio streams work simultaneously without blocking each other was complex. We solved this with Pipecat's native VAD integration and careful async task management.
Inactivity Detection: Our initial implementation triggered false inactivity warnings even during active conversations. The timer wasn't resetting when the agent was speaking. We fixed this by resetting the timer on both user speech AND agent responses.
Pipecat's Idle Timeout: Pipecat has its own built-in idle timeout that was cancelling pipelines unexpectedly. We had to disable it and implement our own custom inactivity checker.
Audio Latency: Achieving a real-time feel required tuning chunk sizes (100 ms), streaming binary audio over WebSockets, and avoiding any queuing that would add delay.
Nigerian Language Support: Few AI tools support Yoruba, Igbo, and Hausa. We leveraged Gemini's multilingual capabilities to make this work, which became a key differentiator. The results are not perfect, but they are close.
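
The inactivity fix described in this list comes down to treating agent speech as activity too. Below is a minimal sketch of that idea. The class and method names are hypothetical, the 30-second timeout is an arbitrary example, and the real checker runs inside the project's pipeline rather than as a standalone class.

```python
import time


class InactivityTracker:
    """Tracks the time since anyone (user or agent) last spoke."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_activity = clock()

    def on_user_speech(self):
        self._last_activity = self._clock()

    def on_agent_speech(self):
        # Without this reset, a long agent reply looks like silence.
        self._last_activity = self._clock()

    def is_idle(self):
        return self._clock() - self._last_activity >= self.timeout_s


if __name__ == "__main__":
    now = [0.0]
    tracker = InactivityTracker(30, clock=lambda: now[0])
    now[0] = 25.0
    tracker.on_agent_speech()   # agent is still talking at t=25
    now[0] = 40.0
    print(tracker.is_idle())    # only 15 s since the agent spoke
```

Passing the clock in as a parameter keeps the timing logic easy to check with a fake clock, which helps when chasing false warnings like the ones described above.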

Accomplishments that we're proud of

True Real-Time Conversation: No turn-taking. Both people can speak, interrupt, and talk over each other naturally — just like a real phone call.
Voxa AI Companion: Created an AI that genuinely feels like talking to an empathetic friend, not a chatbot. It remembers your name, offers wise advice, and speaks in your native language.
Nigerian Language Support: We're one of the few voice AI tools that support Yoruba, Igbo, and Hausa — serving over 100 million speakers often overlooked by technology.
Seamless Interruption Handling: When someone interrupts, Vox switches instantly. No lag, no awkward pause. The StatusIndicator visually proves this is working in real-time.
Voice Preservation: Translations maintain the speaker's emotional tone and natural speaking style.

What we learned

Pipecat is powerful: The framework handles the complexity of real-time audio pipelines, VAD, and interruption handling elegantly.
Gemini Live is transformative: Native audio understanding eliminates the traditional STT→LLM→TTS pipeline latency.
Interruption handling is everything: For voice AI, the difference between turn-based and interruptible is the difference between feeling robotic and feeling human.
Edge cases matter: Inactivity detection, reconnection handling, and error recovery are where real-world applications succeed or fail.
Representation matters: Supporting underserved languages like Yoruba, Igbo, and Hausa isn't just a feature — it's a statement about who technology should serve.

What's next for Vox

Voice Cloning: Full voice profile capture and synthesis so translations sound exactly like the original speaker.
Lip Reading: Use Gemini Vision to supplement audio in noisy environments for higher accuracy.
Mobile Apps: Native iOS and Android apps for on-the-go multilingual calls.
Enterprise Integration: API for businesses to embed Vox in customer service, telemedicine, and international collaboration tools.
More Languages: Expand to 50+ languages with focus on underserved African and Asian languages.
Voxa Specializations: Domain-specific AI companions — mental health support, language tutoring, career coaching — all in the user's native language.
Group Calls: Support for multilingual conference calls with multiple simultaneous translations.

Built With

docker
es6
fastapi
gemini
google-cloud
python
rnnoise
silerovad
uvicorn

Updates

ALUKO FOLAJIMI started this project — Mar 16, 2026 06:25 PM EDT
