Inspiration
Call centers still rely heavily on human agents for repetitive, low-complexity questions like order status or basic troubleshooting. We wanted to see how far a lightweight, browser-only voice assistant could go using nothing but native web APIs and a modern LLM.
What it does
VoxAgent simulates an AI call center agent. It greets the caller out loud, listens through the microphone, transcribes speech, sends it to Google's Gemini model for a response, and speaks the reply back — all in a continuous voice loop that feels like a real phone call.
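One way to picture the loop described above is as a small state machine. The sketch below is illustrative only: the state names (`greeting`, `listening`, `thinking`, `speaking`) and event names (`spoken`, `heard`, `silence`, `replied`, `failed`) are hypothetical labels, not identifiers taken from the VoxAgent source.

```javascript
// Hypothetical states and events for the voice loop; the names are
// assumptions made for illustration, not taken from the project code.
const TRANSITIONS = {
  greeting:  { spoken: 'listening' },                       // greeting finished, open the mic
  listening: { heard: 'thinking', silence: 'listening' },   // silence just keeps listening
  thinking:  { replied: 'speaking', failed: 'speaking' },   // on failure, speak an error message
  speaking:  { spoken: 'listening' },                       // reply finished, listen again
};

// Return the next state, or stay in the current state if the event
// does not apply (for example a stray "heard" while speaking).
function nextState(state, event) {
  const next = TRANSITIONS[state] && TRANSITIONS[state][event];
  return next || state;
}
```

In a browser, the `spoken` event would typically come from the `end` event of a `SpeechSynthesisUtterance`, and `heard` from the `result` event of `SpeechRecognition`. The same states could also drive the listening vs. speaking look of the orb.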
How we built it
- Speech I/O: Browser-native Web Speech API (SpeechRecognition for input, SpeechSynthesis for output) — no audio backend needed.
- AI brain: Google Gemini API (gemini-2.0-flash) handles natural language understanding and generates short, conversational replies suited for speech.
- Frontend: Plain HTML/CSS/JavaScript with a single animated "orb" UI that visually reflects listening vs. speaking states.
- Architecture: Fully client-side — the API key is entered locally in-browser and calls go directly from the client to Gemini, with no server in between.
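As a rough sketch of the direct client-to-Gemini call described above, the helper below builds a `generateContent` request for `gemini-2.0-flash`. The function name `buildGeminiRequest`, the system prompt text, and the `maxOutputTokens` value are assumptions chosen for illustration; only the model name comes from the project description.

```javascript
// Public REST endpoint for the Gemini generateContent method.
const GEMINI_URL =
  'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent';

// Assumed prompt: the real wording used by VoxAgent is not given in the write-up.
const SYSTEM_PROMPT =
  'You are a friendly call center agent. Reply in one or two short spoken sentences.';

// Build the URL and fetch options for one conversational turn.
// `history` is an array of earlier turns: { role: 'user' | 'model', parts: [{ text }] }.
function buildGeminiRequest(apiKey, history, userText) {
  const contents = history.concat([{ role: 'user', parts: [{ text: userText }] }]);
  return {
    url: GEMINI_URL,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': apiKey, // key typed in by the user, never sent to our own server
      },
      body: JSON.stringify({
        system_instruction: { parts: [{ text: SYSTEM_PROMPT }] },
        contents,
        generationConfig: { maxOutputTokens: 120 }, // assumed cap to keep replies short
      }),
    },
  };
}
```

The result can be passed straight to `fetch(url, options)` in the browser. Keeping the request builder separate from the network call makes it easy to inspect or test without a real API key.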
Challenges we ran into
- Handling the Web Speech API's no-speech timeout correctly: initially any silence was treated as a fatal error, so we rewrote the error handling to distinguish a normal pause from real failures (permissions, network, no mic).
- Debugging silent API failures: Gemini was returning structured error responses (e.g. quota limits) that were getting swallowed by a generic fallback message, so we added proper error surfacing to make failures debuggable.
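Both fixes can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names `classifySpeechError` and `extractReply` and the message strings are hypothetical, and treating `aborted` as non-fatal is our assumption. The error codes are the standard values of `SpeechRecognitionErrorEvent.error`, and the `error.code` / `error.status` / `error.message` fields follow the shape of Gemini's JSON error responses.

```javascript
// Codes that signal a normal pause rather than a failure.
// 'aborted' is included as an assumption (it fires when recognition is stopped on purpose).
const RECOVERABLE_SPEECH_ERRORS = new Set(['no-speech', 'aborted']);

// Human-readable messages for real failures (wording is illustrative).
const FATAL_SPEECH_MESSAGES = {
  'not-allowed': 'Microphone permission was denied.',
  'service-not-allowed': 'Speech recognition is not allowed in this browser.',
  'audio-capture': 'No microphone was found.',
  'network': 'Speech recognition could not reach the network.',
};

// Decide whether a SpeechRecognition error code should end the call.
function classifySpeechError(code) {
  if (RECOVERABLE_SPEECH_ERRORS.has(code)) {
    return { fatal: false, message: null }; // just restart listening
  }
  return {
    fatal: true,
    message: FATAL_SPEECH_MESSAGES[code] || `Speech recognition failed: ${code}`,
  };
}

// Pull the reply text out of a parsed Gemini response, surfacing
// structured errors instead of hiding them behind a generic fallback.
function extractReply(json) {
  if (json && json.error) {
    const { code, status, message } = json.error;
    throw new Error(`Gemini error ${code} (${status}): ${message}`);
  }
  const parts = json?.candidates?.[0]?.content?.parts;
  const text = parts ? parts.map((p) => p.text || '').join('').trim() : '';
  if (!text) throw new Error('Gemini returned no text');
  return text;
}
```

In the browser, `classifySpeechError(event.error)` would be called from the recognizer's `error` handler, and `extractReply(await response.json())` after the `fetch` call, so that a quota error shows up in the console (or the UI) with its real message.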
Accomplishments that we're proud of
Getting a fully working, natural-feeling voice conversation loop running with zero backend infrastructure — just static HTML and two browser APIs talking to an LLM.
What we learned
How forgiving (or unforgiving) browser speech APIs can be, and how important clear error surfacing is when chaining together multiple async, real-time systems (mic → recognition → LLM → speech synthesis).
What's next for VoxAgent
- Support for interruption (barge-in) so callers can cut off the assistant mid-sentence
- Multi-language support
- Optional handoff/escalation flow to a real human agent
Built With
- css3
- google-gemini
- html5
- javascript
- web-speech-api