Inspiration

We wanted a voice assistant that feels like a real teammate: hands-free, multi-language, and able to handle real work (todos, calendar, mortgage, healthcare) instead of simple Q&A. We built Convonet to turn browsers, voice, cloud, and tools into one coherent agent.

What it does

Convonet is a voice AI productivity assistant. You can speak to it in any of 30+ languages, and it manages todos, calendar events, teams, mortgage applications, and healthcare lookups. It uses 38 MCP tools, supports Claude, Gemini, and OpenAI models, and can transfer calls to human agents. It runs in the browser over LiveKit WebRTC and is deployed in production.

How we built it

We use LangGraph for agent orchestration, Flask + SocketIO for real-time voice, and LiveKit for low-latency WebRTC. STT is Deepgram/Modulate; TTS is Deepgram/ElevenLabs/Cartesia. MCP tools run as subprocesses; Redis stores sessions and state. Tavily powers web search; Twilio/FusionPBX handle call transfer.

Challenges we ran into

Multi-provider streaming: Claude, Gemini, and OpenAI each expose a different streaming API, so we needed a shared abstraction over all three.

MCP env vars: MCP subprocesses didn’t inherit the parent's environment variables; we had to merge the parent environment with the env from the MCP config.

Voice UX for web search: Long, markdown-heavy results were hard to listen to; we added summary-first formatting, markdown stripping, and playback in chunks of roughly 30 seconds.

Barge-in: Saying “stop” during playback needed both VAD detection on the client and server-side handling of the transcribed “stop”.
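The voice formatting steps above (summary first, markdown stripped, roughly 30-second chunks) can be sketched as follows. This is a minimal illustration, not Convonet's actual code: the function names, the regular expressions, and the assumed speaking rate of 2.5 words per second are all hypothetical choices made for the example.

```python
import re
from typing import List

# Assumption: about 150 words per minute of synthesized speech.
WORDS_PER_SECOND = 2.5


def strip_markdown(text: str) -> str:
    """Remove common markdown syntax so the TTS engine does not read symbols aloud."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)   # fenced code blocks
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)     # images -> alt text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)      # links -> link label
    # Heading markers and list bullets at the start of a line.
    text = re.sub(r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*_`~]+", "", text)                       # emphasis, inline code
    return re.sub(r"\s+", " ", text).strip()


def chunk_for_speech(text: str, max_seconds: float = 30.0) -> List[str]:
    """Pack whole sentences into chunks of at most ~max_seconds of speech.

    A single sentence longer than the budget is kept whole as its own chunk.
    """
    max_words = int(max_seconds * WORDS_PER_SECOND)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def format_for_voice(summary: str, details: str, max_seconds: float = 30.0) -> List[str]:
    """Return playback chunks: the summary first, then the chunked details."""
    return [strip_markdown(summary)] + chunk_for_speech(strip_markdown(details), max_seconds)
```

Splitting on sentence boundaries keeps each chunk a natural place to pause, which also gives barge-in a clean point at which to stop playback.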

Accomplishments that we're proud of

Production deployment with multi-LLM support and fallbacks. Domain-specific agents (productivity, mortgage, healthcare) with sticky context. Real-time web search via Tavily with voice-friendly formatting. Emotion-aware TTS and multi-language support (30+ languages). 38 MCP tools orchestrated through a single LangGraph pipeline.

What we learned

MCP subprocesses did not inherit the parent environment the way an ordinary child process does by default, so the environment has to be merged explicitly. Voice UX needs different formatting than text: no markdown, summary first, chunked playback. Multi-provider support reduces vendor lock-in and improves reliability. Barge-in and “stop” require both client-side VAD and server-side intent handling.
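The explicit env merging mentioned above might look like the sketch below, which uses only the Python standard library. The helper names `build_child_env` and `launch_tool` are hypothetical, and the assumption is that each tool's configuration carries its own dictionary of variables that should override the parent's values.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_child_env(config_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Start from the parent's environment, then let config values override it."""
    env = dict(os.environ)          # copy, so the parent process is not mutated
    env.update(config_env or {})    # config wins on conflicts
    return env


def launch_tool(command: List[str], config_env: Optional[Dict[str, str]] = None) -> subprocess.Popen:
    """Launch a tool subprocess with the merged environment and stdio pipes."""
    return subprocess.Popen(
        command,
        env=build_child_env(config_env),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
```

Copying the parent environment first and applying the config second means a tool still sees variables such as `PATH`, while any key set in its config takes precedence.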

What's next for Convonet Voice AI

RAG over internal docs and knowledge bases. Streaming STT/TTS for lower latency. Voice cloning for personalized assistants. Mobile app and phone-first flows.
