Voice Training Agent

Inspiration

We're starting out in our part-time jobs, learning sales techniques and building product knowledge. One of the hardest skills to pick up isn't memorizing a script — it's handling real customers professionally when the situation, tone, and problem all change at once.

We built Voice Training Agent so we could practice that safely: talk to different types of customers, work through realistic problems, and get feedback before we're on a live call. The goal is to grow communication and problem-solving skills the same way you'd drill any other skill — with repetition, variety, and honest review.

What it does

The app gives employees a full practice loop in four parts:

Personas — Users pick or create a customer persona that sets the tone for the voice agent. Personas can be AI-generated (researched from company docs or the web, then reviewed by a critic agent) or created manually. Each persona includes scenario context, emotional patterns, and a win condition so practice feels specific, not generic.
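As a rough illustration of what such a persona record could carry, here is a minimal sketch. Every field name is an assumption made for this example; the writeup only states that a persona includes scenario context, emotional patterns, and a win condition, and that it can be AI-generated or created manually.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """Illustrative persona record; all field names are hypothetical."""

    name: str
    company_id: str
    scenario_context: str
    emotional_patterns: list[str]
    win_condition: str
    source: str = "manual"  # assumed values: "manual" or "ai_generated"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the persona is usable."""
        problems = []
        if not self.scenario_context.strip():
            problems.append("scenario_context is empty")
        if not self.emotional_patterns:
            problems.append("emotional_patterns is empty")
        if not self.win_condition.strip():
            problems.append("win_condition is empty")
        if self.source not in ("manual", "ai_generated"):
            problems.append(f"unknown source: {self.source}")
        return problems
```

A check like `validate()` is one cheap way to keep manually created personas as specific as the generated ones before they reach a voice session.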

Practice — A live voice session powered by Gemini Live. The AI plays the customer; the employee practices how they'd handle the conversation in real time. Transcripts are saved automatically when the call ends.

Coaching — After each session, a coach agent generates a report on how well you did: what you communicated well, what you missed, metrics tied to the persona rubric, what the customer wanted to hear, and how they were likely feeling throughout the call. The feedback is designed from both the employee and customer perspective — not just "you said X" but "here's how that landed."

Knowledge — An agentic RAG chatbot answers questions about a particular company while employees practice. If someone isn't sure about a return policy, membership tier, or escalation path, they can ask mid-session instead of guessing. Answers are grounded in ingested company documents with cited sources.

Together, this is: learn the company → pick a customer → practice the call → understand what to improve.

How we built it

Architecture

Browser
  ├─ HTTPS /api/*  →  Cloud Run: voice-training-api (FastAPI + React SPA)
  │                      ├─ chat_agent           → Mongo hybrid RAG
  │                      ├─ persona_generator    → search / web_search / critic
  │                      ├─ coach_agent          → post-call analysis
  │                      └─ Mongo                → chunks, personas, transcripts, coach_reports
  └─ WSS voice     →  Cloud Run: gemini-proxy  →  Vertex Gemini Live

The React frontend (apps/voice-training/) and FastAPI backend ship as one Cloud Run service. The SPA builds to web/dist and is served on the same origin as /api/*, so there's no CORS complexity. Voice is separate: browsers can't authenticate to Vertex directly, so a lightweight gemini-proxy WebSocket service adds GCP credentials and bridges audio to Gemini Live.

Tech stack

Layer	Choices
Agents	Google ADK on Vertex AI
LLMs	`gemini-2.5-flash-lite` (chat, search, persona, coach); `gemini-live-2.5-flash-native-audio` (voice)
Embeddings	`gemini-embedding-001` (768-dim)
Backend	FastAPI + Uvicorn, Python 3.12
Frontend	React 19 + Vite 7 + Framer Motion
Database	MongoDB Atlas — vector search, full-text search, document store
Deploy	Google Cloud Run + Cloud Build

Multi-agent design

We spent significant time on agent architecture — aiming for agents that are simple, focused, and efficient rather than one giant prompt doing everything. Six ADK agents power the backend:

Agent	Role
Chat	Knowledge Q&A. Always searches Mongo before answering; returns prose + cited sources.
Search	Agentic RAG for persona research — decomposes goals, hybrid retrieval, structured brief.
Web search	Fallback when no company docs exist; Gemini Google Search grounding.
Persona generator	Orchestrator: research → draft → critic loop → validate → save to Mongo.
Persona critic	Quality gate on consistency, grounding, realism, rubric, and ethics.
Coach	Post-call analysis: gaps, frustration timeline, rubric self-check, improvements.

The persona pipeline streams progress to the UI over Server-Sent Events. Sub-agents run in isolated ADK sessions so each keeps its own tools and context.
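The shape of that streaming pipeline can be sketched with a plain generator. This is not the project's code: the stage callables (`research`, `draft`, `critique`), the event names, and the `max_rounds` limit are all hypothetical stand-ins for the ADK agents, and the sketch stops before the validate and save steps. Only the frame format (an `event:` line, a `data:` line, then a blank line) comes from the Server-Sent Events standard.

```python
import json
from typing import Callable, Iterator


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event name, JSON data, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def run_persona_pipeline(
    research: Callable[[], str],
    draft: Callable[[str, str], dict],
    critique: Callable[[dict], tuple[bool, str]],
    max_rounds: int = 3,
) -> Iterator[str]:
    """Yield SSE frames while running research, then a draft/critic loop."""
    yield sse_frame("stage", {"name": "research"})
    brief = research()
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        yield sse_frame("stage", {"name": "draft", "round": round_no})
        persona = draft(brief, feedback)
        yield sse_frame("stage", {"name": "critic", "round": round_no})
        approved, feedback = critique(persona)
        if approved:
            yield sse_frame("done", {"persona": persona})
            return
    # The critic never approved within the round limit.
    yield sse_frame("error", {"reason": "critic did not approve", "rounds": max_rounds})
```

In a FastAPI app, a generator like this could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"` so the UI can render each stage as it happens.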

RAG pipeline

Company knowledge is ingested from markdown support docs (demo datasets for Olive Young and Stripe):

Structure-aware chunking on markdown headings with breadcrumb context
Embed with gemini-embedding-001 (separate query/document task types)
Store in MongoDB chunks, scoped by company_id
Retrieve via hybrid search — Atlas Vector Search + full-text search, fused with Reciprocal Rank Fusion
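The first and last steps above can be sketched in plain Python. This is a simplified illustration, not the project's implementation: the chunk dictionary shape and the function names are assumptions, the chunker does not special-case `#` lines inside fenced code blocks, and `k = 60` is the constant commonly used with Reciprocal Rank Fusion rather than a value taken from the project.

```python
import re
from collections import defaultdict

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split markdown on headings; each chunk carries its heading breadcrumb."""
    chunks, trail, body = [], [], []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            breadcrumb = " > ".join(title for _, title in trail)
            chunks.append({"breadcrumb": breadcrumb, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own breadcrumb
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum of 1 / (k + rank), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because the fusion uses only rank positions, it sidesteps the problem that vector similarity scores and full-text relevance scores live on different scales.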

To measure retrieval and answer quality within the hackathon's time limits, we built scripts/run_chat_eval.py and a 22-question gold eval set.
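The contents of scripts/run_chat_eval.py are not shown here, so the following is only a sketch of how the retrieval half of such an eval could be scored. It assumes a hypothetical gold-set format in which each item has a `question` and a set of `relevant` chunk IDs, and a `retrieve` callable that returns ranked chunk IDs. Answer quality would need a separate check and is not covered.

```python
from statistics import mean
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(gold: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> dict:
    """Average recall@k and MRR over a non-empty gold set."""
    recalls, ranks = [], []
    for item in gold:
        retrieved = retrieve(item["question"])
        relevant = set(item["relevant"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    return {"recall_at_k": mean(recalls), "mrr": mean(ranks)}
```

With a small gold set, per-question results are often more informative than the averages, so printing the misses alongside these numbers is worth the extra lines.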

Voice + coaching flow

POST /api/live/session returns a signed token, proxy URL, and voice prompt built from the persona
Browser connects to gemini-proxy → Vertex Live for bidirectional audio + transcription
Transcript saves to Mongo on call end
Coach agent loads transcript + persona and generates a structured report
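The writeup does not say how the session token is signed, so the following is one possible scheme rather than the project's: a short-lived HMAC-SHA256 token that the API issues and the proxy verifies with a shared secret. The payload fields (`sid`, `exp`), the token layout, and the function names are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, session_id: str, ttl_seconds: int = 60,
                now: Optional[float] = None) -> str:
    """Return 'payload.signature', both base64url encoded."""
    issued = time.time() if now is None else now
    payload = {"sid": session_id, "exp": int(issued + ttl_seconds)}
    body = _b64(json.dumps(payload, separators=(",", ":")).encode())
    sig = _b64(hmac.new(secret, body.encode(), hashlib.sha256).digest())
    return f"{body}.{sig}"


def verify_token(secret: bytes, token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        body, sig = token.split(".")
    except ValueError:
        return None
    expected = _b64(hmac.new(secret, body.encode(), hashlib.sha256).digest())
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    payload = json.loads(_unb64(body))
    current = time.time() if now is None else now
    return payload if payload["exp"] > current else None
```

Keeping the lifetime short limits the damage if a token leaks from the browser, since the token only needs to survive long enough to open the WebSocket.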

Tooling and workflow

We used Cursor and Claude Code heavily for iteration, plus the MongoDB MCP server to explore schemas and debug queries during integration. Deploy is a single script: ./scripts/deploy/deploy.sh (frontend + backend bundled; optional --with-proxy for voice).

Challenges we ran into

Agent architecture — We spent a lot of time planning and refining the design: which agents own which tools, how the persona pipeline chains sub-agents, and where to draw boundaries so each agent stays simple. Over-engineering early cost us time; the final design — specialized agents with clear handoffs — was worth the iteration.

MongoDB integration — Wiring Atlas Vector Search, full-text indexes, tenant-scoped collections, and the application layer was confusing at first. The MongoDB MCP server helped us inspect data and validate queries, but understanding how search indexes, embeddings, and the Python driver fit together took real debugging.
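For readers facing the same wiring, here is a hedged sketch of the two aggregation pipelines a hybrid query might run against the chunks collection. The `$vectorSearch` and `$search` stages are real Atlas features, but the index names (`chunks_vector`, `chunks_text`) and field names (`embedding`, `text`) are assumptions, and filtering inside `$vectorSearch` only works if `company_id` is declared as a filter field in the vector index. The `$match` after `$search` is a simplification; a `compound` query with a filter clause would usually be more efficient.

```python
def vector_pipeline(query_vector: list[float], company_id: str, limit: int = 10,
                    index: str = "chunks_vector", path: str = "embedding") -> list[dict]:
    """Semantic retrieval scoped to one company (index and path names are hypothetical)."""
    return [
        {"$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": query_vector,
            "numCandidates": limit * 10,  # oversample, then keep the top `limit`
            "limit": limit,
            "filter": {"company_id": {"$eq": company_id}},
        }},
        {"$project": {"_id": 1, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


def text_pipeline(query: str, company_id: str, limit: int = 10,
                  index: str = "chunks_text", path: str = "text") -> list[dict]:
    """Keyword retrieval scoped to one company (index and path names are hypothetical)."""
    return [
        {"$search": {"index": index, "text": {"query": query, "path": path}}},
        {"$match": {"company_id": company_id}},
        {"$limit": limit},
        {"$project": {"_id": 1, "text": 1, "score": {"$meta": "searchScore"}}},
    ]
```

Each list can be passed to the Python driver's `collection.aggregate(...)`, and the two ranked result lists are then what a rank-fusion step would merge.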

Agentic RAG quality — Hard to validate retrieval quality in a short hackathon window, especially with limited demo data. Wrong chunks can produce confident-sounding wrong answers. We addressed this with hybrid search, a critic loop for personas, citation requirements in the chat agent, and an eval script — but RAG tuning is still an open problem as we add more companies.

Voice on the web — Browsers can't call Vertex Live directly. We built a signed-token WebSocket proxy, Web Audio worklets for capture/playback, and guardrails (rate limits, prompt caps, SSRF protection on the proxy).
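Two of those guardrails can be illustrated with a short sketch. The project's actual limits and checks are not described, so this uses a generic token-bucket rate limiter and a strict upstream allowlist as one way to approach rate limiting and SSRF protection. The class, function, and host names are hypothetical.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def is_allowed_upstream(url: str, allowed_hosts: set[str]) -> bool:
    """Only permit wss:// connections to explicitly allowlisted hostnames."""
    parts = urlsplit(url)
    return parts.scheme == "wss" and parts.hostname in allowed_hosts
```

An allowlist is stricter than trying to block private IP ranges, and it fits a proxy that only ever needs to reach one upstream service.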

Accomplishments that we're proud of

Shipping something complete and functional — Not just a demo slide or a single agent in isolation, but an end-to-end product: knowledge chat, persona generation, live voice practice, and coaching reports, all deployed and usable.

Seeing the multi-agent design work in practice — Watching the persona pipeline research, draft, get critiqued, revise, and save — then using that persona in a live call and getting a coach report back — felt genuinely rewarding.

Team growth — Several of us were still learning the stack (Google ADK, MongoDB, Cursor, Claude Code) during the hackathon. Everyone contributed meaningfully, and we're proud of how quickly the team leveled up together.

What we learned

Beyond the tools — ADK, MongoDB, AI-assisted development — the bigger lesson was how to build as a team. You can only go so far alone; with clear ownership, async communication, and trust, the ceiling is much higher.

We also learned what it means to think like product engineers: not just "can we build it?" but "does this solve a real problem for someone practicing customer conversations?" That product sense — tying features back to how employees actually learn — shaped what we kept in scope and what we cut.

What's next for Voice Training Agent

Auth and RBAC — Manager vs. employee roles. Managers see coaching reports across their team; employees see their own history and progress.

Company-specific onboarding — Ingest a customer's real support docs, ticket patterns, and policies to refine agentic RAG, persona win conditions, and emotional escalation curves based on actual interactions — not synthetic demo data.

Manager dashboard — Cohort analytics: common weaknesses, improvement trends, which persona types reps struggle with most.

Richer practice experience — Image or video avatars so employees practice against a visual "customer" with facial expressions and emotional cues, not just a waveform on a screen.

Targeted drills — Auto-generate follow-up scenarios from coach report weaknesses so reps can immediately practice what they missed.

Built With

cloud-build
docker
fastapi
framer-motion
gemini
gemini-embeddings
gemini-live-for-voice
generative-ai
google-adk
google-cloud-run
javascript
mongodb
pydantic
python
rag
react
server-sent-events
vertex-ai
vite
websocket

Updates

Private user started this project — Jun 11, 2026 10:08 AM EDT
