Inspiration
Sales training is notoriously difficult to scale. It usually relies on expensive human coaching or passive video courses that don't build muscle memory. We looked at existing "voice AI" roleplay tools and found them frustratingly robotic—the lag between speaking and hearing a response broke the immersion. We wanted to build the "Flight Simulator for Sales": a training environment so realistic, responsive, and emotionally intelligent that users forget they are talking to an AI.
What it does
SalesCoach AI is a real-time voice training platform where sales reps practice pitches against diverse AI personalities (e.g., "The Skeptic," "The Price-Sensitive," "The Enthusiast").
- Live Roleplay: Users speak naturally to the AI, which listens, interrupts, thinks, and responds instantly with human-like intonation.
- Instant Evaluation: As soon as the call ends, the system generates a comprehensive performance report, grading the user on empathy, objection handling, and closing technique.
- Audio Playback: Users can review their calls and accurate transcripts to pinpoint exactly where they won or lost the deal.
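To make the evaluation concrete, here is a minimal sketch of what the post-call report could look like as a typed schema. The field names, 0-10 rubric, and `parse_report` helper are our own illustration of the idea, not the shipped API:

```python
import json
from dataclasses import dataclass

@dataclass
class CallReport:
    # Hypothetical rubric: each dimension scored 0-10 by the model.
    empathy: int
    objection_handling: int
    closing: int
    summary: str

    @property
    def overall(self) -> float:
        # Simple unweighted average of the three rubric scores.
        return round((self.empathy + self.objection_handling + self.closing) / 3, 1)

def parse_report(raw: str) -> CallReport:
    """Parse the model's structured-JSON feedback into a typed report."""
    data = json.loads(raw)
    return CallReport(
        empathy=data["empathy"],
        objection_handling=data["objection_handling"],
        closing=data["closing"],
        summary=data["summary"],
    )

report = parse_report(
    '{"empathy": 8, "objection_handling": 6, "closing": 7,'
    ' "summary": "Strong rapport; try closing earlier."}'
)
print(report.overall)  # 7.0
```

Asking the model for structured JSON (rather than free-form prose) is what makes the report renderable as a dashboard rather than a wall of text.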
How we built it
We went "all in" on the Gemini ecosystem to maximize performance and velocity:
- Core Engine: We utilized the Gemini Live API (Native Audio) for the voice interaction. Instead of chaining separate STT (Speech-to-Text) and TTS models, we stream audio directly to/from Gemini. This allows the model to hear tone and respond with emotion in a single connection.
- Evaluation: We used Gemini 2.5 Flash for the post-call analysis. Its massive context window and blistering speed let us process long transcripts and return structured JSON feedback almost instantly.
- Backend/Frontend: Built with FastAPI (Python) handling the WebSockets and Next.js for the dashboard.
- Dev Speed: The entire platform was "Vibe Coded" using Gemini 3 Pro (Preview) inside GitHub Copilot. We treated the AI as a co-founder, generating complex Docker configurations and React components in minutes.
Challenges we ran into
- Latency & Interruptions: Handling real-time audio streams is hard. We had to ensure that when a user interrupts the AI, the AI stops speaking immediately. Switching to Gemini's Native Audio stream solved this elegantly compared to legacy solutions.
- Dockerizing Audio: Getting audio dependencies and WebSocket streams to play nicely inside a containerized environment across different platforms took some serious debugging.
- Prompt Engineering for Voice: Teaching the AI to "sound" natural (using fillers, pausing correctly) required a different kind of prompting than text-based chat.
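The interruption handling above boils down to one move: the moment user speech is detected, cancel the task that is streaming AI audio out. A hypothetical sketch (the `speak`/`session` names and simulated timings are ours, not the real pipeline):

```python
import asyncio

async def speak(chunks: list, played: list) -> None:
    """Stream AI audio chunks to the client, one at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # simulate playback pacing

async def session() -> list:
    played: list = []
    ai_turn = asyncio.create_task(speak(["a", "b", "c", "d"], played))
    await asyncio.sleep(0.12)  # ...user starts talking mid-turn
    ai_turn.cancel()           # barge-in: stop speaking immediately
    try:
        await ai_turn
    except asyncio.CancelledError:
        pass                   # cancellation is the expected path here
    return played

print(asyncio.run(session()))
```

Because cancellation lands between chunks, the AI goes quiet within one chunk's duration of the barge-in, which is what makes interruptions feel natural.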
Accomplishments that we're proud of
- True Real-Time Conversation: We achieved a conversational flow that feels human. The "turn-taking" latency is practically non-existent.
- Full Stack in Record Time: We built a production-ready application—complete with auth, database, frontend, and AI pipelines—in a fraction of the time it would normally take, thanks to Gemini 3 Pro.
- The "Vibe": It just feels cool to use. The first time the AI laughed at a joke we made during a test call, we knew we had something special.
What we learned
- Multimodal > Chained Models: The future of voice agents is native multimodal models. Chaining STT -> LLM -> TTS is dead technology; stacking three inference hops makes the latency penalty just too high.
- Development is changing: Using high-reasoning models like Gemini 3 Pro allows developers to focus purely on architecture and user experience, while the AI handles the implementation details.
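A back-of-the-envelope budget makes the latency argument concrete. Every number below is an illustrative assumption for the sake of the arithmetic, not a measurement of any specific service:

```python
# Illustrative per-stage time-to-first-audio budgets, in milliseconds.
# These are assumed round numbers, not benchmarks.
chained = {
    "STT finalizes utterance": 300,
    "LLM first token": 400,
    "TTS first audio": 250,
    "extra network hops": 150,
}
native = {
    "Live session first audio": 400,
    "network": 100,
}

chained_total = sum(chained.values())  # 1100 ms
native_total = sum(native.values())    # 500 ms
print(chained_total, native_total)
```

Even with generous assumptions, the chained pipeline pays for three sequential inferences plus the hops between them, while the native path pays once.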
What's next for SalesCoach AI
- Custom Persona Generator: Allowing managers to clone their own voice and upload past call transcripts to generate "Super Customer" personas.
- Emotional Analytics: Using Gemini 2.5’s vision capabilities to analyze user facial expressions via webcam during the pitch.
- VR Integration: Moving the gym from the browser to a fully immersive VR boardroom.