Inspiration
Sales reps lose deals not because they don't know the product, but because they don't get enough reps. Roleplay practice is awkward to schedule, colleagues go easy, and there's no feedback loop. We wanted to build something a rep could open at 11pm the night before a big call and actually get better from.
What We Built
SalesCoach AI is a voice-first sales training simulator. A rep picks a scenario — prospect role, industry, mood, objection type, deal stage — and gets connected to a fully AI-generated prospect with a real name, company, and business pressures. They have a live voice conversation, and when the call ends they get AI coaching across six dimensions: discovery, objection handling, rapport, value communication, closing, and listening.
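For concreteness, the scenario knobs and the six coaching dimensions map onto two small records. This is a hypothetical sketch of the shape, not the project's actual schema; all field names and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    prospect_role: str    # e.g. "VP of Operations"
    industry: str         # e.g. "logistics"
    mood: str             # e.g. "skeptical", "rushed"
    objection_type: str   # e.g. "budget", "happy with incumbent"
    deal_stage: str       # e.g. "discovery", "negotiation"

@dataclass
class CoachingReport:
    """Post-call scores across the six coaching dimensions."""
    discovery: int
    objection_handling: int
    rapport: int
    value_communication: int
    closing: int
    listening: int
```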
How We Used the Higgs Audio Models
The core pipeline is entirely voice-native:
- Boson HiggsAudioM3 (higgs-audio-understanding-v3.5) receives the sales rep's audio directly and generates the prospect's response as text — combining ASR and LLM understanding in a single model call. This is what makes the conversation feel natural; there's no clunky transcribe-then-reason pipeline.
- Eigen AI higgs2p5 speaks the prospect's response back via WebSocket streaming TTS, with gendered voices (Jack or Linda). Chunks stream in real time, so the rep hears the response as it's generated rather than waiting for the full audio to render.
- Eigen AI higgs_asr_3 transcribes the rep's side of the call in parallel for the coaching and session history layers.
We also use gpt-oss-120b via Eigen AI for scenario generation and post-call coaching analysis.
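Stripped to its shape, one turn chains the three models roughly like this. It's a sequential sketch for orientation only: the function names, `session` object, and signatures are placeholders, not the actual Boson/Eigen SDK calls, and in practice the steps overlap (see Challenges below):

```python
from typing import AsyncIterator

# Placeholder stubs standing in for the real model calls:
async def understand_audio(pcm: bytes, history: list) -> str: ...  # Boson HiggsAudioM3
async def transcribe(pcm: bytes) -> str: ...                       # Eigen higgs_asr_3

async def stream_tts(text: str, voice: str) -> AsyncIterator[bytes]:
    yield b""  # Eigen higgs2p5 chunks arriving over the WebSocket session

async def run_turn(rep_audio: bytes, session) -> None:
    # Audio in, prospect text out: a single HiggsAudioM3 call stands in for
    # the usual transcribe-then-reason hop.
    reply = await understand_audio(rep_audio, session.history)

    # Forward TTS chunks to the browser as they are generated.
    async for chunk in stream_tts(reply, voice=session.voice):
        await session.send_audio(chunk)

    # The rep-side transcript feeds the coaching and session-history layers.
    session.history.append((await transcribe(rep_audio), reply))
```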
Challenges
Getting the audio pipeline latency down. The naive approach — wait for ASR, wait for LLM, wait for TTS, then play — would mean 5+ seconds of silence after every turn. We solved this by running ASR and Boson AI concurrently, streaming TTS chunks over WebSocket as they arrive, and sending the prospect's response text to the frontend the moment the LLM returns — before TTS even starts. This brought perceived latency down to ~1.2 seconds.
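Revisiting the sequential sketch above, the parallel version looks roughly like this. The stubs and their sleep durations are made up to illustrate the overlap, not measured values, and `send_json`/`send_bytes` stand in for whatever wraps the WebSocket connection:

```python
import asyncio

# Illustrative stand-ins with invented latencies:
async def understand_audio(pcm: bytes) -> str:
    await asyncio.sleep(1.0)                 # HiggsAudioM3 round trip
    return "I'm not sure we have budget for this quarter."

async def transcribe(pcm: bytes) -> str:
    await asyncio.sleep(0.8)                 # higgs_asr_3 round trip
    return "What's holding you back from moving forward?"

async def stream_tts(text: str):
    for _ in range(4):                       # higgs2p5 chunks as they render
        await asyncio.sleep(0.3)
        yield b"\x00" * 640

async def handle_turn(pcm: bytes, send_json, send_bytes) -> None:
    # Fire the understanding call and the side-channel ASR at the same time
    # instead of chaining ASR -> LLM -> TTS.
    reply_task = asyncio.create_task(understand_audio(pcm))
    asr_task = asyncio.create_task(transcribe(pcm))

    reply = await reply_task
    # Push the reply text to the frontend immediately, before TTS starts,
    # so the UI has something to show while audio is still rendering.
    await send_json({"type": "prospect_text", "text": reply})

    async for chunk in stream_tts(reply):
        await send_bytes(chunk)              # play each chunk on arrival

    transcript = await asr_task              # finished in parallel, for coaching

async def demo() -> None:
    async def send_json(msg): print("ui <-", msg)
    async def send_bytes(b): print(f"audio <- {len(b)} bytes")
    await handle_turn(b"", send_json, send_bytes)

asyncio.run(demo())
```

With the illustrative sleeps above, the first audio chunk lands about 1.3 s after the rep stops talking, versus 2.1 s if the calls were chained and 3.0 s if playback waited for the full render.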
Keeping the prospect in character. Audio understanding models follow complex roleplay instructions less reliably than pure LLMs. Early versions had the prospect accidentally pitching the product instead of being sold to. We fixed this with explicit identity anchoring in the system prompt and a strict role reminder injected on every turn.
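The fix looks roughly like the sketch below, assuming an OpenAI-style messages list. The prompt text is paraphrased rather than our exact wording, and the scenario values are invented examples:

```python
SYSTEM_PROMPT = """\
You are {name}, {role} at {company}. You are the PROSPECT on a sales call.
You are being sold to. You never pitch, demo, or recommend the product.
Stay in character: {mood}, under pressure from {pressure}, at the {stage} stage.
"""

ROLE_REMINDER = (
    "Reminder: you are {name}, the prospect. The other speaker is the sales rep. "
    "Respond only as the prospect would."
)

def build_messages(scenario: dict, history: list, rep_audio_parts: list) -> list:
    msgs = [{"role": "system", "content": SYSTEM_PROMPT.format(**scenario)}]
    msgs += history
    # Re-anchor identity on every turn; audio-understanding models drift
    # out of role more readily than text-only LLMs.
    msgs.append({"role": "system", "content": ROLE_REMINDER.format(**scenario)})
    msgs.append({"role": "user", "content": rep_audio_parts})
    return msgs

# Illustrative scenario values, not real data:
scenario = {"name": "Dana Mercer", "role": "COO", "company": "Brightline Freight",
            "mood": "skeptical", "pressure": "rising fuel costs", "stage": "discovery"}
messages = build_messages(scenario, history=[], rep_audio_parts=[])
```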
Boson AI audio constraints. HiggsAudioM3 requires audio chunked to ≤4 seconds at 16kHz with indexed MIME types (audio/wav_0, audio/wav_1, ...). We built a resampling and chunking layer using scipy to handle this transparently, regardless of the browser's native sample rate.
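A condensed version of that layer is sketched below. The 16 kHz and 4-second constraints come from the paragraph above; the helper names and the `{"mime_type", "data"}` part shape are ours, not the API's:

```python
import io
import wave
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000          # model expects 16 kHz
CHUNK_SECONDS = 4           # model expects <=4 s per part

def to_wav_bytes(chunk: np.ndarray, sr: int) -> bytes:
    """Encode float32 samples in [-1, 1] as 16-bit mono PCM WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes((np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16).tobytes())
    return buf.getvalue()

def prepare_audio(pcm: np.ndarray, source_sr: int) -> list[dict]:
    """Resample to 16 kHz and split into <=4 s parts with indexed MIME types."""
    if source_sr != TARGET_SR:
        # Polyphase resampling from whatever rate the browser delivered.
        g = gcd(source_sr, TARGET_SR)
        pcm = resample_poly(pcm, TARGET_SR // g, source_sr // g)

    per_chunk = TARGET_SR * CHUNK_SECONDS
    return [
        {
            # Indexed MIME types required by HiggsAudioM3: audio/wav_0, audio/wav_1, ...
            "mime_type": f"audio/wav_{i // per_chunk}",
            "data": to_wav_bytes(pcm[i : i + per_chunk], TARGET_SR),
        }
        for i in range(0, len(pcm), per_chunk)
    ]
```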
What We Learned
Real-time voice AI is a different design space from text. Latency isn't just a performance metric — it's the difference between a conversation feeling alive and feeling like a dictation machine. Getting every layer of the pipeline to stream and parallelize was the most rewarding engineering challenge of the project.
Built With
- boson-ai-higgsaudiom3
- docker
- eigen-ai-higgs-asr-3
- eigen-ai-higgs2p5
- fastapi
- gpt-oss-120b
- next.js
- python
- scipy
- sqlite
- typescript
- websockets