Inspiration
I usually don't listen to music, so I don't carry headphones, but when I'm walking to the store, taking my dog out, or waiting on something, I want to listen to something useful. So I thought: why not build voice agents that argue with each other in real time, so I could get insights from the friction instead of relying on one AI's perspective?
That question became PocketPanel.
What it does
PocketPanel runs a live voice debate/argument/podcast/rapid-crossfire between two AI agents on any topic you give it. You pick a subject, an agent classifies the intent, and then two Nova Sonic agents take opposing sides and argue back and forth, generating speech and text simultaneously with natural prosody, emphasis, and rhythm. It sounds like two people disagreeing, not two scripts being narrated.
You can inject a question mid-debate. The next agent receives it, addresses it, and continues arguing. Brave Search grounds the agents on live sources when the topic demands real facts. Once the debate ends, a post-debate analysis breaks down both sides. Built for anyone with dead time and a question worth hearing from both sides.
How I built it
The core architectural decision was treating Nova Sonic as a conversational agent, not a text-to-speech engine.
The first version followed the standard pattern: Nova Pro generates text, then Nova Sonic reads it aloud. That gave me ~18 second latency per turn and robotic, essay-like speech. Two robots reading essays at each other. I threw that pipeline out.
Each agent is now a Nova Sonic instance running on its own bidirectional WebSocket. It receives the opponent's last argument and generates its own rebuttal, text and audio together, with natural prosody and conviction. The orchestrator opens a fresh Nova Sonic session each turn, streams the opponent's last argument as the user prompt, and waits for completionEnd to signal the model is done. The previous turn's textOutput feeds directly into the next agent's input: no separate LLM call, no transcript reconciliation. The WebSocket closes after each turn and reopens for the next, keeping sessions stateless and independent. Time-to-first-audio dropped from 18 seconds to under one.
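The turn loop reduces to a small piece of orchestration. Here is a minimal sketch with the Nova Sonic session swapped out for a pluggable async function; the names `TakeTurn` and `runDebate` are illustrative, not from the real codebase:

```typescript
// One debate turn: a persona plus the opponent's last argument in,
// this agent's rebuttal text out. In the real system this wraps a
// fresh bidirectional Nova Sonic session per call.
type TakeTurn = (persona: string, opponentArg: string) => Promise<string>;

async function runDebate(
  personas: [string, string],
  openingTopic: string,
  turns: number,
  takeTurn: TakeTurn,
): Promise<string[]> {
  const transcript: string[] = [];
  let lastArgument = openingTopic; // the first agent argues the raw topic
  for (let i = 0; i < turns; i++) {
    const persona = personas[i % 2]; // agents alternate
    // Each turn is stateless: the opponent's last textOutput is the
    // entire user prompt, so no transcript reconciliation is needed.
    const rebuttal = await takeTurn(persona, lastArgument);
    transcript.push(rebuttal);
    lastArgument = rebuttal; // feeds the next agent directly
  }
  return transcript;
}
```

Keeping each turn stateless also makes moderator injection trivial: an injected question can simply be prepended to `lastArgument` before the next turn.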
Three Nova models work in concert:
- Nova Lite — classifies the topic, assigns debate format (structured debate, podcast, explainer)
- Nova Sonic — two instances, each holding an opposing position, streaming audio and text via bidirectional WebSocket
- Nova Pro — synthesizes a structured post-debate analysis once the session ends
Client-side audio playback uses the Web Audio API with AudioContext and precise hardware-clock scheduling to play ~100ms WAV chunks with zero gap. All decodeAudioData calls are serialized through a promise chain to prevent race conditions on the shared playback timestamp.
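The scheduling idea can be sketched as a small class. `decode`, `now`, and `play` are injected so the logic is shown outside a browser; in the app they would correspond to `AudioContext.decodeAudioData`, `AudioContext.currentTime`, and starting an `AudioBufferSourceNode`. The class name and structure are illustrative:

```typescript
// Serializes chunk decodes through one promise chain so reads and
// writes of the shared playback timestamp can never interleave.
class ChunkScheduler {
  private chain: Promise<void> = Promise.resolve();
  private nextStartTime = 0;

  constructor(
    private decode: (chunk: ArrayBuffer) => Promise<{ duration: number }>,
    private now: () => number,
    private play: (buf: { duration: number }, at: number) => void,
  ) {}

  enqueue(chunk: ArrayBuffer): Promise<void> {
    this.chain = this.chain.then(async () => {
      const buf = await this.decode(chunk);
      // Never schedule in the past: snap forward to the clock if behind.
      const start = Math.max(this.nextStartTime, this.now());
      this.play(buf, start);
      this.nextStartTime = start + buf.duration; // back-to-back, zero gap
    });
    return this.chain;
  }
}
```

Because every chunk's start time is the previous chunk's end time on the hardware clock, playback stays gapless even when individual decodes finish out of order.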
Challenges I ran into
The silence problem. Nova Sonic needs a minimum amount of audio input to initialize its processing pipeline. Send too little and the model never responds — no error, just silence. I found the threshold (500ms of pre-text silence) empirically, because it isn't documented.
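The workaround is mechanically simple: prepend a buffer of silent PCM before the text prompt. A sketch, assuming 16 kHz 16-bit mono input (a common speech-input format; the exact rate Sonic expects is an assumption here, not something this number was derived from):

```typescript
// Generate N milliseconds of digital silence as 16-bit PCM samples.
// Int16Array zero-fills on construction, and all-zero samples are silence.
function makeSilence(ms: number, sampleRate = 16000): Int16Array {
  const samples = Math.round((ms / 1000) * sampleRate);
  return new Int16Array(samples);
}

// 500ms was the empirically found threshold below which Sonic
// never responds at all.
const preTextSilence = makeSilence(500);
```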
Event ordering is unforgiving. The bidirectional streaming protocol requires an exact event sequence:
sessionStart -> promptStart -> SYSTEM TEXT -> USER TEXT -> AUDIO -> promptEnd -> sessionEnd
One field out of place, like a missing toolUseOutputConfiguration in promptStart or a wrong interactive flag, and you get silence with no error. I cross-referenced three AWS reference implementations to discover my content blocks were nesting inside each other instead of opening and closing sequentially. The model silently choked on the interleaving.
Two agents, one voice. Both agents sounded identical despite different voiceId values. voiceId only governs output when Sonic is generating as itself. When it's reading pre-written text via SYSTEM_SPEECH, it uses the default voice regardless. Once I gave each agent its own session with its own persona and let Sonic generate its own words, they found distinct voices.
Browser audio stuttering. Streaming WAV chunks through HTMLAudioElement produced audible gaps because each element has its own decoder startup overhead. Replacing it with the Web Audio API's AudioContext eliminated the gaps, but then concurrent decodeAudioData calls raced on the shared playback timestamp, causing overlapping or silent chunks. Serializing decodes through a promise chain fixed it.
Safety false positives. Agent B occasionally refuses to engage, claiming it can't "mimic or impersonate another voice." Nova's safety layer misinterprets the opponent's quoted argument as a request to roleplay rather than counter-argue. This is an open problem — prompt reframing from "respond to what they said" to "counter-argue this position" reduces frequency but hasn't fully resolved it.
Accomplishments that I'm proud of
Sub-second time-to-first-audio on a pipeline that started at 18 seconds. The agents sound like they believe what they're saying — contractions, emphasis, pacing, conviction — not like text readers. Moderator injection works mid-debate without breaking flow. The entire Sonic-as-agent architecture is gated behind a single environment variable, so I can instantly revert to the old pipeline if something breaks. Gap-free audio playback on a live stream of 100ms chunks in the browser.
What I learned
Nova Sonic is not a TTS model. It's a mind with a voice. The moment I stopped treating it as a text reader and gave it a persona, a position, and an opponent, the output transformed. The latency improvement was a side effect, the real gain was that the agents stopped performing and started arguing.
Bidirectional streaming over WebSockets is a coordination problem more than a coding problem. Session management, event sequencing, audio chunking, client-side playback scheduling: each piece is manageable alone. Stitching them together without latency or race conditions killing the experience is where the engineering lives.
What's next for PocketPanel
The immediate unlock is custom voices, letting users debate through agents that sound like people they actually want to listen to. Beyond that: domain-specific agents fine-tuned for law, medicine, finance. Multi-language debates. Audience voting. The architecture is already multi-agent — scaling is an orchestration problem, not a rewrite.
Built With
- amazon-bedrock
- amazon-nova-lite
- amazon-nova-pro
- amazon-nova-sonic
- brave-search-api
- next.js
- node.js
- railway
- tailwind-css
- typescript
- web-audio-api
- websocket