Inspiration
Growing up, I always loved playing games like Dungeons & Dragons, Secret Hitler, and Werewolf with friends and family. But there was always the same problem — nobody wanted to be the narrator. Someone had to sit out, manage the game, keep track of night actions, moderate discussions, and never make a mistake. The best player at the table had to sacrifice their fun so everyone else could play.
When I saw that this competition had a category for Live Voice Agents, it clicked immediately. The Gemini Live API offered something no other API could: real-time bidirectional voice streaming with low enough latency for a live social game. I thought it would be the perfect opportunity to showcase a solution to the narrator problem — not with text prompts or pre-recorded audio, but with an AI that actually speaks, listens to players argue in real time, and reacts dynamically to what's happening in the game.
Then I had the idea that made it truly interesting: what if the narrator wasn't just moderating — what if it was secretly controlling one of the characters? Can an AI participate in a social deduction game convincingly enough that human players can't tell which character is real?
What it does
Fireside: Betrayal is a browser-based, voice-first social deduction game for 3-8 players. An AI narrator runs the game in natural spoken voice, while one or two of the characters are secretly AI-controlled.
How a game plays:
- Share a 6-character join code. Everyone joins on their phone — no app download, no account creation.
- The host picks a narrator voice (Classic village elder, Campfire storyteller, Horror observer, or Comedy announcer) with live audio previews.
- Roles are dealt from a pool of 8: Villager, Seer, Healer, Hunter, Bodyguard, Tanner, Drunk, and Shapeshifter. The AI hides among the cast.
- Each night, the Shapeshifter hunts. The Seer investigates. The Healer protects. At dawn, the narrator describes what happened — in voice.
- During discussion, players hold a push-to-talk button to speak. The narrator hears each player, identifies them by character name, relays their arguments to the group, and stirs the pot. Mention an AI character by name and they respond automatically.
- Dead players aren't eliminated from the experience — they enter the Ghost Council, send one-word spectral clues, and get a dramatic Seance phase (45 seconds of push-to-talk testimony from beyond the grave) when half the village is dead.
- After the game, a post-game timeline reveals exactly what the AI was thinking every round — every lie, every deflection, every strategic calculation.
Unique mechanics:
- The Drunk role is disguised as Seer — the player sees a full investigation UI but gets wrong results. They don't learn the truth until the post-game reveal.
- On Normal/Hard difficulty, any character can be AI-controlled, and a human might draw the Shapeshifter role — performing kills through the game UI.
- An optional in-person camera mode uses Gemini Vision to count raised hands for physical gatherings.
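One way the disguised-Drunk result could be resolved server-side (a minimal sketch — the function name and the inversion strategy are illustrative, not the project's actual resolution code):

```python
def resolve_investigation(true_role: str, target_is_shapeshifter: bool) -> bool:
    """Sketch of the disguised-Drunk mechanic: both Seer and Drunk see the
    same investigation UI, but the Drunk's result is inverted server-side.
    Inversion is one illustrative choice; any systematically wrong answer
    would preserve the deception."""
    if true_role == "seer":
        return target_is_shapeshifter
    if true_role == "drunk":
        return not target_is_shapeshifter  # confidently wrong
    raise ValueError(f"{true_role} cannot investigate")
```

The key design point is that the client never learns the true role: the Drunk's phone renders the same UI the Seer's does, and only the post-game reveal exposes the trick.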
How we built it
The system runs 6 AI agents coordinated through a FastAPI backend on Cloud Run:
Narrator Agent (gemini-2.5-flash-native-audio-latest via Live API) — 1,646 lines of Python managing a persistent bidirectional voice WebSocket. Handles speaker identification from player mic audio, real-time argument relay, active moderation (stir/challenge/redirect), phase transitions via tool calls, and transcript buffering with 0.8s debounce. Four voice presets map to distinct Gemini voices (Gacrux, Sulafat, Enceladus, Zubenelgenubi).
AI Character Agent (gemini-3-flash-preview, text-only) — Stateless agent handling night target selection, day vote decisions, and dialog generation. Difficulty-calibrated prompts tune deception ability: Easy (catch rate ~70%), Normal (~50%), Hard (multi-round deception arcs).
Scene Agent (gemini-3.1-flash-image-preview) — Generates atmospheric scene backgrounds pre-cached during lobby for instant game start. Images crossfade at 18% opacity on phase transitions.
Camera Vote Agent (gemini-3-flash-preview vision) — Counts raised hands from the host's camera feed for in-person play.
Audio Recorder — Captures narrator audio highlights for the post-game reel.
Strategy Logger — Records every AI decision with reasoning for the post-game timeline reveal.
Backend architecture (9,000+ lines Python):
- Modular ws/ package split into 8 focused modules (connection, send queues, game lifecycle, message handlers, night resolution, phase timers, state machine, vote manager)
- Per-player send queues with sequence numbers and message replay for reliable delivery on reconnect
- Phase transition state machine with enforcement guards preventing race conditions
- Firestore transactions for atomic game state updates
- Rate limiting, CORS validation, and input sanitization
- Narrator watchdog that auto-restarts dead Gemini sessions
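The sequence-numbered send queue with reconnect replay can be sketched like this (class and method names are illustrative, not the actual module):

```python
from collections import deque


class PlayerSendQueue:
    """Per-player outbound queue: every message gets a monotonically
    increasing sequence number, and a bounded history is retained so a
    reconnecting client can request replay from its last-seen seq."""

    def __init__(self, history_limit: int = 256):
        self._seq = 0
        self._history = deque(maxlen=history_limit)

    def push(self, payload: dict) -> dict:
        # Stamp and remember the message before it goes on the wire.
        self._seq += 1
        msg = {"seq": self._seq, "payload": payload}
        self._history.append(msg)
        return msg

    def replay_since(self, last_seen_seq: int) -> list:
        # Everything the client missed while disconnected, in order.
        return [m for m in self._history if m["seq"] > last_seen_seq]
```

On reconnect, the client reports the highest `seq` it processed and the server replays the gap, so a phone that briefly suspended its WebSocket catches up without losing game state.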
Frontend (5,000+ lines React):
- Three-tab interface: Story | Investigation Journal | Vote Records
- Push-to-talk with auto-release after 30 seconds
- Phase transition animations (dark/dawn/judgment/dusk color overlays)
- 11-step interactive tutorial with spotlight overlay
- Visibility API reconnect for mobile browser tab suspension
Infrastructure:
- Terraform IaC for Cloud Run, Firestore, Artifact Registry
- Cloud Build CI/CD pipeline with multi-stage Docker (Node frontend build + Python backend)
- Single container serving both frontend static assets and backend API
Challenges we ran into
Gemini Live API turn-taking. The VAD (Voice Activity Detection) in Live API doesn't always detect end-of-speech cleanly, especially with short player utterances. We needed explicit end-of-speech signal handling and transcript debouncing (0.8s buffer) to prevent the narrator from interrupting players mid-sentence or missing short statements entirely.
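The transcript debounce works roughly like this (a minimal sketch; the production narrator also handles explicit end-of-speech signals, and the class name is illustrative):

```python
import asyncio


class TranscriptDebouncer:
    """Buffers partial transcript fragments and flushes them as one
    utterance after a quiet gap (0.8 s in our setup), so short
    statements are neither split mid-sentence nor dropped."""

    def __init__(self, on_flush, quiet_gap=0.8):
        self._on_flush = on_flush
        self._quiet_gap = quiet_gap
        self._parts = []
        self._timer = None

    def feed(self, fragment):
        # Each new fragment resets the quiet-gap timer.
        self._parts.append(fragment)
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._flush_later())

    async def _flush_later(self):
        await asyncio.sleep(self._quiet_gap)
        text = " ".join(self._parts)
        self._parts = []
        self._on_flush(text)
```

Only after the full quiet gap passes with no new fragments does the buffered utterance reach the narrator, which is what stops it from responding to half a sentence.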
Mobile WebSocket suspension. Mobile browsers aggressively suspend WebSocket connections when the tab loses focus — which happens constantly on phones when players switch to check a text message. We implemented Visibility API detection with automatic WebSocket reconnection and per-player message queues with sequence-based replay so no game state is lost during brief disconnections.
Cloud Run idle stream termination. Cloud Run's Envoy proxy kills idle HTTP/2 streams, which silently terminated our Gemini Live API sessions during quiet game moments (e.g., night phases where the narrator waits). We added application-level keepalive pings on both the player WebSocket and the Gemini session to prevent this.
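The keepalive itself is simple; the subtlety was realizing it had to run on both streams. A sketch (`send_ping` stands in for whatever writes a no-op frame, and the 20 s interval is illustrative):

```python
import asyncio


async def keepalive(send_ping, interval=20.0):
    """Fires an application-level ping on a fixed cadence so the proxy
    never classifies the stream as idle. Started as a task alongside
    each session and cancelled when the session closes."""
    while True:
        await asyncio.sleep(interval)
        await send_ping()
```

One such task runs per player WebSocket and one per Gemini Live session; cancelling the task on disconnect tears it down cleanly.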
Single-session voice mediation. Social deduction requires all players to hear each other, but Gemini Live API accepts one audio stream per session. We chose mediated voice: players speak to the narrator, who relays arguments to the group. This creates a unique gameplay dynamic — the narrator paraphrases, editorializes, and stirs conflict — but it's a fundamental trade-off versus direct player-to-player voice.
AI image file size unpredictability. Gemini's image generation model produces variable file sizes. A 2.3MB scene image would break our loading budget. We couldn't control output size through prompting alone — switching to a flat vector art style instruction brought sizes to a consistent range that loads fast on mobile.
Narrator session stability. The Gemini Live API session can silently die without a clean close frame, leaving the game narrator-less. We built a watchdog that monitors narrator health and auto-restarts dead sessions with context compression (feeding the session a summary of game events so far rather than replaying the full history).
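The watchdog's core logic can be sketched as a last-activity timer plus restart-with-summary (the class, the 15 s timeout, and the summary format are illustrative stand-ins for the real interfaces):

```python
import time


class NarratorWatchdog:
    """If no narrator activity is observed within `timeout`, declare the
    Live session dead and restart it with a compressed context summary
    instead of replaying the full event history."""

    def __init__(self, restart_fn, timeout=15.0):
        self._restart = restart_fn
        self._timeout = timeout
        self._last_activity = time.monotonic()

    def mark_activity(self):
        # Called whenever audio or a transcript arrives from the session.
        self._last_activity = time.monotonic()

    def check(self, game_events):
        if time.monotonic() - self._last_activity <= self._timeout:
            return False
        # Context compression: seed the new session with a summary of
        # recent events rather than the whole transcript.
        summary = "Game so far: " + "; ".join(game_events[-10:])
        self._restart(summary)
        self.mark_activity()
        return True
```

A background loop calls `check()` periodically; because the restart consumes a summary rather than raw history, the new session comes up fast enough that players mostly notice only a brief pause.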
Accomplishments that we're proud of
The AI actually deceives people. During playtesting, players couldn't reliably distinguish AI characters from human ones. The difficulty-calibrated prompts create meaningfully different experiences — Easy AI makes suspicious "mistakes" on purpose, Hard AI builds multi-round deception arcs.
The post-game timeline is addictive. Seeing "Night 2: Quietly redirected suspicion toward Rowan" next to your own memory of that round creates an emotional moment that makes people want to play again. One playtester said it was "the best part of the game."
Death isn't boring. The Ghost Council, haunt actions, spectral clues, and Seance phase keep eliminated players engaged. Dead AI characters even generate atmospheric ghost dialog (~30% chance per round).
Zero-friction setup. No download, no account, no OAuth. Share a 6-character code, join on your phone, and you're playing in under 60 seconds.
What we learned
Voice AI in multiplayer is a different beast than single-user. Most Gemini Live API demos are 1:1 conversations. Adding speaker identification, turn mediation for multiple players, and real-time game state awareness to a voice session required a completely different architecture than a simple chat agent.
Prompt engineering has limits for non-text outputs. We couldn't reliably control image file size or audio pacing through prompts alone. Architectural solutions (flat vector style for images, application-level timers for pacing, watchdog for session health) were necessary complements to prompt tuning.
Mobile web is hostile to real-time apps. Between WebSocket suspension, audio context restrictions, inconsistent push-to-talk behavior across browsers, and touch event quirks, making a voice-first game work reliably on mobile phones required as much engineering effort as the AI integration itself.
Social deduction games need mediated information flow. The narrator-as-relay architecture initially felt like a limitation, but it became a feature. The narrator paraphrasing, editorializing, and challenging player arguments creates a more theatrical experience than direct voice chat would. Players perform for the narrator, not just for each other.
The reveal is the product. The post-game timeline — showing what the AI was secretly thinking — generates more emotional response than the gameplay itself. Players who lose badly still have a great experience because discovering the AI's strategy is inherently satisfying. Design the ending first.
What's next
Direct voice interaction. Exploring architectures where players can speak directly to each other while the narrator still moderates — potentially using multiple Gemini sessions or an audio mixing layer.
Cross-game AI learning. The foundation exists for AI characters to learn from past games (strategy patterns that got caught, successful deception techniques). After 20+ games, the AI should adapt to your group's play style.
More roles and game modes. The role system is modular — adding new night actions and win conditions is straightforward. We're designing a Traitor mode (AI character that converts villagers) and a Diplomat role (immune to night kills but can't vote).
Spectator streaming. Let audiences watch live games with the narrator's voice and a spectator-safe view (no role reveals). Built for content creators and watch parties.
Persistent game history. Track your record across games — win rate, times fooled by AI, times you caught the AI, best arguments, most dramatic moments.
Built With
- api
- artifact-registry
- audioworklet
- cloud-build
- cloud-firestore
- docker
- fastapi
- gemini-flash
- gemini-live-api
- google-cloud-run
- google-genai-sdk
- javascript
- python
- react
- terraform
- vite
- websocket