Speaker Enrollment Tab: Enroll a new player's voice
Commands Tab: Test live commands to see real-time speaker parsing
Game Arcade: Access minigames
Minigame 1: volume-controlled Pong
Minigame 2: Boxing
Minigame 3: Headsoccer

PlayEarOne

No controller? No problem.

A voice-controlled gaming platform where your voice is the controller. Shout "UP!" to move your paddle up. Yell "DOWN!" to go down. The louder you shout, the faster you move.

Inspiration

We wanted to make gaming more accessible and social. What if you didn't need a controller, keyboard, or even your hands to play? What if the entire room could become the game — friends shouting commands, competing with just their voices?

We were also inspired by party games and the chaos of couch co-op. Voice adds a layer of physicality and hilarity that buttons can't match.

What it does

PlayEarOne lets two players control a game using only their voices. The system:

Recognizes who is speaking: each player enrolls their voice, and the system learns to distinguish between them in real time
Transcribes commands locally: no cloud service, no network round trip, and no audio leaving your machine
Maps voice to game input: say "up" or "down" to control your paddle in Pong
Responds to volume: shout louder to move faster; whisper for precision
Works in 1v1 mode: uses discriminative speaker identification that focuses only on the features that make the two players' voices different
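
As a rough illustration of the volume response, the sketch below computes the RMS loudness of a block of 16-bit mono PCM and maps it linearly onto a speed multiplier. It is not the project's code: the function names `rms_level` and `speed_from_volume` and the thresholds `quiet`, `loud`, `min_speed`, and `max_speed` are all hypothetical, chosen only to show the idea.

```python
import math
import struct


def rms_level(pcm: bytes) -> float:
    """Return the RMS loudness of 16-bit little-endian mono PCM, scaled to 0.0-1.0."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    mean_square = sum(s * s for s in samples) / n
    return math.sqrt(mean_square) / 32768.0


def speed_from_volume(level, quiet=0.02, loud=0.5, min_speed=1.0, max_speed=4.0):
    """Map loudness linearly onto a speed multiplier, clamped at both ends.

    All four thresholds are illustrative guesses, not measured values.
    """
    t = (level - quiet) / (loud - quiet)
    t = max(0.0, min(1.0, t))
    return min_speed + t * (max_speed - min_speed)
```

A linear map with clamping keeps whispers at the minimum speed and stops very loud shouts from producing unbounded movement; a real game would likely tune the curve per microphone.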

How we built it

Backend (Python/FastAPI):

Vosk for 100% local speech-to-text transcription
Resemblyzer for speaker embeddings (256-dimensional voice fingerprints)
Discriminative 1v1 identification — Instead of comparing each voice to a threshold, we project onto the axis that separates the two players, focusing only on distinguishing features
WebSocket streaming for real-time audio processing
Parallel processing with ThreadPoolExecutor for simultaneous transcription and speaker ID
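
The discriminative 1v1 idea can be sketched in a few lines. The example assumes each player has one enrolled embedding (for instance, an average of their Resemblyzer embeddings) and classifies a new embedding by which side of the midpoint it falls on along the axis between the two players. The names `build_axis` and `identify`, the `margin` parameter, and the use of plain Python lists instead of NumPy arrays are assumptions made to keep the sketch self-contained; the project's actual implementation may differ.

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def build_axis(emb_a, emb_b):
    """Return (unit axis, midpoint) separating two enrolled embeddings."""
    diff = [a - b for a, b in zip(emb_a, emb_b)]
    norm = math.sqrt(_dot(diff, diff))
    if norm == 0.0:
        raise ValueError("enrolled embeddings are identical")
    axis = [d / norm for d in diff]
    midpoint = [(a + b) / 2.0 for a, b in zip(emb_a, emb_b)]
    return axis, midpoint


def identify(embedding, axis, midpoint, margin=0.0):
    """Project onto the axis; the positive side is player A, the negative side B.

    Returns "A", "B", or None when the score falls inside the margin.
    """
    score = _dot([e - m for e, m in zip(embedding, midpoint)], axis)
    if score > margin:
        return "A"
    if score < -margin:
        return "B"
    return None
```

Because the axis is the difference of the two enrolled embeddings, any component the players share contributes nothing to the score, which is the sense in which only distinguishing features matter.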

Frontend (JavaScript):

ScriptProcessorNode for browser audio capture at 16kHz
WebSocket client for streaming PCM audio to backend
Canvas-based Pong game with voice-controlled paddles
Volume meters and real-time command display

Audio Pipeline:

Browser captures mic at 16kHz mono
WebSocket streams 16-bit PCM to backend
Buffer accumulates 0.7s chunks
Parallel: Vosk transcribes + Resemblyzer identifies speaker
Commands sent back to game via WebSocket
Game simulates key presses with volume-based intensity
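
The buffering and parallel steps of this pipeline might look roughly like the sketch below. It assumes 16 kHz, 16-bit mono PCM and 0.7 s chunks as described above; `ChunkBuffer`, `process_chunk`, and the `transcribe` and `identify_speaker` callables are hypothetical stand-ins for code that would wrap Vosk and Resemblyzer.

```python
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 16_000   # Hz, from the pipeline description
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 700         # 0.7 s chunks
# Integer arithmetic avoids float rounding: 11,200 samples = 22,400 bytes.
CHUNK_BYTES = SAMPLE_RATE * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


class ChunkBuffer:
    """Accumulate streamed PCM bytes and emit fixed-size chunks."""

    def __init__(self, chunk_bytes=CHUNK_BYTES):
        self.chunk_bytes = chunk_bytes
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Append data and return a list of every complete chunk now available."""
        self._buf.extend(data)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[: self.chunk_bytes]))
            del self._buf[: self.chunk_bytes]
        return chunks


def process_chunk(executor, chunk, transcribe, identify_speaker):
    """Run both stages concurrently on the same chunk and pair the results."""
    text_future = executor.submit(transcribe, chunk)
    speaker_future = executor.submit(identify_speaker, chunk)
    return speaker_future.result(), text_future.result()
```

A caller would feed each incoming WebSocket message to `ChunkBuffer.feed` and pass every returned chunk to `process_chunk` with a shared `ThreadPoolExecutor`, then send the resulting (speaker, text) pair back to the game.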

Challenges we ran into

Short words are hard — "Up" is a 200ms utterance that gets lost in longer audio chunks. We tuned chunk sizes and added phonetic matching ("yup" → "up", "app" → "up")
Speaker identification bias — Generic voices matched everything. We solved this with discriminative projection that only looks at features where the two players differ
Latency vs accuracy tradeoff — Smaller chunks = faster response but worse transcription. Larger chunks = better accuracy but sluggish controls
Browser audio APIs — AudioWorklet had issues; we reverted to ScriptProcessorNode for reliability
No cloud allowed — We committed to 100% local processing, which meant no Whisper API, no Deepgram, just Vosk running on-device
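
The phonetic matching mentioned above can be approximated with a small alias table applied to the transcript. This is a minimal sketch, not the project's code: `extract_command` is a hypothetical name, the alias table contains only the two mappings named in this write-up, and returning the last command in the transcript is an assumed design choice.

```python
COMMANDS = {"up", "down"}

# Misheard words mapped to the intended command. "yup" and "app" come from the
# write-up; further entries would need to be collected from real transcripts.
ALIASES = {"yup": "up", "app": "up"}


def extract_command(transcript: str):
    """Return the last recognized command in a transcript, or None."""
    found = None
    for word in transcript.lower().split():
        word = word.strip(".,!?\"'")
        word = ALIASES.get(word, word)
        if word in COMMANDS:
            found = word
    return found
```

Taking the last match favors the player's most recent intent when a chunk contains several words, at the cost of dropping earlier commands in the same chunk.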

Accomplishments that we're proud of

Sub-500ms voice-to-action latency — Fast enough to feel responsive in a real-time game
Discriminative speaker ID — Our 1v1 mode projects voices onto the axis separating two players, making identification robust even when voices are similar
Zero cloud dependencies — Everything runs locally. Your voice never leaves your machine
Volume-based intensity — Louder = faster movement adds a physical, competitive element
It's actually fun — Watching two people shout at a Pong game is genuinely hilarious

What we learned

Speaker identification is harder than speech recognition — similar voices, background noise, and short utterances all cause problems
Discriminative approaches beat absolute thresholds — when you know there are exactly 2 options, compare them directly
Audio processing is all about tradeoffs — latency, accuracy, and robustness are constantly in tension
Browser audio APIs are a minefield — what works in theory often fails in practice

What's next for PlayEarOne

More games — Boxing (jab! cross! dodge!), racing, rhythm games
Adaptive enrollment — Continuously improve voice profiles during gameplay
Spectator mode — Let the crowd influence the game with collective shouting
Mobile support — Play from your phone, no app required
Accessibility features — Voice gaming opens doors for players who can't use traditional controllers
Tournament mode — Bracket-style competitions with voice-verified players

Built With

javascript
openrouter
python
resemblyzer
vosk
websocket
websockets

Updates

Maya Pullara started this project — Feb 07, 2026 07:13 PM EST
