PlayEarOne
No controller? No problem.
A voice-controlled gaming platform where your voice is the controller. Shout "UP!" to move your paddle up. Yell "DOWN!" to go down. The louder you shout, the faster you move.
Inspiration
We wanted to make gaming more accessible and social. What if you didn't need a controller, keyboard, or even your hands to play? What if the entire room could become the game — friends shouting commands, competing with just their voices?
We were also inspired by party games and the chaos of couch co-op. Voice adds a layer of physicality and hilarity that buttons can't match.
What it does
PlayEarOne lets two players control a game using only their voices. The system:
- Recognizes who is speaking — Each player enrolls their voice, and the system learns to distinguish between them in real-time
- Transcribes commands locally: no cloud round trip, no network latency, and your audio never leaves the machine
- Maps voice to game input — Say "up" or "down" to control your paddle in Pong
- Responds to volume — Shout louder to move faster; whisper for precision
- Works in 1v1 mode — Uses discriminative speaker identification that focuses only on the features that make the two players' voices different
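The volume-to-speed idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual code: the function names `rms_level` and `paddle_speed`, and the `quiet`, `loud`, `min_speed`, and `max_speed` values, are hypothetical and chosen only for the example.

```python
import math
import struct


def rms_level(pcm_bytes: bytes) -> float:
    """Return the RMS level of 16-bit little-endian mono PCM, scaled to 0.0-1.0."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


def paddle_speed(level, quiet=0.02, loud=0.5, min_speed=2.0, max_speed=12.0):
    """Map an RMS level to a paddle speed, clamped at both ends.

    All four thresholds are hypothetical tuning values, not measured ones.
    """
    t = (level - quiet) / (loud - quiet)
    t = max(0.0, min(1.0, t))  # a whisper never goes below min_speed, a shout never above max_speed
    return min_speed + t * (max_speed - min_speed)
```

Clamping at both ends keeps a whisper usable for fine positioning and stops a very loud shout from throwing the paddle off screen.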
How we built it
Backend (Python/FastAPI):
- Vosk for 100% local speech-to-text transcription
- Resemblyzer for speaker embeddings (256-dimensional voice fingerprints)
- Discriminative 1v1 identification — Instead of comparing each voice to a threshold, we project onto the axis that separates the two players, focusing only on distinguishing features
- WebSocket streaming for real-time audio processing
- Parallel processing with ThreadPoolExecutor for simultaneous transcription and speaker ID
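One simple way to realize "project onto the axis that separates the two players" is to take the difference between the two players' average enrollment embeddings and classify by which side of the midpoint a new embedding falls on. The sketch below assumes that approach; whether PlayEarOne computes its axis exactly this way is not stated above, and the names `build_axis` and `identify` are hypothetical. It uses plain Python lists so it runs without extra packages; real embeddings would be the 256-dimensional vectors mentioned above.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_axis(enroll_a, enroll_b):
    """Return (unit axis pointing from player B to player A, midpoint of the centroids)."""
    centroid_a = mean_vector(enroll_a)
    centroid_b = mean_vector(enroll_b)
    diff = [a - b for a, b in zip(centroid_a, centroid_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("enrolled voices are identical; no separating axis")
    axis = [d / norm for d in diff]
    midpoint = [(a + b) / 2 for a, b in zip(centroid_a, centroid_b)]
    return axis, midpoint


def identify(embedding, axis, midpoint):
    """Return ("A" or "B", margin). Only the component along the axis matters."""
    score = sum((e - m) * x for e, m, x in zip(embedding, midpoint, axis))
    return ("A" if score >= 0 else "B"), abs(score)


# Toy 3-dimensional embeddings; the third component is shared by both players.
axis, midpoint = build_axis(
    [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]],
    [[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]],
)
speaker, margin = identify([0.7, 0.3, 0.9], axis, midpoint)
```

Because the axis has no component along features the two voices share, variation in those features (the third coordinate in the toy data) does not affect the decision. The margin can be used to reject chunks that sit too close to the midpoint.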
Frontend (JavaScript):
- ScriptProcessorNode for browser audio capture at 16kHz
- WebSocket client for streaming PCM audio to backend
- Canvas-based Pong game with voice-controlled paddles
- Volume meters and real-time command display
Audio Pipeline:
- Browser captures mic at 16kHz mono
- WebSocket streams 16-bit PCM to backend
- Buffer accumulates 0.7s chunks
- Parallel: Vosk transcribes + Resemblyzer identifies speaker
- Commands sent back to game via WebSocket
- Game simulates key presses with volume-based intensity
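The buffering and parallel steps of this pipeline can be sketched as follows. The chunk arithmetic follows from the numbers above (16 kHz, 16-bit mono, 0.7 s), but everything else is an assumption: `ChunkBuffer` and `process_chunk` are hypothetical names, and `transcribe` and `identify_speaker` are placeholders standing in for the Vosk and Resemblyzer calls, which are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_SECONDS = 0.7
CHUNK_BYTES = round(SAMPLE_RATE * CHUNK_SECONDS) * BYTES_PER_SAMPLE  # 22,400 bytes


class ChunkBuffer:
    """Accumulates streamed PCM bytes and hands back fixed-size chunks."""

    def __init__(self, chunk_bytes=CHUNK_BYTES):
        self.chunk_bytes = chunk_bytes
        self._pending = bytearray()

    def feed(self, data: bytes):
        """Append incoming bytes and return a list of complete chunks (possibly empty)."""
        self._pending.extend(data)
        chunks = []
        while len(self._pending) >= self.chunk_bytes:
            chunks.append(bytes(self._pending[: self.chunk_bytes]))
            del self._pending[: self.chunk_bytes]
        return chunks


def transcribe(chunk: bytes) -> str:
    """Placeholder for the speech-to-text step."""
    return "up"


def identify_speaker(chunk: bytes) -> str:
    """Placeholder for the speaker-identification step."""
    return "player1"


def process_chunk(executor, chunk: bytes) -> dict:
    """Run both steps concurrently and combine their results."""
    text_future = executor.submit(transcribe, chunk)
    speaker_future = executor.submit(identify_speaker, chunk)
    return {"speaker": speaker_future.result(), "command": text_future.result()}


buffer = ChunkBuffer()
with ThreadPoolExecutor(max_workers=2) as executor:
    results = [process_chunk(executor, c) for c in buffer.feed(b"\x00" * 25_000)]
```

Running the two steps side by side means the per-chunk delay is roughly that of the slower step instead of the sum of both.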
Challenges we ran into
- Short words are hard — "Up" is a 200ms utterance that gets lost in longer audio chunks. We tuned chunk sizes and added phonetic matching ("yup" → "up", "app" → "up")
- Speaker identification bias — Generic voices matched everything. We solved this with discriminative projection that only looks at features where the two players differ
- Latency vs accuracy tradeoff — Smaller chunks = faster response but worse transcription. Larger chunks = better accuracy but sluggish controls
- Browser audio APIs — AudioWorklet had issues; we reverted to ScriptProcessorNode for reliability
- No cloud allowed — We committed to 100% local processing, which meant no Whisper API, no Deepgram, just Vosk running on-device
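The phonetic matching mentioned in the first challenge can be approximated with a small alias table. A minimal sketch, assuming a simple word lookup: only the "yup" and "app" aliases come from our description above; the "done" and "town" entries and the `extract_command` name are hypothetical additions for illustration.

```python
ALIASES = {
    "up": "up",
    "yup": "up",     # mishearing named in the write-up
    "app": "up",     # mishearing named in the write-up
    "down": "down",
    "done": "down",  # hypothetical alias, for illustration only
    "town": "down",  # hypothetical alias, for illustration only
}


def extract_command(transcript: str):
    """Return the most recent recognized command in a transcript, or None."""
    for word in reversed(transcript.lower().split()):
        cleaned = word.strip(".,!?")
        if cleaned in ALIASES:
            return ALIASES[cleaned]
    return None
```

Scanning from the end of the transcript favors the latest command when a chunk contains more than one word, which suits a game where the newest input should win.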
Accomplishments that we're proud of
- Sub-500ms voice-to-action latency — Fast enough to feel responsive in a real-time game
- Discriminative speaker ID — Our 1v1 mode projects voices onto the axis separating two players, making identification robust even when voices are similar
- Zero cloud dependencies — Everything runs locally. Your voice never leaves your machine
- Volume-based intensity — Louder = faster movement adds a physical, competitive element
- It's actually fun — Watching two people shout at a Pong game is genuinely hilarious
What we learned
- Speaker identification is harder than speech recognition — similar voices, background noise, and short utterances all cause problems
- Discriminative approaches beat absolute thresholds — when you know there are exactly 2 options, compare them directly
- Audio processing is all about tradeoffs — latency, accuracy, and robustness are constantly in tension
- Browser audio APIs are a minefield — what works in theory often fails in practice
What's next for PlayEarOne
- More games — Boxing (jab! cross! dodge!), racing, rhythm games
- Adaptive enrollment — Continuously improve voice profiles during gameplay
- Spectator mode — Let the crowd influence the game with collective shouting
- Mobile support — Play from your phone, no app required
- Accessibility features — Voice gaming opens doors for players who can't use traditional controllers
- Tournament mode — Bracket-style competitions with voice-verified players
Built With
- javascript
- openrouter
- python
- resemblyzer
- vosk
- websocket
- websockets