Voice-less Voice Agent

Inspiration

Voice agents assume you can speak out loud. That fails in libraries, meetings, and anywhere you need to stay quiet. We built a voice-less voice agent: mouth a command to your webcam, and a browser agent executes it silently.

What it does

Hold the record button while mouthing a short phrase → our 250M-parameter lip-reading model decodes it → you review the text → a BrowserBase + Stagehand agent runs the command in a live embedded browser.

Example: mouth or whisper "search up images of cats" → the agent opens Google, searches for cats, and switches to the Images tab. No audio is required.

The main use case is people who like voice dictation tools and agents such as Wispr Flow but cannot use them in public. This project gives them a private, silent, and fast way to communicate with their computers.

How we built it

Lip-reading model (visual speech recognition, VSR)

Recorded ~1 hour of single-speaker training video
Used MediaPipe mouth segmentation to crop lip regions from each frame
Applied data augmentation (lighting, noise, etc.) for robustness across real-world conditions
Trained the model on an AWS A100 GPU for a couple of hours, reaching ~17.6% word error rate (WER) on held-out data
Deployed to a HuggingFace Space with GPU inference for low-latency decoding
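
For readers unfamiliar with the metric, WER is the word-level edit distance between the decoded text and the reference transcript, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition, not the evaluation script behind the figure above; the function name and the example phrases are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words (starts with zero reference words).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("search up images of cats", "search up images of hats"))  # 0.2
```

Under this definition, a WER near 17.6% means roughly one word in six needs correcting, which is why the review step sits between decoding and the agent.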

App

Next.js UI: live browser (left), silent input / transcript / agent log (right)
Stagehand v3 on BrowserBase with Claude as the agent LLM (configured via ANTHROPIC_API_KEY and STAGEHAND_MODEL)
Persistent cloud browser session with embedded live view; SSE streaming for agent logs
Built and iterated with Claude Code!
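
The agent log streaming relies on the Server-Sent Events wire format, which is the same in any language: each message is one or more `data:` lines followed by a blank line. The sketch below shows that framing in Python for illustration only; the app's actual route lives in Next.js and is not shown here, and the `log` / `done` event names and the log entry fields are assumptions.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events message in text/event-stream format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be sent as multiple data: lines.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def stream_agent_logs(entries: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE message per agent log entry, then a final 'done' event."""
    for entry in entries:
        yield format_sse(json.dumps(entry), event="log")
    yield format_sse("{}", event="done")


# Hypothetical log entry, printed exactly as it would appear on the wire.
for chunk in stream_agent_logs([{"level": "info", "message": "Opening google.com"}]):
    print(chunk, end="")
```

Serializing each entry as single-line JSON keeps every log message in one `data:` line, so the browser side can parse it without reassembling fragments.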

Challenges we ran into

Stagehand model/API key mismatches caused opaque failures; keeping the BrowserBase and Anthropic keys separate was essential
Keeping the BrowserBase session alive across commands required keepAlive and session reuse
VSR on CPU took 10–30 s per clip; GPU deployment cut that to a few seconds
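
One way to surface key and model mismatches early is a startup check that fails with a readable message before the agent runs. The following is a rough sketch of that idea, not the project's code. `ANTHROPIC_API_KEY` and `STAGEHAND_MODEL` come from the writeup; the two `BROWSERBASE_*` variable names, the "model name should mention Claude" rule, and the function name are assumptions made for illustration.

```python
import os
from typing import List, Mapping

# ANTHROPIC_API_KEY and STAGEHAND_MODEL are named in the writeup; the two
# BROWSERBASE_* names are assumed here for illustration.
REQUIRED_VARS = (
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
    "ANTHROPIC_API_KEY",
    "STAGEHAND_MODEL",
)


def find_config_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable configuration problems (empty list means OK)."""
    problems = [
        f"{name} is not set"
        for name in REQUIRED_VARS
        if not env.get(name, "").strip()
    ]

    model = env.get("STAGEHAND_MODEL", "").strip()
    # Assumed rule: the agent LLM is Claude, so the model name should say so.
    if model and "claude" not in model.lower():
        problems.append(f"STAGEHAND_MODEL={model!r} does not look like a Claude model")

    # Assumed check: the same secret pasted into both key slots is a mix-up.
    bb_key = env.get("BROWSERBASE_API_KEY", "").strip()
    if bb_key and bb_key == env.get("ANTHROPIC_API_KEY", "").strip():
        problems.append("BROWSERBASE_API_KEY and ANTHROPIC_API_KEY hold the same value")
    return problems


if __name__ == "__main__":
    for problem in find_config_problems(os.environ):
        print(f"config error: {problem}")
```

Returning a list rather than raising on the first problem lets the app report every misconfiguration in one pass.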

Accomplishments that we're proud of

Full pipeline: silent lips → text → real browser action in one demo
Custom fine-tuned VSR, not off-the-shelf speech-to-text
Claude-powered browser agent + Claude Code for rapid development
Embedded live browser view that judges can watch in real time
Polished UX: push-to-talk, fixed layout, auto-scrolling logs

What we learned

VSR and browser agents compose cleanly when each layer has a clear contract (video → text → instruction)
Human-in-the-loop transcript editing matters when lip-reading isn't perfect
Live iframe + session persistence matter as much as the models for a convincing demo
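
The video → text → instruction contract, with the human review step in the middle, can be sketched as three injected functions. This is a minimal illustration under stated assumptions: the type names, `run_command`, and the stand-in layers are hypothetical and do not reflect the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transcript:
    text: str  # raw output of the lip-reading model


@dataclass(frozen=True)
class Instruction:
    text: str  # what the user approved; the only thing the agent sees


def run_command(
    clip: bytes,
    decode: Callable[[bytes], Transcript],
    review: Callable[[Transcript], str],
    act: Callable[[Instruction], List[str]],
) -> List[str]:
    """video -> text -> instruction, with the user editing the text in between."""
    transcript = decode(clip)
    approved = review(transcript).strip()
    if not approved:
        return []  # the user rejected the transcript, so the agent never runs
    return act(Instruction(text=approved))


# Stand-ins for the real layers, for demonstration only.
def fake_decode(clip: bytes) -> Transcript:
    return Transcript(text="search up images of hats")


def fix_typo(transcript: Transcript) -> str:
    return transcript.text.replace("hats", "cats")


def fake_act(instruction: Instruction) -> List[str]:
    return [f"agent received: {instruction.text}"]


print(run_command(b"", fake_decode, fix_typo, fake_act))
# ['agent received: search up images of cats']
```

Because the agent only ever receives an approved `Instruction`, a misread word is corrected before it can turn into a browser action.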

What's next for Voice-less Voice Agent

Hands-free mode — auto-detect lip motion to start/stop recording
Accessibility focus — silent control for users who can't or shouldn't speak aloud
Multi-step memory — chain commands in one browser session
Mobile — front-camera lip reading on the go