Voice-less Voice Agent

Inspiration

Voice agents assume you can speak out loud. That fails in libraries, meetings, and anywhere you need to stay quiet. We built a voice-less voice agent: mouth a command to your webcam, and a browser agent executes it silently.

What it does

Hold the record button while mouthing a short phrase → our 250M-parameter lip-reading model decodes it → you review the text → a BrowserBase + Stagehand agent runs the command in a live embedded browser.

Example: mouth or whisper "search up images of cats" → the agent opens Google, searches for cats, and switches to the Images tab. No audio is required.

The main audience is people who like voice dictation tools and agents such as Wispr Flow but can't use them in public. Silent input gives them a private, fast way to talk to their computers.

How we built it

Lip-reading model (visual speech recognition, VSR)

  • Recorded ~1 hour of single-speaker training video
  • Used MediaPipe mouth segmentation to crop lip regions from each frame
  • Applied data augmentation (lighting, noise, etc.) for robustness across real-world conditions
  • Trained the model on an AWS A100 GPU for a couple of hours, reaching ~17.6% word error rate (WER) on held-out data
  • Deployed to a HuggingFace Space with GPU inference for low-latency decoding
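
For readers unfamiliar with the WER figure above, here is a minimal sketch of how word error rate is commonly computed: word-level edit distance divided by the number of reference words. The function name and the lowercase/whitespace tokenization are assumptions for illustration, not necessarily the evaluation script used for this project.

```typescript
// Word error rate = (substitutions + deletions + insertions) / reference word count.
// Tokenization (lowercase, split on whitespace) is an assumption for this sketch.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 0);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, decoding "search up images of cats" as "search images of cat" drops one word and substitutes another, giving 2 errors over 5 reference words. At ~17.6% WER, roughly one word in six needs correcting, which is why the transcript review step exists.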

App

  • Next.js UI: live browser (left), silent input / transcript / agent log (right)
  • Stagehand v3 on BrowserBase with Claude as the agent LLM (ANTHROPIC_API_KEY + STAGEHAND_MODEL)
  • Persistent cloud browser session with embedded live view; SSE streaming for agent logs
  • Built and iterated with Claude Code!
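
The SSE log stream mentioned above relies on a simple text framing: optional `event:` and `data:` lines, terminated by a blank line. The sketch below shows that framing for one agent log entry. The `AgentLogEvent` shape and the `formatSseEvent` helper are hypothetical; the actual app's payload may differ.

```typescript
// Hypothetical shape of one agent log entry; the real app's fields may differ.
interface AgentLogEvent {
  step: number;
  message: string;
}

// Serialize one log entry as a Server-Sent Events frame.
// JSON.stringify escapes newlines, so the payload always fits on one data line.
function formatSseEvent(eventName: string, payload: AgentLogEvent): string {
  const data = JSON.stringify(payload);
  return `event: ${eventName}\ndata: ${data}\n\n`;
}
```

A route serving these frames would respond with `Content-Type: text/event-stream`, and the browser side can subscribe with `EventSource` and `addEventListener("log", ...)`, parsing `event.data` back into an object.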

Challenges we ran into

  • Stagehand model/API key mismatches caused opaque failures; keeping the BrowserBase and Anthropic keys separate was essential
  • Keeping the BrowserBase session alive across commands required keepAlive and session reuse
  • VSR on CPU took 10–30s per clip; GPU deployment cut that to a few seconds
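
One way to turn the key mismatches described above into clear errors is to validate the environment before starting the agent. This is a minimal sketch, not the project's actual code. It assumes the BrowserBase credentials live in `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID` (names not confirmed by this write-up) and that a `STAGEHAND_MODEL` value containing "claude" or "anthropic" needs `ANTHROPIC_API_KEY`. `checkAgentEnv` is a hypothetical helper.

```typescript
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the
// configuration looks consistent. All checks are illustrative heuristics.
function checkAgentEnv(env: Env): string[] {
  const problems: string[] = [];
  const isSet = (name: string): boolean => (env[name] ?? "").trim().length > 0;

  // Browser infrastructure credentials (assumed variable names).
  for (const name of ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"]) {
    if (!isSet(name)) problems.push(`${name} is not set`);
  }

  // The agent LLM needs its own key, separate from the browser credentials.
  const model = (env["STAGEHAND_MODEL"] ?? "").trim();
  if (model.length === 0) {
    problems.push("STAGEHAND_MODEL is not set");
  } else if (/claude|anthropic/i.test(model) && !isSet("ANTHROPIC_API_KEY")) {
    problems.push(`STAGEHAND_MODEL "${model}" needs ANTHROPIC_API_KEY`);
  }

  // Catch the mix-up of pasting one key into both variables.
  if (
    isSet("BROWSERBASE_API_KEY") &&
    isSet("ANTHROPIC_API_KEY") &&
    env["BROWSERBASE_API_KEY"] === env["ANTHROPIC_API_KEY"]
  ) {
    problems.push("BROWSERBASE_API_KEY and ANTHROPIC_API_KEY hold the same value");
  }
  return problems;
}
```

Calling `checkAgentEnv(process.env)` at server startup and refusing to launch when the list is non-empty replaces an opaque runtime failure with a specific message.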

Accomplishments that we're proud of

  • Full pipeline: silent lips → text → real browser action in one demo
  • Custom fine-tuned VSR, not off-the-shelf speech-to-text
  • Claude-powered browser agent + Claude Code for rapid development
  • An embedded live browser view that judges can watch in real time
  • Polished UX: push-to-talk, fixed layout, auto-scrolling logs

What we learned

  • VSR and browser agents compose cleanly when each layer has a clear contract (video → text → instruction)
  • Human-in-the-loop transcript editing matters when lip-reading isn't perfect
  • Live iframe + session persistence matter as much as the models for a convincing demo
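
The layer contract and the human-in-the-loop step above can be sketched as two small types and one function. The type and function names here are hypothetical, chosen for illustration. The idea is that the lip-reading layer only produces a `Transcript`, the agent layer only consumes an `Instruction`, and the user's edit, when present, always wins over the decoded text.

```typescript
// Output of the lip-reading layer (hypothetical shape).
interface Transcript {
  text: string;
}

// Input to the browser-agent layer (hypothetical shape).
interface Instruction {
  instruction: string;
}

// Bridge between the two layers. If the user edited the transcript, the edit
// replaces the decoded text. Returns null when nothing usable remains, so the
// agent is never called with an empty command.
function toInstruction(decoded: Transcript, userEdit?: string): Instruction | null {
  const chosen = userEdit !== undefined ? userEdit : decoded.text;
  const cleaned = chosen.trim().replace(/\s+/g, " ");
  return cleaned.length > 0 ? { instruction: cleaned } : null;
}
```

Because the only thing crossing the boundary is a plain string, either side can be swapped out (a different decoder, a different agent) without touching the other.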

What's next for Voice-less Voice Agent

  • Hands-free mode — auto-detect lip motion to start/stop recording
  • Accessibility focus — silent control for users who can't or shouldn't speak aloud
  • Multi-step memory — chain commands in one browser session
  • Mobile — front-camera lip reading on the go
