VoiceAgent: Breaking the Scripting Wall
The Problem: Touch-Dependent Friction
Imagine you are 20 minutes into a recipe, hands caked in flour, and the video you are following just auto-played an ad. The "Skip Ad" button is right there on screen, but touching your laptop is out of the question.
This friction is everywhere:
- Kitchen & Home: Following a recipe or watching a tutorial with messy hands.
- Accessibility: Limited motor control makes a traditional mouse and keyboard setup painful or impossible to use.
- Multitasking: Washing dishes or working on a physical project while needing to interact with a screen.
Existing voice tools like Siri or Alexa only fire pre-scripted actions. They are "blind" to the context of your screen; they cannot find a "Skip Ad" button, scroll to "Step 4," or interact with any element that wasn't hardcoded in advance. We call this the Scripting Wall.
The Solution: VoiceAgent
VoiceAgent is a hands-free macOS assistant that lets you control your computer with natural speech. It doesn't just listen; it sees the screen and acts on it the same way a person sitting next to you would.
Key Capabilities
- Universal Control: Skip YouTube ads without touching your keyboard.
- Contextual Navigation: Scroll to "Step 4" on any website, regardless of API support.
- Web Automation: Fill out forms, click buttons, and navigate pages in Chrome.
- System Commands: Open apps and control system volume with simple voice prompts.
Technical Architecture
VoiceAgent is built as a layered pipeline where each stage hands off to the next. The core objective is to minimize total latency:
$$T_{\text{total}} = T_{\text{transcription}} + T_{\text{agent}} + T_{\text{action}}$$
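To make the latency budget concrete, here is a minimal sketch of how the three terms could be measured per command. The `LatencyBudget` class and the stage names are illustrative assumptions, not part of the VoiceAgent codebase, and the `time.sleep` calls are stand-ins for the real stages.

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Accumulates wall-clock time per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    @property
    def total(self):
        # T_total is the sum of the per-stage times.
        return sum(self.stages.values())


if __name__ == "__main__":
    budget = LatencyBudget()
    with budget.stage("transcription"):
        time.sleep(0.01)   # stand-in for speech-to-text
    with budget.stage("agent"):
        time.sleep(0.02)   # stand-in for the LLM loop
    with budget.stage("action"):
        time.sleep(0.005)  # stand-in for the click or keystroke
    for name, seconds in budget.stages.items():
        print(f"{name:>14}: {seconds * 1000:.1f} ms")
    print(f"{'total':>14}: {budget.total * 1000:.1f} ms")
```

Logging the breakdown per command shows which term dominates before any optimization work starts.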
1. Transcription
We use ElevenLabs Scribe v2 Realtime for streaming speech-to-text with built-in voice activity detection. It detects the end of speech and emits a finalized transcript in under a second.
2. The Agentic Loop
The transcript is processed by a browser-use Agent backed by Claude 3.5 Sonnet. The agent reads the live DOM, plans its actions (clicking, typing, or scrolling), verifies the result, and repeats until the task is complete.
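The read, plan, act, verify cycle described above can be sketched as a plain loop. This is a generic illustration and not the browser-use API: `run_agent_loop`, `Action`, and the `observe`, `plan`, and `act` callables are hypothetical names, and in the real system the planning step would be the Claude call. Verification happens implicitly here, because each iteration re-reads the page before deciding whether anything is left to do.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    kind: str    # e.g. "click", "type", "scroll"
    target: str  # hypothetical element label


def run_agent_loop(
    task: str,
    observe: Callable[[], dict],
    plan: Callable[[str, dict], Optional[Action]],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> int:
    """Repeat observe -> plan -> act until the planner returns None (done).

    Returns the number of actions taken; raises if the step budget runs out.
    """
    for step in range(max_steps):
        state = observe()            # read the current page state
        action = plan(task, state)   # a real agent would call the LLM here
        if action is None:           # planner judged the task complete
            return step
        act(action)                  # click / type / scroll
    raise RuntimeError(f"task not finished after {max_steps} steps")


if __name__ == "__main__":
    # Toy page: an ad is showing until "Skip Ad" is clicked.
    page = {"ad_visible": True}

    def observe():
        return dict(page)

    def plan(task, state):
        return Action("click", "Skip Ad") if state["ad_visible"] else None

    def act(action):
        page["ad_visible"] = False

    print(run_agent_loop("skip the ad", observe, plan, act))
```

The `max_steps` cap is a design choice for the sketch: it keeps a confused planner from looping forever on a page it cannot change.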
3. The Warm Persistent Server
Our most critical engineering decision was maintaining a warm backend. By spinning up a dedicated Python subprocess running a JSON server on the first command and keeping it alive afterwards, we pay the browser startup cost once instead of on every command:
$$T_{\text{cold}} = T_{\text{startup}} + T_{\text{task}} \quad \longrightarrow \quad T_{\text{warm}} = T_{\text{task}}$$
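A minimal sketch of the warm-worker pattern, assuming a line-delimited JSON protocol over the subprocess's stdin and stdout. The `WarmWorker` class, the `WORKER_SOURCE` stand-in, and the message fields are assumptions for illustration; the actual server would hold the open browser session rather than echo the task back.

```python
import asyncio
import json
import sys

# Tiny stand-in worker: one JSON request per line in, one JSON reply per line out.
WORKER_SOURCE = r"""
import json, os, sys
for line in sys.stdin:
    request = json.loads(line)
    reply = {"pid": os.getpid(), "echo": request["task"]}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class WarmWorker:
    """Starts the subprocess lazily and reuses it for every later command."""

    def __init__(self):
        self._proc = None
        self._lock = asyncio.Lock()  # one command in flight at a time

    async def run(self, task: str) -> dict:
        async with self._lock:
            if self._proc is None or self._proc.returncode is not None:
                # Pay the startup cost only on first use (or after a crash).
                self._proc = await asyncio.create_subprocess_exec(
                    sys.executable, "-u", "-c", WORKER_SOURCE,
                    stdin=asyncio.subprocess.PIPE,
                    stdout=asyncio.subprocess.PIPE,
                )
            self._proc.stdin.write((json.dumps({"task": task}) + "\n").encode())
            await self._proc.stdin.drain()
            line = await self._proc.stdout.readline()
            return json.loads(line)

    async def close(self):
        if self._proc is not None and self._proc.returncode is None:
            self._proc.stdin.close()  # EOF ends the worker's read loop
            await self._proc.wait()


async def demo():
    worker = WarmWorker()  # created inside the running event loop
    first = await worker.run("open youtube")
    second = await worker.run("skip ad")
    await worker.close()
    return first, second


if __name__ == "__main__":
    first, second = asyncio.run(demo())
    print(first["pid"] == second["pid"])  # same process served both commands
```

The lock serializes commands so two utterances arriving close together cannot interleave their requests and replies on the same pipe.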
4. Native Overlay
A lightweight PyObjC Cocoa window floats over all applications, providing real-time feedback (Listening, Thinking, Working, Done) and displaying live transcripts.
Challenges & Pivots
- The Cold-Start Wall: Early versions suffered 8–12 second startup times. Solving the subprocess lifecycle and async locking for the "warm server" was the most difficult engineering challenge.
- Killing the "Darlings": We originally hand-rolled a 600-line Claude planner with 18 custom macOS tools. Realizing it was the wrong abstraction was difficult, but deleting it in favor of the cleaner browser-use path was the right move.
- Utterance Segmentation: Tuning silence thresholds to distinguish between a mid-thought pause and a finished command required significant signal processing refinement.
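To illustrate the utterance segmentation trade-off, here is a generic energy-based endpointing sketch. It is not how Scribe's built-in voice activity detection works; the function names, the 20 ms frame size, the energy threshold, and the 700 ms end-of-speech window are all assumed values chosen for the example.

```python
import math
from typing import Iterable, List


def frame_rms(samples: List[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_utterance_end(
    frames: Iterable[List[float]],
    frame_ms: int = 20,
    energy_threshold: float = 0.02,
    end_silence_ms: int = 700,
) -> int:
    """Return the index of the frame where the utterance is considered finished,
    or -1 if the speaker never paused long enough.
    """
    needed = end_silence_ms // frame_ms  # consecutive quiet frames required
    quiet_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= energy_threshold:
            heard_speech = True
            quiet_run = 0                # any speech resets the silence counter
        elif heard_speech:               # leading silence is ignored
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return -1


if __name__ == "__main__":
    loud, quiet = [0.5] * 160, [0.0] * 160
    # speech, a 200 ms mid-thought pause, more speech, then a long silence
    frames = [loud] * 10 + [quiet] * 10 + [loud] * 10 + [quiet] * 40
    print(find_utterance_end(frames))
```

A shorter `end_silence_ms` makes commands fire sooner but cuts speakers off mid-thought; a longer one adds directly to the transcription term of the latency budget.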
What’s Next
- Fast-Path Routing: Bypassing the LLM for simple system commands (volume, scrolling) to achieve sub-300ms response times.
- Beyond the Browser: Utilizing the macOS Accessibility API and OCR to control native desktop apps like Finder or Slack.
- Contextual Memory: Implementing short-term history to allow for conversational commands like "do that again."
- Accessibility First: Pursuing VoiceAgent as a serious tool for the motor-impaired community.
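The fast-path routing item above could look something like the following sketch: a small table of patterns is checked first, and anything that does not match falls through to the LLM agent. The `route` function, the `FAST_PATHS` table, the handler names, and the specific phrasings are hypothetical and only show the shape of the idea.

```python
import re
from typing import Optional, Tuple

# Hypothetical fast-path table: compiled pattern -> handler name.
FAST_PATHS = [
    (re.compile(r"^(?:set )?volume (?:to )?(\d{1,3})(?: percent|%)?$"), "set_volume"),
    (re.compile(r"^scroll (up|down)$"), "scroll"),
    (re.compile(r"^(mute|unmute)$"), "mute"),
]


def route(transcript: str) -> Tuple[str, Optional[str]]:
    """Return (handler_name, argument). Unmatched commands go to the agent."""
    text = transcript.strip().lower().rstrip(".!?")
    for pattern, name in FAST_PATHS:
        match = pattern.match(text)
        if match:
            return name, match.group(1)
    return "agent", None  # fall back to the full LLM loop


if __name__ == "__main__":
    for phrase in ["Scroll down.", "Set volume to 40%", "Skip the ad"]:
        print(phrase, "->", route(phrase))
```

Because matching is a handful of regular expressions, the routing decision itself adds negligible time; the agent term of the latency budget is skipped entirely for commands that hit the table.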
Built With
- browser-use
- claude
- elevenlabs
- macos-accessibility-api
- pyautogui
- pyobjc
- python
- sounddevice