Inspiration

Over 40 million people worldwide are blind, and millions more have limited or no hand mobility due to injury, disease, or congenital conditions. These individuals face a fundamental dilemma: the web, the gateway to information, communication, and opportunity, remains out of reach with the traditional input methods of keyboard and mouse. Teresa, inspired by Mother Teresa's dedication to serving those in need, approaches web access differently: through voice. We believe web browsing shouldn't require sight or hands; it should be universal.

What it Does

Teresa is a voice-controlled browser assistant that brings full web autonomy to individuals with visual or physical impairments. Users simply speak their requests over the phone, and Teresa handles everything, from asking clarifying questions to executing complex multi-step tasks in a live browser session.

Core capabilities include:

  • Intelligent task classification - Automatically determines whether requests need simple search or complex browser interaction
  • Interactive clarification - Asks follow-up questions to fully understand user intent before executing
  • Live browser automation - Performs real-time web actions like filling forms, clicking buttons, navigating pages, and extracting information
  • Conversational results - Delivers summaries through natural voice responses instead of visual outputs
  • Hands-free, eyes-free operation - Complete web browsing freedom without traditional input methods

Example: A user says "Find the best restaurants near Harvard Square." Teresa clarifies preferences (price range, cuisine), searches relevant sites, filters results, and delivers a curated spoken summary—all while the user never touches a device.

How We Built It

  • Voice Interface: Twilio Media Streams + FastAPI WebSocket server + Google Chrome Window
  • Speech Processing: OpenAI Realtime API (Whisper for transcription, GPT-4o for clarification, TTS-1 for synthesis)
  • Intelligence Layer: GPT-4o-mini for query classification and response simplification, Perplexity Sonar API for web search
  • Browser Automation: Custom parallel execution engine built on browser-use library with Playwright
  • AI Reasoning: Google Gemini 2.5 Flash for browser agent decision-making
  • Chrome Integration: Chrome DevTools Protocol (CDP) on port 9222 to connect to logged-in sessions
  • Deployment: Python backend running uvicorn twilio:app, exposed through an ngrok tunnel for the Twilio webhook

Teresa's twilio.py orchestrates a three-stage pipeline. Phone audio arrives from Twilio as 8kHz μ-law and is converted to 24kHz PCM16 for OpenAI Whisper transcription; GPT-4o-mini then cleans each transcription and classifies the task as either browser automation or web search. For browser tasks, browser_tasks.py deploys Gemini 2.5 Flash-powered browser-use agents with custom reflection tools that connect to Chrome via CDP (preserving logged-in sessions), automatically parallelizing complex queries into 2-3 concurrent agents. Search queries route through Perplexity Sonar, with GPT-4o-mini condensing responses to under 50 words. Throughout processing, a procedurally generated 440Hz thinking tone loops, stopping only when OpenAI's TTS-1 converts the final response to speech and streams it back as 8kHz μ-law in 20ms chunks.
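
To make the routing stage concrete, here is a minimal sketch of how GPT-4o-mini could classify a cleaned transcription; the prompt wording and the classify_task helper are illustrative assumptions, not Teresa's actual source:

    # Hypothetical sketch of the classification step in twilio.py: GPT-4o-mini
    # labels each cleaned transcription so it can be routed to browser
    # automation or web search.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    CLASSIFY_PROMPT = (
        "Classify the user's request as 'browser' if it requires interacting "
        "with web pages (forms, clicks, logins) or 'search' if a web search "
        "suffices. Reply with exactly one word."
    )

    def classify_task(transcription: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": CLASSIFY_PROMPT},
                {"role": "user", "content": transcription},
            ],
        )
        label = resp.choices[0].message.content.strip().lower()
        return label if label in ("browser", "search") else "search"  # safe default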

Challenges We Ran Into

Authenticated Session Persistence: Most browser automation frameworks launch isolated guest sessions without login credentials, making authenticated tasks like "check my Gmail" impossible without users providing passwords verbally. We configured Chrome with --remote-debugging-port=9222 to expose a Chrome DevTools Protocol endpoint, allowing Playwright agents to connect to an already-running browser with preserved authentication at ws://localhost:9222. After we manually signed into key services once, all cookies and sessions persisted across automation tasks, and separate browser contexts prevented parallel agents from corrupting shared authentication.
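
A minimal sketch of that connection pattern, assuming Playwright's async Python API and a Chrome instance already started with --remote-debugging-port=9222 (the Gmail URL is just an example):

    # Attach to an already-running Chrome instead of launching a fresh guest
    # profile, so the user's logged-in sessions are preserved.
    import asyncio
    from playwright.async_api import async_playwright

    async def run_in_logged_in_chrome() -> None:
        async with async_playwright() as p:
            browser = await p.chromium.connect_over_cdp("http://localhost:9222")
            context = browser.contexts[0]  # default context carries the user's cookies
            page = await context.new_page()
            await page.goto("https://mail.google.com")
            print(await page.title())

    asyncio.run(run_in_logged_in_chrome())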

Real-Time Response Speed for Voice Interactions: Browser automation inherently takes time, as agents analyze pages, reason about actions, and wait for loads, creating awkward 90-120 second silences during phone calls that kill conversation flow. We experimented with step limits, switched from GPT-4o to Gemini 2.5 Flash (3-4x faster per benchmarks), and implemented intelligent parallelization that splits queries like "compare prices" into 2-3 concurrent agents using asyncio.gather(), as sketched below. Together, these changes significantly reduced response times: complex queries that previously processed sequentially now run their agents simultaneously.
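
A simplified sketch of that parallelization, where split_query and run_agent are hypothetical stand-ins for the helpers in browser_tasks.py:

    import asyncio

    def split_query(query: str) -> list[str]:
        # Toy splitter: "compare prices on Amazon and eBay" -> one sub-query per site.
        return [part.strip() for part in query.split(" and ")]

    async def run_agent(sub_query: str) -> str:
        await asyncio.sleep(1)  # placeholder for driving one browser-use agent
        return f"result for: {sub_query}"

    async def answer(query: str) -> str:
        sub_queries = split_query(query)[:3]  # cap at 3 concurrent agents
        # gather() runs the agents simultaneously, so total latency is roughly
        # the slowest agent rather than the sum of all of them.
        results = await asyncio.gather(*(run_agent(q) for q in sub_queries))
        return "\n".join(results)

    print(asyncio.run(answer("compare prices on Amazon and eBay")))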

Bridging Incompatible Audio Sampling Standards: Teresa bridges incompatible audio formats, converting Twilio's 8kHz μ-law to OpenAI's 24kHz PCM16 and back again, requiring precise resampling chains that initially produced garbled audio and low transcription accuracy. We built a stateful conversion pipeline using Python's audioop module, preserving the rate_state variable between chunks to maintain phase coherence and prevent audible clicks from independent resampling. Processing audio in precisely timed 20ms chunks with asyncio.sleep(0.02) prevented buffer overruns, while a procedurally generated 440Hz "thinking tone" eliminated awkward silence during processing delays.
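
A condensed sketch of the uplink half of that pipeline (Twilio to Whisper), showing rate_state threaded through audioop.ratecv; the UpLink class name is ours, for illustration:

    import audioop  # deprecated since Python 3.11, removed in 3.13

    class UpLink:
        """Converts Twilio's 8kHz μ-law chunks to 24kHz PCM16 for Whisper."""

        def __init__(self) -> None:
            self.rate_state = None  # resampler state carried across chunks

        def convert(self, mulaw_chunk: bytes) -> bytes:
            pcm8k = audioop.ulaw2lin(mulaw_chunk, 2)  # μ-law -> 16-bit linear PCM
            # Feeding rate_state back in keeps resampling phase-coherent, so
            # chunk boundaries don't produce audible clicks.
            pcm24k, self.rate_state = audioop.ratecv(
                pcm8k, 2, 1, 8000, 24000, self.rate_state
            )
            return pcm24k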

Managing Three Simultaneous Real-Time WebSocket Streams: Teresa coordinates three WebSocket connections—Twilio's async stream, OpenAI's synchronous thread-based client, and asyncio browser tasks—each operating on different threading models, causing audio desync and orphaned threads. We built a thread-safe bridge using asyncio.run_coroutine_threadsafe() and implemented an is_processing state machine that ignores new transcriptions while AI generates responses, calculating exact audio duration to block processing until TTS completes. For clean shutdown, a call_ended Event cancels in-flight browser tasks with 2-second timeouts, while a custom OpenAIRealtimeClient with dedicated sender threads prevents blocking the main event loop.
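
A small sketch of the thread-to-loop bridge, with handle_transcription and make_thread_callback as hypothetical names:

    import asyncio

    async def handle_transcription(text: str) -> None:
        print(f"processing: {text}")  # placeholder: classify, run agents, stream TTS

    def make_thread_callback(loop: asyncio.AbstractEventLoop):
        """Build a callback the OpenAI client's worker thread can invoke safely."""
        def on_transcription(text: str) -> None:
            # Schedules the coroutine on the asyncio loop that owns the Twilio
            # WebSocket without blocking either side; the returned
            # concurrent.futures.Future can be polled with .result() if needed.
            asyncio.run_coroutine_threadsafe(handle_transcription(text), loop)
        return on_transcription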

Accomplishments That We're Proud Of

  • Authenticated browser session integration - Implemented CDP to connect agents to users' existing logged-in Chrome profiles, eliminating password entry and dramatically expanding the range of automatable tasks (Gmail, Amazon, banking)
  • Production-ready voice-to-browser pipeline - Built a seamless system that bridges three simultaneous WebSocket connections (Twilio, OpenAI, browser automation) with complex audio format conversions while maintaining smooth transcription and conversation flow
  • True hands-free accessibility - Achieved completely voice-driven web browsing over a phone call, with intelligent task routing, parallel agent execution that reduces wait times for complex queries, and natural conversation flow through thinking tones
  • Real-time performance optimization - Optimized browser automation through step limit tuning, Gemini 2.5 Flash integration, and automatic parallelization of complex queries

What We Learned

Parallel agent orchestration - Discovered that running browser tasks simultaneously rather than sequentially dramatically improved the user experience, cutting wait times for multi-part queries.

Real-time audio format bridging - Learned that naive resampling between 8kHz μ-law and 24kHz PCM16 destroys phase coherence, causing transcription failures. Maintaining stateful conversion pipelines with preserved rate_state variables was critical for production-quality voice processing.

Event loop integration across threading models - Managing three WebSocket connections on different threading paradigms taught us that asyncio.run_coroutine_threadsafe() and explicit state machines are essential for preventing race conditions and orphaned processes in real-time systems.

What's Next for Teresa

  • Smoother conversations - Make dialogue feel more natural and conversational
  • One-click desktop application - Package as a standalone app with automatic Chrome integration, eliminating manual CDP configuration for non-technical users
  • Multi-language and accent support - Integrate language detection and region-specific TTS voices to serve non-English speakers and improve transcription accuracy across diverse accents
  • Partnership with accessibility organizations - Conduct user testing with blind and mobility-impaired individuals through partnerships with disability advocacy groups to validate real-world usability and gather feature priorities
