Inspiration
Web accessibility remains a major challenge. Over 1 billion people worldwide live with disabilities, yet most websites fail to meet basic accessibility standards. Traditional screen readers require extensive training and only work with properly tagged HTML. We wanted to build something that works on ANY website, regardless of how it's coded -- using AI vision to understand what's actually on screen.
What it does
VoxSight is a Chrome extension that lets you browse the web entirely by voice. It captures screenshots of the current page, sends them to Google's Gemini multimodal AI, and executes actions based on your spoken commands:
- "Describe this page" -- Get a spoken summary of page layout, content, and interactive elements
- "Click the search box" -- AI identifies the element and clicks it with visual highlighting
- "Type hello world" -- Types text into the focused input field
- "Scroll down" -- Scrolls the page
- "Find accessibility issues" -- AI analyzes contrast, missing alt text, and keyboard navigation problems
Every action is highlighted with a visual overlay so you can see exactly what's happening.
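In VoxSight the Gemini model chooses actions via function calling, but the command set above can be illustrated with a simple local parser. A minimal sketch; the action names and parsing rules here are illustrative assumptions, not VoxSight's actual code:

```typescript
// Hypothetical mapping from a recognized transcript to a page action.
// Action names and parsing rules are for illustration only.
type PageAction =
  | { kind: "describe" }
  | { kind: "click"; target: string }
  | { kind: "type"; text: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "audit" };

function parseCommand(transcript: string): PageAction | null {
  const t = transcript.trim().toLowerCase();
  if (t.startsWith("describe")) return { kind: "describe" };
  if (t.startsWith("click ")) return { kind: "click", target: t.slice(6) };
  // Preserve the original casing of the text to be typed.
  if (t.startsWith("type ")) return { kind: "type", text: transcript.trim().slice(5) };
  if (t === "scroll down") return { kind: "scroll", direction: "down" };
  if (t === "scroll up") return { kind: "scroll", direction: "up" };
  if (t.includes("accessibility")) return { kind: "audit" };
  return null;
}
```

In the real extension this decision is delegated to the model, which also resolves fuzzy targets like "the search box" against the screenshot.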
How we built it
Architecture:
- Chrome Extension (Manifest V3): Side panel UI with WebSocket client, voice I/O via Web Speech API, and screenshot capture pipeline
- Backend (Node.js on Cloud Run): WebSocket server that bridges the extension to Gemini's Live API with bidirectional streaming
- Gemini 2.5 Flash Native Audio: Real-time voice responses with function calling for page actions
Key technical decisions:
- Screenshot-based analysis (works on ANY website without DOM access)
- Gemini Live API for real-time bidirectional streaming (not request/response)
- Content script injection for precise action execution with coordinate mapping
- Native audio model for natural-sounding voice responses
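The extension and the Cloud Run bridge exchange JSON frames over the WebSocket. A minimal framing sketch, where the field names are assumptions for illustration, not the actual protocol:

```typescript
// Hypothetical message envelope between the side panel and the backend
// bridge. Field names are illustrative assumptions.
interface Frame {
  type: "audio" | "screenshot" | "action" | "transcript";
  payload: string; // base64 audio/image data, or JSON-encoded action
  ts: number;      // client timestamp, for ordering and latency metrics
}

function encodeFrame(frame: Frame): string {
  return JSON.stringify(frame);
}

function decodeFrame(raw: string): Frame {
  const obj = JSON.parse(raw);
  if (typeof obj.type !== "string" || typeof obj.payload !== "string") {
    throw new Error("malformed frame");
  }
  return obj as Frame;
}
```

Keeping the envelope this small lets the same channel carry screenshots upstream and streamed audio/actions downstream without separate connections.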
Challenges we ran into
- Gemini's native audio model sometimes hallucinates page content when screenshots have small text
- Coordinate mapping between screenshot pixel space and CSS pixels across different device pixel ratios
- Chrome's `captureVisibleTab` API cannot capture chrome:// or Web Store pages
- WebSocket connection management in Chrome's MV3 architecture (service workers die after 30s idle)
- Tuning post-action delays: too short and the verification screenshot captures the old state, too long and the UX feels sluggish
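The coordinate-mapping challenge above reduces to dividing screenshot pixels by the device pixel ratio before dispatching a click. A sketch, assuming the screenshot covers only the visible viewport (so scroll offset is already excluded); the function name is illustrative:

```typescript
// Convert a point in screenshot pixel space (as seen by the vision model)
// into CSS pixels for click dispatch. Assumes the screenshot captures only
// the visible viewport at the current devicePixelRatio.
function screenshotToCss(
  x: number,
  y: number,
  devicePixelRatio: number
): { x: number; y: number } {
  return { x: x / devicePixelRatio, y: y / devicePixelRatio };
}
```

On a 2x display, a model-reported point at (200, 100) in the screenshot corresponds to CSS point (100, 50), which the content script can then feed to `document.elementFromPoint` to find the click target.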
Accomplishments that we're proud of
- Works on ANY website with zero configuration
- Real-time voice interaction with streaming responses (not turn-based)
- Bilingual support (English and Chinese) with automatic language detection
- Full accessibility suite: high contrast mode, adjustable fonts, keyboard shortcuts
- Deployed to Google Cloud Run with production WebSocket support
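The automatic language detection mentioned above can be approximated by checking for CJK code points. A naive heuristic sketch for illustration, not VoxSight's actual detector:

```typescript
// Rough language routing for voice output: if a transcript contains
// characters in the CJK Unified Ideographs block, treat it as Chinese;
// otherwise default to English. A deliberately naive heuristic.
function detectLanguage(text: string): "zh" | "en" {
  return /[\u4e00-\u9fff]/.test(text) ? "zh" : "en";
}
```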
What we learned
- Gemini's Live API with native audio provides a much more natural interaction than text-to-speech
- Screenshot-based UI understanding is surprisingly effective for web navigation
- Building accessible tools requires thinking about accessibility at every layer, from HTML semantics to voice output
What's next for VoxSight
- Improve screenshot OCR accuracy for PDF and small-text content
- Add support for multi-step task automation ("Book me a flight to NYC")
- Integrate with Chrome's built-in accessibility tree for hybrid analysis
- Support more action types: drag-and-drop, file upload, multi-select
Built With
- chrome
- cloud-run
- esbuild
- gemini
- gemini-api
- node.js
- typescript
- web-speech-api
- websocket
