Inspiration
Web accessibility remains a major challenge. Over 1 billion people worldwide live with disabilities, yet most websites fail to meet basic accessibility standards. Traditional screen readers require extensive training and only work with properly tagged HTML. We wanted to build something that works on ANY website, regardless of how it's coded -- using AI vision to understand what's actually on screen.
What it does
VoxSight is a Chrome extension that lets you browse the web entirely by voice. It captures screenshots of the current page, sends them to Google's Gemini multimodal AI, and executes actions based on your spoken commands:
- "Describe this page" -- Get a spoken summary of page layout, content, and interactive elements
- "Click the search box" -- AI identifies the element and clicks it with visual highlighting
- "Type hello world" -- Types text into the focused input field
- "Scroll down" -- Scrolls the page
- "Find accessibility issues" -- AI analyzes contrast, missing alt text, and keyboard navigation problems
Every action is highlighted with a visual overlay so you can see exactly what's happening.
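In VoxSight the Gemini model chooses actions via function calling, but the command set above can be illustrated with a simple local parser. A minimal sketch; the action names and parsing rules here are illustrative assumptions, not VoxSight's actual code:

```typescript
// Hypothetical mapping from a recognized transcript to a page action.
// Action names and parsing rules are for illustration only.
type PageAction =
  | { kind: "describe" }
  | { kind: "click"; target: string }
  | { kind: "type"; text: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "audit" };

function parseCommand(transcript: string): PageAction | null {
  const t = transcript.trim().toLowerCase();
  if (t.startsWith("describe")) return { kind: "describe" };
  if (t.startsWith("click ")) return { kind: "click", target: t.slice(6) };
  // Preserve the original casing of the text to be typed.
  if (t.startsWith("type ")) return { kind: "type", text: transcript.trim().slice(5) };
  if (t === "scroll down") return { kind: "scroll", direction: "down" };
  if (t === "scroll up") return { kind: "scroll", direction: "up" };
  if (t.includes("accessibility")) return { kind: "audit" };
  return null;
}
```

In the real extension this decision is delegated to the model, which also resolves fuzzy targets like "the search box" against the screenshot.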
How we built it
Architecture:
- Chrome Extension (Manifest V3): Side panel UI with WebSocket client, voice I/O via Web Speech API, and screenshot capture pipeline
- Backend (Node.js on Cloud Run): WebSocket server that bridges the extension to Gemini's Live API with bidirectional streaming
- Gemini 2.5 Flash Native Audio: Real-time voice responses with function calling for page actions
Key technical decisions:
- Screenshot-based analysis (works on ANY website without DOM access)
- Gemini Live API for real-time bidirectional streaming (not request/response)
- Content script injection for precise action execution with coordinate mapping
- Native audio model for natural-sounding voice responses
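The extension and the Cloud Run bridge exchange JSON frames over the WebSocket. A minimal framing sketch, where the field names are assumptions for illustration, not the actual protocol:

```typescript
// Hypothetical message envelope between the side panel and the backend
// bridge. Field names are illustrative assumptions.
interface Frame {
  type: "audio" | "screenshot" | "action" | "transcript";
  payload: string; // base64 audio/image data, or JSON-encoded action
  ts: number;      // client timestamp, for ordering and latency metrics
}

function encodeFrame(frame: Frame): string {
  return JSON.stringify(frame);
}

function decodeFrame(raw: string): Frame {
  const obj = JSON.parse(raw);
  if (typeof obj.type !== "string" || typeof obj.payload !== "string") {
    throw new Error("malformed frame");
  }
  return obj as Frame;
}
```

Keeping the envelope this small lets the same channel carry screenshots upstream and streamed audio/actions downstream without separate connections.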
Challenges we ran into
- Gemini's native audio model sometimes hallucinates page content when screenshots have small text
- Coordinate mapping between screenshot pixel space and CSS pixels across different device pixel ratios
- Chrome's `captureVisibleTab` API cannot capture chrome:// or Web Store pages
- WebSocket connection management in Chrome's MV3 architecture (service workers die after 30s idle)
- Tuning post-action delays: too short and the verification screenshot captures the old state, too long and the UX feels sluggish
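The coordinate-mapping challenge above reduces to dividing screenshot pixels by the device pixel ratio before dispatching a click. A sketch, assuming the screenshot covers only the visible viewport (so scroll offset is already excluded); the function name is illustrative:

```typescript
// Convert a point in screenshot pixel space (as seen by the vision model)
// into CSS pixels for click dispatch. Assumes the screenshot captures only
// the visible viewport at the current devicePixelRatio.
function screenshotToCss(
  x: number,
  y: number,
  devicePixelRatio: number
): { x: number; y: number } {
  return { x: x / devicePixelRatio, y: y / devicePixelRatio };
}
```

On a 2x display, a model-reported point at (200, 100) in the screenshot corresponds to CSS point (100, 50), which the content script can then feed to `document.elementFromPoint` to find the click target.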
Accomplishments that we're proud of
- Works on ANY website with zero configuration
- Real-time voice interaction with streaming responses (not turn-based)
- Bilingual support (English and Chinese) with automatic language detection
- Full accessibility suite: high contrast mode, adjustable fonts, keyboard shortcuts
- Deployed to Google Cloud Run with production WebSocket support
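The automatic language detection mentioned above can be approximated by checking for CJK code points. A naive heuristic sketch for illustration, not VoxSight's actual detector:

```typescript
// Rough language routing for voice output: if a transcript contains
// characters in the CJK Unified Ideographs block, treat it as Chinese;
// otherwise default to English. A deliberately naive heuristic.
function detectLanguage(text: string): "zh" | "en" {
  return /[\u4e00-\u9fff]/.test(text) ? "zh" : "en";
}
```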
What we learned
- Gemini's Live API with native audio provides a much more natural interaction than text-to-speech
- Screenshot-based UI understanding is surprisingly effective for web navigation
- Building accessible tools requires thinking about accessibility at every layer, from HTML semantics to voice output
What's next for VoxSight
- Improve screenshot OCR accuracy for PDF and small-text content
- Add support for multi-step task automation ("Book me a flight to NYC")
- Integrate with Chrome's built-in accessibility tree for hybrid analysis
- Support more action types: drag-and-drop, file upload, multi-select
Built With
- chrome
- cloud-run
- esbuild
- gemini
- gemini-api
- node.js
- typescript
- web-speech-api
- websocket
