Inspiration

Millions of blind and low-vision users struggle to navigate websites that were never built with them in mind. Existing screen readers announce raw page structure ("button", "link", "heading") without any understanding of context or meaning. We wanted to build something that actually understands a webpage the way a person does and communicates it naturally back to the user.

What it does

VoiceNav is a Chrome extension that acts as a browser agent, helping blind users and people with other disabilities navigate any webpage using their voice. When a page loads, VoiceNav automatically describes what's on it in natural language, which is far more useful than what traditional screen readers currently provide. Users can then speak a command like "open the article" or "go to my assignments", and VoiceNav figures out what they mean and executes the action directly on the page.
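
For illustration, each spoken command ultimately resolves into a small structured action that the extension can execute against a CSS selector. The field names below are a hypothetical sketch, not our exact schema:

```python
# Hypothetical shape of the action produced for a voice command; field names
# are illustrative, not VoiceNav's exact schema.
action = {
    "command": "open the article",         # what the user said
    "intent": "activate",                  # what the extension should do on the page
    "selector": "article h2 a",            # CSS selector preserved by the extractor
    "speak": "Opening the main article.",  # confirmation read back through ElevenLabs
}
```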

How we built it

We built VoiceNav from three main components: a Chrome extension that extracts a clean representation of any webpage, stripping noise while preserving meaningful structure, interactive elements, and CSS selectors; a FastAPI backend with routes for page description, voice command processing, and element readouts; and an LLM deployed on AMD Dev Cloud via vLLM, which handles all inference. Voice input uses the Google Web Speech API, and output uses the ElevenLabs API to deliver human-like responses back to the user. We used the Qwen/Qwen3-VL-30B-A22B-Instruct model for the extension and the demo video; we found that OpenGVLab/InternVL3-38B works better overall but has slower response times.
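
To make the flow concrete, here is a minimal sketch of what the page-description route could look like, assuming the OpenAI-compatible endpoint that vLLM serves. The route name, request model, and host are placeholders rather than our exact code:

```python
# Minimal sketch of a page-description route; names and host are assumptions,
# not VoiceNav's exact implementation.
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()

# vLLM exposes an OpenAI-compatible server, so the standard client can talk to it.
# Replace localhost with the AMD Dev Cloud host serving the model.
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

class Page(BaseModel):
    url: str
    condensed_page: str  # output of the extension's extraction pipeline

@app.post("/describe")
def describe(page: Page) -> dict:
    resp = llm.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A22B-Instruct",
        messages=[
            {"role": "system",
             "content": "Describe this web page in plain spoken language for someone who cannot see it."},
            {"role": "user", "content": page.condensed_page},
        ],
        temperature=0.3,
    )
    return {"description": resp.choices[0].message.content}
```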

Challenges we ran into

A major challenge was getting the agent to reliably interact with any webpage using only voice prompts. We also had to continuously refine our LLM prompts to produce consistent, natural-sounding descriptions that made sense when read aloud to someone who cannot see the screen; getting the model to stop using visual words like "click" and "button" was another problem. A further challenge was extracting and condensing webpage HTML into a format that was organized and easy for the LLM to parse, which led us to build a pipeline that reduced pages down to only what mattered while preserving the selectors needed to execute actions.
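
The condensation idea itself is simple, even though the real pipeline lives in the extension's content script. Below is a rough Python sketch of the same idea using BeautifulSoup; the tag list and selector fallback are assumptions rather than our exact rules:

```python
# Rough sketch of the condensation idea (the real pipeline runs in the extension's
# content script; these tags and the selector fallback are illustrative assumptions).
from bs4 import BeautifulSoup

KEEP = ["a", "button", "input", "select", "textarea", "h1", "h2", "h3", "p", "li"]

def condense(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that add bytes but carry no meaning for a spoken description.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    elements = []
    for el in soup.find_all(KEEP):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        # Prefer an id-based selector so the agent can act on the element later.
        selector = f"#{el['id']}" if el.get("id") else el.name
        elements.append({"tag": el.name, "text": text[:120], "selector": selector})
    return elements
```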

Accomplishments that we're proud of

We are proud that VoiceNav works end-to-end on real websites, not just a controlled demo environment. The page descriptions sound genuinely natural, more like a helpful friend than a robot reading code. We're also proud of successfully deploying and serving a model on AMD Dev Cloud with low enough latency for real-time use, and building a semantic extraction pipeline that compresses full pages by over 90% while keeping everything the model needs.

What we learned

We learned that prompt examples matter far more than prompt rules: showing the LLM what a good response looks like produced better results than telling it what not to do. Building for accessibility also forced us to think about communication differently; every word in a description has to earn its place.
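
A paraphrased illustration of that lesson: instead of a rule list ("never say click, never say button"), the prompt leads with one example of the tone we want. This is not our exact production prompt:

```python
# Paraphrased illustration of example-driven prompting; not the exact prompt text.
DESCRIBE_PROMPT = """You describe web pages aloud to someone who cannot see them.

Here is an example of a good description:
"This is a news site. The main story is about tomorrow's transit strike,
and below it are three shorter local stories. At the top you can search
or open your saved articles."

Describe the following page in the same style:
{condensed_page}
"""
```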

What's next for VoiceNav

We want to expand VoiceNav to support full form filling with multi-step validation, add persistent user context so it remembers frequently visited sites, and explore ElevenLabs voice cloning so the assistant speaks in a familiar, personalized voice. Longer term, we want to extend the same architecture to support users with motor disabilities, cognitive disabilities, and aphasia; the core pipeline of understanding a page and acting on it by voice applies far beyond blindness alone.
