Inspiration

While exploring Chrome's new Built-in AI APIs, I realized that Gemini Nano's multimodal capabilities could revolutionize web accessibility. Traditional screen readers are limited to text-only understanding, but modern websites are increasingly visual—with buttons styled as images, content in charts, and layouts that rely on visual hierarchy.

I wanted to build something that could see the web like a sighted person does, understand it contextually, and take actions autonomously. The goal: enable someone to simply say "click the login button" and have the AI find and click it, just like asking a friend for help.

What it does

AI Navigator is a Chrome extension that combines vision, voice, and action to make web browsing accessible:

  • 👁️ Sees pages through screenshot capture and Gemini Nano's vision API
  • 🎤 Listens via Whisper speech-to-text (running in-browser with WebGPU)
  • 🤖 Understands content using Gemini Nano for summarization and Q&A
  • 🖱️ Acts autonomously by clicking buttons, links, and forms based on voice commands
  • 🔊 Speaks naturally with Kokoro TTS (7 high-quality voices)

All processing happens 100% on-device—no cloud APIs, no data sharing, complete privacy.

How we built it

Architecture

Built as a Chrome Extension (Manifest V3) with multiple components:

  • Content Script - Injected into web pages, extracts content, tags interactive elements, executes clicks
  • Background Service Worker - Coordinates AI services, manages models, handles screenshot capture
  • Offscreen Document - Enables microphone access for voice input
  • Popup & Setup Pages - User interface for configuration and quick actions
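The way these pieces are declared can be sketched with an illustrative Manifest V3 config; file names and the exact permission set here are assumptions, not the extension's actual manifest:

```json
{
  "manifest_version": 3,
  "name": "AI Navigator",
  "permissions": ["activeTab", "scripting", "storage", "unlimitedStorage", "offscreen"],
  "background": { "service_worker": "background.js", "type": "module" },
  "content_scripts": [{ "matches": ["<all_urls>"], "js": ["content.js"] }],
  "action": { "default_popup": "popup.html" }
}
```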

AI Pipeline

  1. Voice Input → MediaRecorder captures audio → Resample to 16kHz
  2. Transcription → Whisper STT model (distil-whisper-tiny) via Transformers.js
  3. Visual Context → chrome.tabs.captureVisibleTab captures a PNG screenshot
  4. Element Detection → Scan DOM for 50+ interactive element types, assign unique IDs
  5. Multimodal AI → Send text + image to Gemini Nano's vision API
  6. Command Parsing → Extract <<click:elem_id>> commands from AI response
  7. Action Execution → Find element, highlight, scroll, click
  8. Speech Output → Sentence-by-sentence streaming TTS with Kokoro (parallel generation)
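Step 4 above can be sketched roughly as follows; the selector list and the `data-ai-nav-id` attribute are hypothetical simplifications of what the content script actually does:

```javascript
// Illustrative sketch of interactive-element tagging (pipeline step 4).
// Selector list and data attribute are hypothetical simplifications.
const INTERACTIVE_SELECTOR = [
  'a[href]', 'button', 'input', 'select', 'textarea',
  '[role="button"]', '[role="link"]', '[onclick]', '[tabindex]'
].join(', ');

function tagInteractiveElements(root) {
  const map = [];
  let nextId = 0;
  for (const el of root.querySelectorAll(INTERACTIVE_SELECTOR)) {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue; // skip invisible elements
    const id = `elem_${nextId++}`;
    el.setAttribute('data-ai-nav-id', id); // lets the click executor find it later
    map.push({
      id,
      tag: el.tagName.toLowerCase(),
      label: el.getAttribute('aria-label') || (el.textContent || '').trim().slice(0, 80),
      x: Math.round(rect.x),
      y: Math.round(rect.y),
    });
  }
  return map; // sent to the model as text context alongside the screenshot
}
```

The returned map (id, tag, label, position) is what gives the model enough grounding to reference elements by ID in its responses.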

Key Technical Achievements

  • Multimodal prompting - Successfully combined screenshot images with text context
  • Streaming TTS - Parallel sentence generation reduces latency by 70-80%
  • Smart element detection - Filters visible elements, handles aria-labels and shadow DOM
  • Command injection - AI embeds executable commands in natural language responses
  • WebGPU acceleration - Uses fp32 precision for highest quality TTS

Challenges we ran into

1. Gemini Nano Multimodal API Documentation

The vision API is bleeding-edge with limited documentation. Through experimentation, I discovered:

  • Images must be File objects, not base64 strings
  • Must use append() with specific content structure: [{role: 'user', content: [{type: 'text'}, {type: 'image'}]}]
  • Vision capability requires expectedInputs: [{ type: 'image' }] during session creation
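Putting those three discoveries together, a session sketch looks roughly like this. The Prompt API is experimental and its surface has shifted across Chrome versions, so treat the names and shapes here as a snapshot of what worked, not a stable reference:

```javascript
// Sketch of multimodal prompting with Gemini Nano's experimental Prompt API.
// API names and message shapes may differ between Chrome versions.
async function askAboutPage(question, screenshotFile) {
  const session = await LanguageModel.create({
    expectedInputs: [{ type: 'image' }], // required to unlock vision
  });
  // The screenshot goes in via append() as a File object, not a base64 string.
  await session.append([{
    role: 'user',
    content: [
      { type: 'text', value: 'Here is a screenshot of the current page.' },
      { type: 'image', value: screenshotFile },
    ],
  }]);
  return session.prompt(question);
}
```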

2. Audio Streaming for Real-Time TTS

Our initial approach generated the full response before speaking, causing a 10+ second delay. The solution:

  • Split streaming AI responses into sentences
  • Generate TTS for each sentence in parallel (non-blocking promises)
  • Queue audio chunks and play sequentially
  • Result: First words spoken in ~2 seconds vs 10+ seconds
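The core trick — kick off synthesis in parallel but keep playback strictly sequential — can be sketched as follows, with `synthesize` and `play` standing in for the Kokoro pipeline and Web Audio playback:

```javascript
// Sketch of sentence-level streaming TTS: generation starts immediately
// for each sentence (parallel), playback is chained (sequential).
// `synthesize` and `play` are placeholders for Kokoro and Web Audio.
function createSpeaker(synthesize, play) {
  let playbackTail = Promise.resolve();
  return function speakSentence(sentence) {
    const audioPromise = synthesize(sentence); // non-blocking: starts now
    playbackTail = playbackTail
      .then(() => audioPromise)                // wait for this chunk's audio
      .then((audio) => play(audio));           // play in submission order
    return playbackTail;
  };
}

// Naive sentence splitter for the streaming AI response.
function splitSentences(text) {
  return text.match(/[^.!?]+[.!?]+/g) || (text.trim() ? [text.trim()] : []);
}
```

Because each sentence's audio is requested the moment its text arrives, the first sentence can start playing while later ones are still being generated.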

3. Action Execution Reliability

Making the AI consistently output correct element IDs was tricky:

  • Solution: Provide clear context with element descriptions
  • Use distinctive syntax: <<click:elem_id>>
  • Include element text, type, and position in the map
  • Regex parsing: /<<click:(elem_\d+)>>/g
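A minimal sketch of that parsing step — separating spoken text from embedded commands — might look like this (function name is illustrative):

```javascript
// Sketch of command extraction from the model's response (pipeline steps 6-7).
const CLICK_RE = /<<click:(elem_\d+)>>/g;

function parseResponse(text) {
  const clicks = [...text.matchAll(CLICK_RE)].map((m) => m[1]);
  // Strip command tokens so the TTS only speaks natural language.
  const speech = text.replace(CLICK_RE, '').replace(/\s{2,}/g, ' ').trim();
  return { speech, clicks };
}
```

The content script then resolves each extracted ID back to a DOM node via the tag it assigned during element detection, highlights it, scrolls it into view, and clicks.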

4. Chrome Extension Manifest V3 Limitations

  • Service workers can't use getUserMedia() directly → Created offscreen document
  • Audio context requires user gesture → Lazy initialization
  • Storage limits for large models → Used unlimitedStorage permission
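The offscreen-document workaround for microphone access can be sketched like this; `offscreen.html` is a hypothetical page hosting the recorder:

```javascript
// Sketch of the MV3 workaround: service workers cannot call getUserMedia(),
// so audio capture runs in an offscreen document instead.
async function ensureOffscreenDocument() {
  const existing = await chrome.runtime.getContexts({
    contextTypes: ['OFFSCREEN_DOCUMENT'],
  });
  if (existing.length > 0) return; // already created
  await chrome.offscreen.createDocument({
    url: 'offscreen.html', // hypothetical page that hosts the MediaRecorder
    reasons: ['USER_MEDIA'],
    justification: 'Microphone access for voice commands',
  });
}
```

The service worker then messages the offscreen page to start and stop recording, and receives the captured audio back for transcription.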

5. Cross-Browser Model Loading

Transformers.js models behave differently across backends:

  • WebGPU: Best quality but limited browser support
  • WASM: Slower but universal compatibility
  • Implemented fallback chain: WebGPU → WASM → Chrome TTS
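The fallback chain amounts to trying each backend loader in order and falling through on failure; a generic sketch (loader functions are placeholders for the actual Transformers.js / Chrome TTS initializers):

```javascript
// Sketch of the backend fallback chain: try each loader in order and
// return the first that succeeds. Loaders are placeholders for the real
// WebGPU / WASM / Chrome TTS initialization paths.
async function loadWithFallback(loaders) {
  const errors = [];
  for (const [name, load] of loaders) {
    try {
      return { backend: name, model: await load() };
    } catch (err) {
      errors.push(`${name}: ${err.message}`); // record and try the next backend
    }
  }
  throw new Error(`All backends failed: ${errors.join('; ')}`);
}
```

Called as `loadWithFallback([['webgpu', loadWebGPU], ['wasm', loadWasm], ['chrome-tts', loadChromeTTS]])`, this keeps the preference order in one place and degrades gracefully on browsers without WebGPU.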

What we learned

  • Multimodal AI is incredibly powerful for accessibility when you combine vision + language
  • On-device AI is ready for production - Gemini Nano, Whisper, and Kokoro all run smoothly
  • Streaming architectures matter - Parallel processing transforms user experience
  • Web Audio API is complex - Resampling, timing, and context management require careful handling
  • Privacy-first AI is possible - No need for cloud APIs when browser AI is this capable

What's next

  • Form field input - Voice dictation to fill text fields
  • Page scrolling - "Scroll down", "Go to top" voice commands
  • Enhanced vision - Object detection, chart analysis, OCR for images
  • Multi-language support - TTS and STT in languages beyond English
  • Reading mode - Continuous article narration with playback controls

Built With

  • html5/css3
  • javascript
  • kokorotts
  • offscreen
  • scripting
  • storage
  • transformers.js
  • webgpu
  • whisperstt