Inspiration

While exploring Chrome's new Built-in AI APIs, I realized that Gemini Nano's multimodal capabilities could revolutionize web accessibility. Traditional screen readers are limited to text-only understanding, but modern websites are increasingly visual—with buttons styled as images, content in charts, and layouts that rely on visual hierarchy.

I wanted to build something that could see the web like a sighted person does, understand it contextually, and take actions autonomously. The goal: enable someone to simply say "click the login button" and have the AI find and click it, just like asking a friend for help.

What it does

AI Navigator is a Chrome extension that combines vision, voice, and action to make web browsing accessible:

  • 👁️ Sees pages through screenshot capture and Gemini Nano's vision API
  • 🎤 Listens via Whisper speech-to-text (running in-browser with WebGPU)
  • 🤖 Understands content using Gemini Nano for summarization and Q&A
  • 🖱️ Acts autonomously by clicking buttons, links, and forms based on voice commands
  • 🔊 Speaks naturally with Kokoro TTS (7 high-quality voices)

All processing happens 100% on-device—no cloud APIs, no data sharing, complete privacy.

How we built it

Architecture

Built as a Chrome Extension (Manifest V3) with multiple components:

  • Content Script - Injected into web pages, extracts content, tags interactive elements, executes clicks
  • Background Service Worker - Coordinates AI services, manages models, handles screenshot capture
  • Offscreen Document - Enables microphone access for voice input
  • Popup & Setup Pages - User interface for configuration and quick actions
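The way these pieces are declared can be sketched with an illustrative Manifest V3 config; file names and the exact permission set here are assumptions, not the extension's actual manifest:

```json
{
  "manifest_version": 3,
  "name": "AI Navigator",
  "permissions": ["activeTab", "scripting", "storage", "unlimitedStorage", "offscreen"],
  "background": { "service_worker": "background.js", "type": "module" },
  "content_scripts": [{ "matches": ["<all_urls>"], "js": ["content.js"] }],
  "action": { "default_popup": "popup.html" }
}
```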

AI Pipeline

  1. Voice Input → MediaRecorder captures audio → Resample to 16kHz
  2. Transcription → Whisper STT model (distil-whisper-tiny) via Transformers.js
  3. Visual Context → chrome.tabs.captureVisibleTab captures a PNG screenshot
  4. Element Detection → Scan DOM for 50+ interactive element types, assign unique IDs
  5. Multimodal AI → Send text + image to Gemini Nano's vision API
  6. Command Parsing → Extract <<click:elem_id>> commands from AI response
  7. Action Execution → Find element, highlight, scroll, click
  8. Speech Output → Sentence-by-sentence streaming TTS with Kokoro (parallel generation)
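Step 4 above can be sketched roughly as follows; the selector list and the `data-ai-nav-id` attribute are hypothetical simplifications of what the content script actually does:

```javascript
// Illustrative sketch of interactive-element tagging (pipeline step 4).
// Selector list and data attribute are hypothetical simplifications.
const INTERACTIVE_SELECTOR = [
  'a[href]', 'button', 'input', 'select', 'textarea',
  '[role="button"]', '[role="link"]', '[onclick]', '[tabindex]'
].join(', ');

function tagInteractiveElements(root) {
  const map = [];
  let nextId = 0;
  for (const el of root.querySelectorAll(INTERACTIVE_SELECTOR)) {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue; // skip invisible elements
    const id = `elem_${nextId++}`;
    el.setAttribute('data-ai-nav-id', id); // lets the click executor find it later
    map.push({
      id,
      tag: el.tagName.toLowerCase(),
      label: el.getAttribute('aria-label') || (el.textContent || '').trim().slice(0, 80),
      x: Math.round(rect.x),
      y: Math.round(rect.y),
    });
  }
  return map; // sent to the model as text context alongside the screenshot
}
```

The returned map (id, tag, label, position) is what gives the model enough grounding to reference elements by ID in its responses.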

Key Technical Achievements

  • Multimodal prompting - Successfully combined screenshot images with text context
  • Streaming TTS - Parallel sentence generation reduces latency by 70-80%
  • Smart element detection - Filters visible elements, handles aria-labels and shadow DOM
  • Command injection - AI embeds executable commands in natural language responses
  • WebGPU acceleration - Uses fp32 precision for highest quality TTS

Challenges we ran into

1. Gemini Nano Multimodal API Documentation

The vision API is bleeding-edge with limited documentation. Through experimentation, I discovered:

  • Images must be File objects, not base64 strings
  • Must use append() with specific content structure: [{role: 'user', content: [{type: 'text'}, {type: 'image'}]}]
  • Vision capability requires expectedInputs: [{ type: 'image' }] during session creation
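Putting those three discoveries together, a session sketch looks roughly like this. The Prompt API is experimental and its surface has shifted across Chrome versions, so treat the names and shapes here as a snapshot of what worked, not a stable reference:

```javascript
// Sketch of multimodal prompting with Gemini Nano's experimental Prompt API.
// API names and message shapes may differ between Chrome versions.
async function askAboutPage(question, screenshotFile) {
  const session = await LanguageModel.create({
    expectedInputs: [{ type: 'image' }], // required to unlock vision
  });
  // The screenshot goes in via append() as a File object, not a base64 string.
  await session.append([{
    role: 'user',
    content: [
      { type: 'text', value: 'Here is a screenshot of the current page.' },
      { type: 'image', value: screenshotFile },
    ],
  }]);
  return session.prompt(question);
}
```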

2. Audio Streaming for Real-Time TTS

Our initial approach generated the full response before speaking, causing a 10+ second delay. The solution:

  • Split streaming AI responses into sentences
  • Generate TTS for each sentence in parallel (non-blocking promises)
  • Queue audio chunks and play sequentially
  • Result: First words spoken in ~2 seconds vs 10+ seconds
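The core trick — kick off synthesis in parallel but keep playback strictly sequential — can be sketched as follows, with `synthesize` and `play` standing in for the Kokoro pipeline and Web Audio playback:

```javascript
// Sketch of sentence-level streaming TTS: generation starts immediately
// for each sentence (parallel), playback is chained (sequential).
// `synthesize` and `play` are placeholders for Kokoro and Web Audio.
function createSpeaker(synthesize, play) {
  let playbackTail = Promise.resolve();
  return function speakSentence(sentence) {
    const audioPromise = synthesize(sentence); // non-blocking: starts now
    playbackTail = playbackTail
      .then(() => audioPromise)                // wait for this chunk's audio
      .then((audio) => play(audio));           // play in submission order
    return playbackTail;
  };
}

// Naive sentence splitter for the streaming AI response.
function splitSentences(text) {
  return text.match(/[^.!?]+[.!?]+/g) || (text.trim() ? [text.trim()] : []);
}
```

Because each sentence's audio is requested the moment its text arrives, the first sentence can start playing while later ones are still being generated.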

3. Action Execution Reliability

Making the AI consistently output correct element IDs was tricky:

  • Solution: Provide clear context with element descriptions
  • Use distinctive syntax: <<click:elem_id>>
  • Include element text, type, and position in the map
  • Regex parsing: /<<click:(elem_\d+)>>/g
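A minimal sketch of that parsing step — separating spoken text from embedded commands — might look like this (function name is illustrative):

```javascript
// Sketch of command extraction from the model's response (pipeline steps 6-7).
const CLICK_RE = /<<click:(elem_\d+)>>/g;

function parseResponse(text) {
  const clicks = [...text.matchAll(CLICK_RE)].map((m) => m[1]);
  // Strip command tokens so the TTS only speaks natural language.
  const speech = text.replace(CLICK_RE, '').replace(/\s{2,}/g, ' ').trim();
  return { speech, clicks };
}
```

The content script then resolves each extracted ID back to a DOM node via the tag it assigned during element detection, highlights it, scrolls it into view, and clicks.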

4. Chrome Extension Manifest V3 Limitations

  • Service workers can't use getUserMedia() directly → Created offscreen document
  • Audio context requires user gesture → Lazy initialization
  • Storage limits for large models → Used unlimitedStorage permission
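The offscreen-document workaround for microphone access can be sketched like this; `offscreen.html` is a hypothetical page hosting the recorder:

```javascript
// Sketch of the MV3 workaround: service workers cannot call getUserMedia(),
// so audio capture runs in an offscreen document instead.
async function ensureOffscreenDocument() {
  const existing = await chrome.runtime.getContexts({
    contextTypes: ['OFFSCREEN_DOCUMENT'],
  });
  if (existing.length > 0) return; // already created
  await chrome.offscreen.createDocument({
    url: 'offscreen.html', // hypothetical page that hosts the MediaRecorder
    reasons: ['USER_MEDIA'],
    justification: 'Microphone access for voice commands',
  });
}
```

The service worker then messages the offscreen page to start and stop recording, and receives the captured audio back for transcription.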

5. Cross-Browser Model Loading

Transformers.js models behave differently across backends:

  • WebGPU: Best quality but limited browser support
  • WASM: Slower but universal compatibility
  • Implemented fallback chain: WebGPU → WASM → Chrome TTS
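The fallback chain amounts to trying each backend loader in order and falling through on failure; a generic sketch (loader functions are placeholders for the actual Transformers.js / Chrome TTS initializers):

```javascript
// Sketch of the backend fallback chain: try each loader in order and
// return the first that succeeds. Loaders are placeholders for the real
// WebGPU / WASM / Chrome TTS initialization paths.
async function loadWithFallback(loaders) {
  const errors = [];
  for (const [name, load] of loaders) {
    try {
      return { backend: name, model: await load() };
    } catch (err) {
      errors.push(`${name}: ${err.message}`); // record and try the next backend
    }
  }
  throw new Error(`All backends failed: ${errors.join('; ')}`);
}
```

Called as `loadWithFallback([['webgpu', loadWebGPU], ['wasm', loadWasm], ['chrome-tts', loadChromeTTS]])`, this keeps the preference order in one place and degrades gracefully on browsers without WebGPU.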

What we learned

  • Multimodal AI is incredibly powerful for accessibility when you combine vision + language
  • On-device AI is ready for production - Gemini Nano, Whisper, and Kokoro all run smoothly
  • Streaming architectures matter - Parallel processing transforms user experience
  • Web Audio API is complex - Resampling, timing, and context management require careful handling
  • Privacy-first AI is possible - No need for cloud APIs when browser AI is this capable

What's next

  • Form field input - Voice dictation to fill text fields
  • Page scrolling - "Scroll down", "Go to top" voice commands
  • Enhanced vision - Object detection, chart analysis, OCR for images
  • Multi-language support - TTS and STT in languages beyond English
  • Reading mode - Continuous article narration with playback controls

Built With

  • html5/css3
  • javascript
  • kokorotts
  • offscreen
  • scripting
  • storage
  • transformers.js
  • webgpu
  • whisperstt