## Inspiration
We kept running into the same frustrating workflow: see something confusing on screen, take a screenshot, open a new tab, drag it into ChatGPT or Claude, type the question, wait, then switch back. Half the time the screenshots pile up on the desktop; the other half we lose context switching between tabs.
We thought: why can't we just talk to the screen and get an answer right there?
## What it does
ScreenSense Voice is a Chrome extension that lets you hold a key, speak a question about anything on your screen, and get an AI-powered answer instantly, overlaid right on the page you're looking at.
- Screen-aware - automatically captures what's on your screen when you ask
- Voice-first - hold the backtick key, speak naturally, release to get your answer
- Multiple display modes - text + audio, audio only, or text only
- Explanation levels - tailor responses from Kid to Executive level
- Conversation memory - up to 20 follow-up turns per tab with full context
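The 20-turn conversation memory could be kept as a simple per-tab buffer in the service worker. A minimal sketch, assuming hypothetical names (`Turn`, `addTurn`, `getContext`) that are not from the actual codebase:

```typescript
// Hypothetical sketch: per-tab conversation memory capped at 20 turns.
interface Turn {
  question: string;
  answer: string;
}

const MAX_TURNS = 20;
const memory = new Map<number, Turn[]>(); // keyed by tab ID

function addTurn(tabId: number, turn: Turn): void {
  const turns = memory.get(tabId) ?? [];
  turns.push(turn);
  // Drop the oldest turns once the cap is exceeded.
  memory.set(tabId, turns.slice(-MAX_TURNS));
}

function getContext(tabId: number): Turn[] {
  return memory.get(tabId) ?? [];
}
```

On each follow-up question, `getContext(tabId)` would be prepended to the prompt so the model sees the full conversation for that tab.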
## How we built it
- Chrome Extension (Manifest V3) as the platform, built with TypeScript and Webpack
- Groq Whisper API (whisper-large-v3-turbo) for fast, accurate speech-to-text
- Groq Vision API with Llama 4 Scout to understand screenshots and generate streaming responses
- ElevenLabs API for natural-sounding voice summaries (with Web Speech API fallback)
- Shadow DOM for complete UI isolation - the overlay never breaks the host page's styles
- React 18 for the settings and onboarding pages
- Offscreen Document API for microphone access in Manifest V3
The architecture follows a pipeline: content script captures the shortcut → service worker orchestrates screenshot + transcription + AI response + TTS → content script renders the streamed result in real-time.
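The streaming leg of that pipeline can be sketched as a loop that re-renders the overlay on every chunk. This is an illustrative sketch, not the actual implementation: the async generator stands in for the Groq streaming response, and `render` stands in for the content script's overlay update.

```typescript
// Hypothetical sketch of the streaming leg of the pipeline.
async function streamToOverlay(
  chunks: AsyncIterable<string>,
  render: (textSoFar: string) => void
): Promise<string> {
  let text = "";
  for await (const chunk of chunks) {
    text += chunk;
    render(text); // content script re-renders the overlay on every chunk
  }
  return text; // full answer, e.g. handed to TTS for the spoken summary
}

// Fake stream for illustration.
async function* fakeStream(): AsyncGenerator<string> {
  for (const word of ["Screen", "Sense ", "answers ", "inline."]) {
    yield word;
  }
}
```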
## Challenges we ran into
- Manifest V3 microphone restrictions - MV3 doesn't allow direct microphone access from service workers. We had to use the Offscreen Document API to handle audio recording in a separate hidden document.
- Streaming responses - getting the AI response to stream word-by-word into the overlay while simultaneously generating a TTS summary required careful coordination between the service worker and content script.
- Shadow DOM styling - completely isolating the overlay's styles from any host page while keeping it responsive and visually consistent everywhere.
- Audio amplitude visualization - piping real-time audio levels from the offscreen document back to the content script for the animated waveform indicator.
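For the waveform indicator, each frame of raw samples in the offscreen document has to be reduced to a single level before it is posted to the content script. A minimal sketch, assuming a hypothetical `rmsLevel` helper fed from an `AnalyserNode` or `AudioWorklet`:

```typescript
// Hypothetical helper: reduce a frame of raw audio samples to a single
// RMS level in [0, 1] that drives the animated waveform indicator.
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}
```

The resulting number would then be forwarded (e.g. via `chrome.runtime.sendMessage`) to the content script on each animation frame.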
## Accomplishments we're proud of
- Zero context switching - the answer appears right where you're looking
- Works on every website with no configuration per site
- The entire product runs on free API tiers - no credit card required
- Smart 3-second audio summaries instead of reading the entire response aloud
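The 3-second summary implies a word budget for the TTS text. A hedged sketch of one way to enforce it, assuming an average speaking rate of roughly 150 words per minute (2.5 words per second); the helper name and rate are illustrative, not from the actual code:

```typescript
// Hypothetical sketch: clamp summary text to roughly N seconds of speech,
// assuming ~150 words per minute (2.5 words per second).
const WORDS_PER_SECOND = 2.5;

function clampForSpeech(text: string, maxSeconds = 3): string {
  const budget = Math.floor(maxSeconds * WORDS_PER_SECOND); // ~7 words for 3s
  const words = text.trim().split(/\s+/);
  if (words.length <= budget) return text.trim();
  return words.slice(0, budget).join(" ") + "…";
}
```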
## What we learned
- The Offscreen Document API is powerful but tricky to work with in MV3
- Shadow DOM is essential for building Chrome extension UIs that don't conflict with host pages
- Streaming AI responses create a dramatically better UX than waiting for complete responses
- Voice interaction reduces friction far more than we expected - once you try it, typing questions feels slow
## What's next for ScreenSense Voice
- Multi-model support (GPT-4o, Claude) for users who prefer different AI providers
- Persistent conversation history across tabs and sessions
- Custom shortcut key configuration
- Chrome Web Store publishing for one-click installation
## Built With
- chrome
- elevenlabs
- groq
- llama-4-scout
- manifest-v3
- react
- shadow-dom
- speech
- typescript
- web
- webpack
- whisper