## Inspiration
We kept running into the same frustrating workflow: see something confusing on screen, take a screenshot, open a new tab, drag it into ChatGPT or Claude, type the question, wait, then switch back. Half the time the screenshots pile up on the desktop; the other half we lose context switching between tabs.
We thought: why can't we just talk to the screen and get an answer right there?
## What it does
ScreenSense Voice is a Chrome extension that lets you hold a key, speak a question about anything on your screen, and get an AI-powered answer instantly, overlaid right on the page you're looking at.
- Screen-aware - automatically captures what's on your screen when you ask
- Voice-first - hold the backtick key, speak naturally, release to get your answer
- Multiple display modes - text + audio, audio only, or text only
- Explanation levels - tailor responses from Kid to Executive level
- Conversation memory - up to 20 follow-up turns per tab with full context
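The 20-turn conversation memory could be kept as a simple per-tab buffer in the service worker. A minimal sketch, assuming hypothetical names (`Turn`, `addTurn`, `getContext`) that are not from the actual codebase:

```typescript
// Hypothetical sketch: per-tab conversation memory capped at 20 turns.
interface Turn {
  question: string;
  answer: string;
}

const MAX_TURNS = 20;
const memory = new Map<number, Turn[]>(); // keyed by tab ID

function addTurn(tabId: number, turn: Turn): void {
  const turns = memory.get(tabId) ?? [];
  turns.push(turn);
  // Drop the oldest turns once the cap is exceeded.
  memory.set(tabId, turns.slice(-MAX_TURNS));
}

function getContext(tabId: number): Turn[] {
  return memory.get(tabId) ?? [];
}
```

On each follow-up question, `getContext(tabId)` would be prepended to the prompt so the model sees the full conversation for that tab.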
## How we built it
- Chrome Extension (Manifest V3) as the platform, built with TypeScript and Webpack
- Groq Whisper API (whisper-large-v3-turbo) for fast, accurate speech-to-text
- Groq Vision API with Llama 4 Scout to understand screenshots and generate streaming responses
- ElevenLabs API for natural-sounding voice summaries (with Web Speech API fallback)
- Shadow DOM for complete UI isolation - the overlay never breaks the host page's styles
- React 18 for the settings and onboarding pages
- Offscreen Document API for microphone access in Manifest V3
The architecture follows a pipeline: content script captures the shortcut → service worker orchestrates screenshot + transcription + AI response + TTS → content script renders the streamed result in real-time.
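The streaming leg of that pipeline can be sketched as a loop that re-renders the overlay on every chunk. This is an illustrative sketch, not the actual implementation: the async generator stands in for the Groq streaming response, and `render` stands in for the content script's overlay update.

```typescript
// Hypothetical sketch of the streaming leg of the pipeline.
async function streamToOverlay(
  chunks: AsyncIterable<string>,
  render: (textSoFar: string) => void
): Promise<string> {
  let text = "";
  for await (const chunk of chunks) {
    text += chunk;
    render(text); // content script re-renders the overlay on every chunk
  }
  return text; // full answer, e.g. handed to TTS for the spoken summary
}

// Fake stream for illustration.
async function* fakeStream(): AsyncGenerator<string> {
  for (const word of ["Screen", "Sense ", "answers ", "inline."]) {
    yield word;
  }
}
```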
## Challenges we ran into
- Manifest V3 microphone restrictions - MV3 doesn't allow direct microphone access from service workers. We had to use the Offscreen Document API to handle audio recording in a separate hidden document.
- Streaming responses - getting the AI response to stream word-by-word into the overlay while simultaneously generating a TTS summary required careful coordination between the service worker and content script.
- Shadow DOM styling - completely isolating the overlay's styles from any host page while keeping it responsive and visually consistent everywhere.
- Audio amplitude visualization - piping real-time audio levels from the offscreen document back to the content script for the animated waveform indicator.
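For the waveform indicator, each frame of raw samples in the offscreen document has to be reduced to a single level before it is posted to the content script. A minimal sketch, assuming a hypothetical `rmsLevel` helper fed from an `AnalyserNode` or `AudioWorklet`:

```typescript
// Hypothetical helper: reduce a frame of raw audio samples to a single
// RMS level in [0, 1] that drives the animated waveform indicator.
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}
```

The resulting number would then be forwarded (e.g. via `chrome.runtime.sendMessage`) to the content script on each animation frame.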
## Accomplishments we're proud of
- Zero context switching - the answer appears right where you're looking
- Works on every website with no configuration per site
- The entire product runs on free API tiers - no credit card required
- Smart 3-second audio summaries instead of reading the entire response aloud
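The 3-second summary implies a word budget for the TTS text. A hedged sketch of one way to enforce it, assuming an average speaking rate of roughly 150 words per minute (2.5 words per second); the helper name and rate are illustrative, not from the actual code:

```typescript
// Hypothetical sketch: clamp summary text to roughly N seconds of speech,
// assuming ~150 words per minute (2.5 words per second).
const WORDS_PER_SECOND = 2.5;

function clampForSpeech(text: string, maxSeconds = 3): string {
  const budget = Math.floor(maxSeconds * WORDS_PER_SECOND); // ~7 words for 3s
  const words = text.trim().split(/\s+/);
  if (words.length <= budget) return text.trim();
  return words.slice(0, budget).join(" ") + "…";
}
```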
## What we learned
- The Offscreen Document API is powerful but tricky to work with in MV3
- Shadow DOM is essential for building Chrome extension UIs that don't conflict with host pages
- Streaming AI responses create a dramatically better UX than waiting for complete responses
- Voice interaction reduces friction far more than we expected - once you try it, typing questions feels slow
## What's next for ScreenSense Voice
- Multi-model support (GPT-4o, Claude) for users who prefer different AI providers
- Persistent conversation history across tabs and sessions
- Custom shortcut key configuration
- Chrome Web Store publishing for one-click installation
## Built With
- chrome
- elevenlabs
- groq
- llama-4-scout
- manifest-v3
- react
- shadow-dom
- speech
- typescript
- web
- webpack
- whisper