Inspiration

Browsers still work the same way they did 20 years ago: click, scroll, type, copy, paste, and constantly switch tabs.

For many people this workflow is fine. But for millions of users it creates friction and slows productivity. People frequently copy text into AI tools, take screenshots to ask questions, or manually navigate multiple pages just to complete a simple task.

We asked a simple question:

What if you could just talk to your browser and have it do things for you?

Instead of manually navigating websites, the browser could become an AI agent that understands your screen and executes tasks autonomously.

That idea became ScreenSense Voice.


What it does

ScreenSense Voice is a Chrome extension that turns your voice into browser actions.

You simply hold a key, speak a command, and an AI agent powered by Amazon Nova takes over your browser.

The system can:

  • Click buttons
  • Type into forms
  • Navigate across websites
  • Scroll through pages
  • Search products
  • Add items to carts
  • Complete multi-step workflows

All completely hands-free.
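Internally, an action set like this is easiest to handle as a typed command schema that the agent's JSON replies are validated against. The sketch below is illustrative: the `kind` names and fields are assumptions, not ScreenSense's actual message format.

```typescript
// Illustrative action schema for the agent's browser commands.
// Type and field names are assumptions, not ScreenSense's real protocol.
type BrowserAction =
  | { kind: "click"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "navigate"; url: string }
  | { kind: "scroll"; deltaY: number }
  | { kind: "done"; summary: string };

// Narrow an untyped model response into a validated action, or reject it.
function parseAction(raw: unknown): BrowserAction | null {
  const a = raw as Record<string, unknown>;
  if (!a || typeof a !== "object" || typeof a.kind !== "string") return null;
  switch (a.kind) {
    case "click":
      return typeof a.selector === "string"
        ? { kind: "click", selector: a.selector } : null;
    case "type":
      return typeof a.selector === "string" && typeof a.text === "string"
        ? { kind: "type", selector: a.selector, text: a.text } : null;
    case "navigate":
      return typeof a.url === "string" ? { kind: "navigate", url: a.url } : null;
    case "scroll":
      return typeof a.deltaY === "number" ? { kind: "scroll", deltaY: a.deltaY } : null;
    case "done":
      return typeof a.summary === "string" ? { kind: "done", summary: a.summary } : null;
    default:
      return null; // unknown action kinds are rejected rather than executed
  }
}
```

Validating before executing matters here: a malformed or hallucinated action should be dropped, never clicked.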

Example

User says:

"Add the cheapest USB-C cable to my cart on Amazon."

ScreenSense will:

  1. Navigate to Amazon
  2. Search for USB-C cables
  3. Analyze the results
  4. Find the cheapest product
  5. Click the product
  6. Add it to the cart

All without the user touching the mouse.


How we built it

ScreenSense Voice combines voice AI, multimodal reasoning, and browser automation into an autonomous agent system.

Core AI Services

| Service | Purpose |
|---|---|
| Amazon Nova 2 Lite (Bedrock) | Multimodal reasoning using screenshots + DOM |
| Amazon Transcribe Streaming | Real-time speech-to-text |
| Groq Whisper | Speech recognition fallback |
| ElevenLabs | Text-to-speech voice responses |

Architecture

User holds ` key and speaks
        ↓
Content Scripts (voice trigger, DOM scraper, action executor)
        ↓
Service Worker (agent loop orchestration)
        ↓
FastAPI Backend (task routing + reasoning pipeline)
        ↓
AWS Cloud Services (Nova reasoning + Transcribe speech recognition)


Autonomous Agent Loop

ScreenSense operates through a perception → reasoning → action loop.

Voice command
        ↓
Capture screenshot + DOM
        ↓
Nova analyzes context
        ↓
AI decides next action
        ↓
Execute browser action
        ↓
Capture updated screen
        ↓
Repeat until task complete

The loop runs for up to 25 iterations, allowing the system to complete complex workflows.
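The orchestration can be sketched as a small loop with injected callbacks. The callbacks below are synchronous stand-ins for what are async calls in practice (screenshot capture, Nova reasoning, DOM execution), and their names are illustrative, not ScreenSense's actual API.

```typescript
// Sketch of the perception → reasoning → action loop, capped at 25 iterations.
interface Observation { screenshot: string; dom: string }
interface Step { action: string; done: boolean }

function runAgentLoop(
  goal: string,
  observe: () => Observation,                       // perception: screenshot + DOM
  decide: (goal: string, obs: Observation) => Step, // reasoning: model picks next action
  act: (action: string) => void,                    // action: execute in the page
  maxIterations = 25,
): { completed: boolean; iterations: number } {
  for (let i = 1; i <= maxIterations; i++) {
    const obs = observe();
    const step = decide(goal, obs);
    if (step.done) return { completed: true, iterations: i };
    act(step.action);
  }
  // Safety cap: never let the agent run unbounded.
  return { completed: false, iterations: maxIterations };
}
```

The fixed iteration cap is the important design choice: it bounds cost and guarantees the agent halts even when the model never signals completion.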


Key Features

Voice-to-Action Control

Users simply hold a key and speak commands.

Speech is transcribed using Amazon Transcribe Streaming.
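The hold-to-talk behavior reduces to a small state machine: recording starts on keydown of the trigger key and stops on keyup. The wiring to the real audio capture and Transcribe stream is omitted here; this sketch only tracks the hold state, and the class name is illustrative.

```typescript
// Minimal push-to-talk state machine. Guards against keyboard auto-repeat:
// holding the key fires keydown repeatedly, but recording starts only once.
class PushToTalk {
  private recording = false;

  constructor(
    private triggerKey: string,
    private onStart: () => void, // e.g. begin streaming audio
    private onStop: () => void,  // e.g. finalize transcription
  ) {}

  keydown(key: string): void {
    if (key === this.triggerKey && !this.recording) {
      this.recording = true;
      this.onStart();
    }
  }

  keyup(key: string): void {
    if (key === this.triggerKey && this.recording) {
      this.recording = false;
      this.onStop();
    }
  }

  isRecording(): boolean {
    return this.recording;
  }
}
```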


Multimodal Web Understanding

The AI receives both:

  • A screenshot of the webpage
  • A structured DOM snapshot

This combination dramatically improves browser automation accuracy.
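Each loop iteration therefore sends one image part plus one text part. The shape below is a generic sketch of that pairing; the field names are illustrative and not the exact Bedrock request schema.

```typescript
// Illustrative multimodal request payload: screenshot + goal + DOM snapshot.
// Field names are generic placeholders, not the Bedrock Converse schema.
interface ContentPart {
  type: "image" | "text";
  data: string;
}

function buildMultimodalPrompt(
  goal: string,
  screenshotBase64: string,
  domSnapshot: string,
): ContentPart[] {
  return [
    { type: "image", data: screenshotBase64 },
    {
      type: "text",
      data:
        `Goal: ${goal}\n\n` +
        `Interactive elements:\n${domSnapshot}\n\n` +
        `Reply with the next browser action as JSON.`,
    },
  ];
}
```

The screenshot grounds the model in what the user actually sees, while the DOM snapshot gives it exact, clickable targets; either alone is noticeably less reliable.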


Autonomous AI Agent

The system can perform multi-step tasks such as:

  • product searches
  • form filling
  • cross-site navigation
  • automated browsing workflows

Smart DOM Selector System

Modern websites generate unstable IDs.

We built a selector engine that prioritizes:

  • aria-label
  • data-* attributes
  • semantic tags
  • link patterns

This makes the automation much more reliable.
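The priority order above can be expressed as a small scoring function. The sketch below works on a plain attribute map rather than a live DOM node, and the exact fallback order past the listed priorities is an assumption.

```typescript
// Selector sketch: prefer stable, semantic attributes over auto-generated IDs.
interface ElementInfo {
  tag: string;
  attrs: Record<string, string>;
}

function bestSelector(el: ElementInfo): string {
  // 1. aria-label: written for accessibility, rarely changes between builds.
  const aria = el.attrs["aria-label"];
  if (aria) return `${el.tag}[aria-label="${aria}"]`;

  // 2. data-* attributes: usually hand-authored test/analytics hooks.
  const dataAttr = Object.keys(el.attrs).find((k) => k.startsWith("data-"));
  if (dataAttr) return `${el.tag}[${dataAttr}="${el.attrs[dataAttr]}"]`;

  // 3. Semantic tags carry meaning on their own.
  const semanticTags = ["button", "nav", "main", "form", "input", "a"];
  if (semanticTags.includes(el.tag)) return el.tag;

  // Last resort: the raw id, even though it may be auto-generated.
  return el.attrs.id ? `#${el.attrs.id}` : el.tag;
}
```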


Adaptive Explanation Levels

Users can control how the AI communicates:

  • Kid
  • Student
  • College
  • PhD
  • Executive

The AI adjusts its explanation depth automatically.
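One straightforward way to implement this is to map each level to a system-prompt fragment appended before the narration request. The wording below is illustrative, not ScreenSense's actual prompts.

```typescript
// Map each explanation level to a system-prompt fragment (illustrative wording).
type Level = "kid" | "student" | "college" | "phd" | "executive";

const LEVEL_PROMPTS: Record<Level, string> = {
  kid: "Explain what you are doing in one short, simple sentence.",
  student: "Explain each step in plain language with minimal jargon.",
  college: "Explain steps using standard technical terminology.",
  phd: "Give a precise, detailed rationale for each action.",
  executive: "Summarize only the outcome and business impact, briefly.",
};

// Combine the base agent prompt with the user's chosen level.
function systemPromptFor(level: Level, basePrompt: string): string {
  return `${basePrompt}\n\n${LEVEL_PROMPTS[level]}`;
}
```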


Challenges we ran into

DOM selector reliability

Many websites generate dynamic IDs that change between page loads, which breaks selector-based automation.

We solved this by prioritizing semantic attributes and accessibility labels instead of fragile selectors.


Agent loop termination

The AI needs to know when a task is truly complete.

We solved this through iterative observation and completion checks after each action.
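One way to make such a check concrete is to require both an explicit completion marker from the model and supporting evidence in the freshly captured page state, so a premature "done" claim is not trusted. The marker and the evidence heuristic below are illustrative assumptions, not ScreenSense's exact mechanism.

```typescript
// Completion-check sketch: trust "DONE" only when the latest observation
// supports it. Marker and heuristic are illustrative.
function isTaskComplete(
  modelReply: string,
  latestDom: string,
  successHint: string, // e.g. "added to cart"
): boolean {
  const claimsDone = modelReply.includes("DONE");
  // Guard against a premature claim: the page itself must show evidence.
  const evidence = latestDom.toLowerCase().includes(successHint.toLowerCase());
  return claimsDone && evidence;
}
```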


Chrome Manifest V3 restrictions

Manifest V3 enforces strict security rules, and its service workers cannot access media APIs such as the microphone directly.

We therefore built an offscreen document system to handle audio recording securely.


Screenshot payload size

Screenshots can exceed 2MB.

We implemented:

  • JPEG compression
  • resolution scaling
  • optimized payload transmission

to stay within API limits.
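The resolution-scaling step comes down to simple arithmetic: estimate the base64-encoded size and, if it exceeds the budget, shrink each dimension so the pixel count (and roughly the JPEG size) fits. The 2 MB budget matches the limit mentioned above; the size model is an illustrative approximation.

```typescript
// Back-of-envelope screenshot sizing against an API payload budget.
const PAYLOAD_BUDGET_BYTES = 2 * 1024 * 1024; // the ~2 MB limit noted above

// base64 inflates binary data by 4/3, padded to 4-byte groups.
function base64Size(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// Pick a per-dimension scale factor so the encoded image fits the budget.
function scaleFactorToFit(rawBytes: number, budget = PAYLOAD_BUDGET_BYTES): number {
  const encoded = base64Size(rawBytes);
  if (encoded <= budget) return 1; // already small enough
  // JPEG size grows roughly with pixel count, i.e. with scale^2,
  // so scale each dimension by the square root of the size ratio.
  return Math.sqrt(budget / encoded);
}
```

In the extension itself the actual resize would happen on a canvas before re-encoding to JPEG; this sketch only captures the sizing decision.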


What we learned

Multimodal reasoning is powerful

Using visual context + DOM structure together dramatically improves reliability.


Agent loops enable real automation

Single prompts fail on complex tasks.

Iterative reasoning enables true autonomous browser workflows.


Voice interfaces reduce friction

Voice-controlled agents could dramatically improve accessibility and productivity.


What's next

We plan to extend ScreenSense with:

  • Multi-tab workflows
    Example: "Compare prices for this item on Amazon and Walmart."

  • Nova Act integration
    for even more reliable UI automation.

  • Conversational browser feedback
    using Amazon Nova Sonic.

  • Complex workflow automation
    including travel bookings and job applications.


Built With

  • Amazon Nova (AWS Bedrock)
  • Amazon Transcribe
  • Groq Whisper
  • ElevenLabs
  • Chrome Extension (Manifest V3)
  • TypeScript
  • React
  • Tailwind CSS
  • FastAPI
  • Python
  • Webpack

Links

Project Landing Page

GitHub Repository

Technical Article

Built for

Amazon Nova AI Hackathon 2026
