Inspiration

Every day, millions of people waste hours doing repetitive web tasks — searching prices, filling forms, copying data between sites, navigating complex software. Existing automation tools either require coding knowledge, break every time a website updates, or cost thousands of dollars.

I asked one question: What if anyone could just speak a task and watch it happen automatically — on any website, without any setup?

That's what VoiceNav is.


What It Does

VoiceNav is your AI hands on the internet. You speak any task naturally. The agent:

  1. Hears your command via Gemini Live API in real-time
  2. Sees your screen by analyzing screenshots with Gemini Vision
  3. Plans the exact steps needed to complete your task
  4. Executes real actions — clicks, scrolls, types — purely from visual understanding
  5. Narrates every step aloud so you always know what's happening
  6. Confirms before any irreversible action like checkout or form submission

Example commands that work on any website:

  • "Go to Amazon and find the best wireless headphones under $100"
  • "Search for flights from Mumbai to London in April under $800"
  • "Find remote Python developer jobs on LinkedIn posted this week"
  • "Open Gmail and find the latest invoice from my supplier"

No DOM access. No website-specific APIs. No coding required. Just speak and watch it happen.


How I Built It

Architecture

Voice Input (Gemini Live API)
  ↓
ADK Agent (Google Cloud Run)
  ↓
Screenshot → Gemini Vision
  ↓
Action Execution (CLICK / TYPE / SCROLL)
  ↓
Verify Result → Loop until done

The Agent Loop

The core of VoiceNav is an 8-step loop powered by Google ADK:

  1. LISTEN — Gemini Live API captures real-time voice
  2. PLAN — Break task into ordered sub-steps
  3. OBSERVE — Take screenshot, send to Gemini Vision
  4. DECIDE — Identify the next best action from what's visible
  5. ACT — Execute CLICK(x,y), TYPE(text), SCROLL, PRESS
  6. VERIFY — Take new screenshot, confirm action worked
  7. CONFIRM — Before irreversible actions, ask user
  8. DONE — Speak completion summary, show results
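
The observe, decide, confirm, and act portion of this loop can be sketched as a small driver. This is a minimal illustration, not VoiceNav's actual code: `observe`, `decide`, `act`, and `confirm` are hypothetical callables standing in for the screenshot capture, the Gemini Vision call, the input executor, and the spoken confirmation, and the `Action` shape (including its `irreversible` flag) is an assumption. Verification is implicit here: the next `observe()` call is what lets `decide` see whether the previous action worked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str                   # "CLICK", "TYPE", "SCROLL", "PRESS", or "DONE"
    arg: str = ""               # e.g. "412,230" for CLICK, or the text for TYPE
    irreversible: bool = False  # hypothetical flag set by the deciding step


def run_task(
    observe: Callable[[], bytes],
    decide: Callable[[bytes], Action],
    act: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    max_steps: int = 20,
) -> list[str]:
    """Loop OBSERVE -> DECIDE -> (CONFIRM) -> ACT until DONE or the step limit."""
    log: list[str] = []
    for _ in range(max_steps):
        screenshot = observe()       # OBSERVE (also verifies the previous step)
        action = decide(screenshot)  # DECIDE
        if action.kind == "DONE":
            log.append("DONE")
            return log
        if action.irreversible and not confirm(action):  # CONFIRM
            log.append(f"SKIPPED {action.kind}")
            return log
        act(action)                  # ACT
        log.append(f"{action.kind}({action.arg})")
    log.append("STOPPED: step limit reached")
    return log
```

The step limit is a design choice for the sketch: a vision-driven agent can get stuck re-trying the same click, so bounding the loop keeps a failed task from running forever.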

What Makes It Different

Unlike Selenium or RPA tools, VoiceNav relies purely on visual understanding, much as a human would. It never touches the DOM or relies on website-specific APIs, so it is not tied to any one site's markup and has no selectors to break when a site is redesigned.


Challenges I Faced

Accurate coordinate mapping — Getting Gemini Vision to return precise pixel coordinates for UI elements required careful prompt engineering and a verification loop after every action.
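
One frequent source of coordinate error is a mismatch between the coordinate space the model answers in and real screen pixels. The sketch below assumes the model returns a bounding box as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 grid, which is the convention Gemini documents for object detection; whether VoiceNav prompts for that format is an assumption, and `box_to_click_point` is a hypothetical helper name.

```python
def box_to_click_point(box, screen_width, screen_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to a pixel click point.

    Returns the (x, y) center of the box, clamped to the screen bounds.
    """
    ymin, xmin, ymax, xmax = box
    if not (0 <= xmin <= xmax <= scale and 0 <= ymin <= ymax <= scale):
        raise ValueError(f"box out of range or inverted: {box}")

    # Center of the box, rescaled from the normalized grid to pixels.
    x = int((xmin + xmax) * screen_width // (2 * scale))
    y = int((ymin + ymax) * screen_height // (2 * scale))

    # Clamp so a box touching the right or bottom edge stays on screen.
    return min(x, screen_width - 1), min(y, screen_height - 1)


print(box_to_click_point([100, 200, 300, 400], 1920, 1080))
```

Rejecting inverted or out-of-range boxes up front gives the verification loop a clear signal to re-ask the model instead of clicking somewhere arbitrary.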

Handling dynamic pages — Modern websites load content asynchronously. Teaching the agent to wait, detect loading states, and re-observe was critical for reliability.
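
A simple way to approximate "wait until the page has settled" without DOM access is to keep taking screenshots until consecutive frames stop changing. This is a sketch under assumptions, not the method VoiceNav necessarily uses: `capture` is a hypothetical callable returning raw screenshot bytes, and exact byte comparison will never settle on pages with animations or a blinking cursor, where a tolerance-based image diff would be needed instead.

```python
import hashlib
import time


def wait_until_stable(capture, stable_frames=2, interval=0.5, timeout=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Return the first frame that matched the previous `stable_frames` captures.

    Returns None if the screen is still changing when `timeout` seconds elapse.
    """
    deadline = clock() + timeout
    last_digest = None
    unchanged = 0
    while True:
        frame = capture()
        digest = hashlib.sha256(frame).hexdigest()
        # Count consecutive frames identical to the one before them.
        unchanged = unchanged + 1 if digest == last_digest else 0
        if unchanged >= stable_frames:
            return frame
        if clock() >= deadline:
            return None
        last_digest = digest
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real delays; in normal use the defaults apply.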

Voice interruption handling — Implementing graceful mid-task interruptions with Gemini Live API required careful state management so the agent could pause cleanly without corrupting the task state.
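
One way to keep task state consistent under interruption is to check for an interrupt only between steps, so each step either completes fully or never starts. The sketch below illustrates that idea; `TaskState` and `InterruptibleRunner` are hypothetical names, and it assumes the voice layer sets the `interrupt` event from another thread when the user speaks mid-task.

```python
import threading
from dataclasses import dataclass


@dataclass
class TaskState:
    steps: list[str]
    next_index: int = 0   # index of the next step that has not run yet
    paused: bool = False


class InterruptibleRunner:
    def __init__(self, state, execute):
        self.state = state
        self.execute = execute              # callable that performs one step
        self.interrupt = threading.Event()  # set by the voice layer

    def run(self):
        """Run remaining steps. Returns True when finished, False if paused."""
        self.state.paused = False
        while self.state.next_index < len(self.state.steps):
            # Interrupts are honored only at step boundaries.
            if self.interrupt.is_set():
                self.interrupt.clear()
                self.state.paused = True
                return False
            self.execute(self.state.steps[self.state.next_index])
            self.state.next_index += 1
        return True
```

Because `next_index` only advances after a step finishes, calling `run()` again after a pause resumes at the first step that has not yet executed.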

Safe confirmation flow — Building the confirmation step for irreversible actions (purchases, form submissions) without making the experience feel clunky took several iterations to get right.
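
A lightweight version of this gate asks for confirmation only when the target element's label looks irreversible, and lets everything else through without interruption. The keyword list, `needs_confirmation`, and `guarded_click` below are illustrative assumptions rather than VoiceNav's actual rules; a real system would likely also ask the model to classify the action.

```python
import re

# Hypothetical list of labels that suggest an irreversible action.
_IRREVERSIBLE = re.compile(
    r"\b(buy|purchase|pay|checkout|place order|submit|delete|send)\b",
    re.IGNORECASE,
)


def needs_confirmation(target_label: str) -> bool:
    """True if the label of the element about to be clicked looks irreversible."""
    return _IRREVERSIBLE.search(target_label) is not None


def guarded_click(target_label, click, ask_user) -> bool:
    """Click only if the action is safe or the user approves. Returns True if clicked."""
    if needs_confirmation(target_label):
        if not ask_user(f"About to click '{target_label}'. Proceed?"):
            return False
    click()
    return True
```

Matching on whole words keeps labels such as "PayPal" from triggering a prompt, which helps the flow feel less clunky for routine navigation.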


What I Learned

  • How to use Gemini Live API for true real-time bidirectional voice
  • How to build reliable agentic loops with Google ADK
  • How powerful Gemini Vision is for understanding arbitrary UI layouts
  • How to deploy scalable agent backends on Google Cloud Run
  • That the biggest challenge in web automation is not the actions — it's the verification and error recovery after each action

What's Next

  • Multi-tab and multi-window task support
  • Scheduled recurring tasks ("do this every Monday")
  • Mobile screen support via device streaming
  • Team workflows — share automation recipes with colleagues
  • Support for 20+ languages via Gemini's multilingual capabilities
