VOICE UI NAVIGATOR

Project Title: Voice UI Navigator

Tagline: Stop typing. Start seeing. Your AI that reads your screen and speaks your answers.

Summary of Features and Functionality

Voice UI Navigator is a next-generation AI agent built for the UI Navigator category of the Gemini Live Agent Challenge. It transforms how users interact with their browser by combining real-time voice conversation, Gemini multimodal vision, and live web search into one seamless experience.

The agent works in three steps:

1. You share your screen: attach a screenshot of any browser page or application.
2. You speak your question: ask anything about what you see or what you want to find.
3. The agent responds with voice: it analyzes the screen, searches the web if needed, and speaks results back to you in real time.

Unlike traditional screen readers or automation tools, the Voice UI Navigator does not rely on DOM access, JavaScript injection, or browser APIs. It reads the screen purely visually, exactly as a human would, making it universally compatible with any application, website, or desktop interface.

Key features:

Real-time voice conversation powered by the Gemini Live API
Multimodal screen understanding using Gemini 2.0 Flash vision
Live Google Search grounding for up-to-date research
Natural interruption-friendly voice responses
Zero DOM access: pure visual AI understanding
Deployed on Google Cloud Run for global accessibility

Technologies Used

Gemini 2.0 Flash Live (gemini-2.0-flash-live-001): real-time bidirectional voice streaming via the Gemini Live API
Gemini 2.0 Flash (gemini-2.0-flash): multimodal vision model for screenshot analysis
Google ADK (Agent Development Kit v1.25.1): agent orchestration framework, tool execution, and /run_live WebSocket endpoint for voice
Google Cloud Run: serverless container hosting for the deployed agent
Google Cloud Build: automated Docker image building and pushing to Artifact Registry
Google Artifact Registry: Docker image storage
Python 3.11: backend language
FastAPI + uvicorn: web server (built into ADK)

Google Cloud Services Used

Google Cloud Run (agent hosting)
Google Cloud Build (image build and push pipeline)
Google Artifact Registry (Docker image storage)
Gemini API via Google AI Studio

Data Sources

Live Google Search grounding via ADK's built-in google_search tool: provides real-time, up-to-date web results
User-provided browser screenshots: no external datasets used; the agent works on any screen the user shares

Findings and Learnings

Building this project surfaced several important insights:

ADK's directory structure is strict — the agent folder must be a direct child of the directory passed to adk web. Subdirectories inside the agent folder are scanned as potential agent packages, which caused early deployment failures when the tools/ subfolder was mistakenly treated as a separate agent.
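To make the layout constraint concrete, here is a minimal sketch using only the Python standard library. It lists the direct child directories of the folder that would be passed to adk web, along with any folders nested inside each one. The helper name describe_layout and the example folder name voice_navigator are hypothetical, and the check only mirrors the behavior described above; it is not ADK's actual discovery logic.

```python
from pathlib import Path


def describe_layout(agents_root: str) -> dict[str, list[str]]:
    """Map each direct child directory of agents_root to its nested subdirectories.

    Hypothetical helper: direct children are the folders we expect to be
    treated as agents; nested folders (such as tools/) are worth reviewing
    because, in our experience, they were scanned as agent packages too.
    Hidden folders and dunder folders like __pycache__ are skipped.
    """
    root = Path(agents_root)
    layout: dict[str, list[str]] = {}
    for child in sorted(root.iterdir()):
        if child.is_dir() and not child.name.startswith((".", "__")):
            nested = sorted(
                p.name
                for p in child.iterdir()
                if p.is_dir() and not p.name.startswith((".", "__"))
            )
            layout[child.name] = nested
    return layout
```

Running this against the parent folder before deploying gives a quick view of which folders sit at the agent level and which nested folders might need to be flattened or moved.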

Gemini Live API requires a specific model — only gemini-2.0-flash-live-001 supports the bidirectional audio streaming needed for real-time voice. Standard models like gemini-1.5-flash or gemini-2.0-flash do not work with the /run_live WebSocket endpoint.

Separating vision from live audio improves reliability — making a synchronous gemini-2.0-flash call inside the screenshot tool, rather than relying on the live session to handle vision, kept the voice stream stable and responsive.
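A rough sketch of that separation, not the project's actual tool code: the screenshot tool makes its own blocking vision call and returns plain text for the live session to speak. The names analyze_screenshot, gemini_generate, and the prompt wording are hypothetical. The wiring assumes the google-genai Python SDK is installed, an API key is set in the environment, and the screenshot is a PNG.

```python
from typing import Callable

VISION_MODEL = "gemini-2.0-flash"  # vision model named in this write-up

# A generate function takes (model, image_bytes, prompt) and returns text.
GenerateFn = Callable[[str, bytes, str], str]


def gemini_generate(model: str, image_bytes: bytes, prompt: str) -> str:
    """Blocking vision call through the google-genai SDK (assumed installed)."""
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=model,
        contents=[
            # Assumption: the screenshot is PNG-encoded.
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            prompt,
        ],
    )
    return response.text or ""


def analyze_screenshot(
    image_bytes: bytes,
    question: str,
    generate: GenerateFn = gemini_generate,
) -> dict:
    """Hypothetical tool body: answer a question about a screenshot.

    The vision request runs outside the live audio session, so a slow or
    failed call surfaces as a tool result instead of stalling the stream.
    """
    if not image_bytes:
        return {"status": "error", "message": "No screenshot provided."}
    prompt = "Describe the UI in this screenshot, then answer: " + question
    try:
        answer = generate(VISION_MODEL, image_bytes, prompt)
    except Exception as exc:  # report the failure back to the agent
        return {"status": "error", "message": f"Vision call failed: {exc}"}
    return {"status": "ok", "answer": answer}
```

Passing the generate function as a parameter is a design choice made for this sketch: it lets the tool logic be exercised with a stub, without network access or an API key.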

Cloud Build permissions require explicit setup — the Compute Engine default service account used by Cloud Build does not have Artifact Registry write access by default. Granting roles/artifactregistry.writer to the correct service account was a necessary manual step.
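For reference, the grant can be expressed as a single gcloud call. The sketch below only builds the argument list, so it can be inspected before anything is run. It assumes the account in question is the Compute Engine default service account, whose email has the form PROJECT_NUMBER-compute@developer.gserviceaccount.com; the function name and the example project values are hypothetical, and the right account may differ in other projects.

```python
def artifact_registry_grant_command(project_id: str, project_number: str) -> list[str]:
    """Build the gcloud command that grants Artifact Registry write access.

    Assumes the Compute Engine default service account is the one
    Cloud Build uses, as was the case in this project.
    """
    member = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        "--role=roles/artifactregistry.writer",
    ]


# Example with placeholder values; pass the list to subprocess.run(..., check=True)
# to apply it, which requires gcloud and permission to edit IAM policy.
command = artifact_registry_grant_command("my-project", "123456789012")
print(" ".join(command))
```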

Pure visual AI is more powerful than expected: working from a raw screenshot with no DOM context, Gemini understood UI layouts, read text, identified buttons and links, and suggested actions with impressive accuracy, which opens the door to truly universal screen assistants.

Public Code Repository

https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE

Spin-up instructions are included in the README. The project is fully reproducible locally with a Gemini API key.

Live Deployment

https://voice-navigator-913580598688.us-central1.run.app