🧭 Axis: Voice-Driven Browser Agent

Hackathon Category: UI Navigator ☸️
Agent Framework: Google Agent Development Kit (ADK) v0.3.0+
Google Cloud Services: Vertex AI · Cloud Run · Cloud Build · Firestore · Artifact Registry · Google OAuth 2.0

Gemini Models Used:

gemini-live-2.5-flash-native-audio · gemini-2.5-flash · gemini-2.5-flash-image

Backend: Hosted on Google Cloud Run (us-central1) — always on, no local server required.

Inspiration

Traditional automation tools and accessibility software struggle with the modern web. They depend on brittle selectors, rigid macros, and interfaces that break the moment a UI layout changes.

At the same time, millions of people with motor disabilities—including individuals living with ALS, Parkinson’s disease, or spinal cord injuries—face significant barriers when interacting with computers. Keyboard and mouse interaction can be exhausting, painful, or sometimes impossible.

The idea behind Axis was simple but powerful:

What if your browser could see the screen the way a human assistant does, understand your intent through natural conversation, and execute actions on your behalf?

By combining Gemini’s multimodal reasoning with real-time voice interaction, Axis turns the browser into a voice-driven agent that listens, observes the screen, understands UI context, and performs actions autonomously.

Instead of rigid automation scripts, the browser becomes a context-aware AI assistant.

What Axis Does

Axis is a voice-driven browser agent that acts as the user’s hands on the screen.

It observes the browser interface through screenshots and uses Gemini multimodal reasoning to interpret visual UI elements and determine the correct actions to perform.

Users can interact naturally with the browser:

  • “Scroll down and open the first article.”
  • “Search YouTube for AI tutorials.”
  • “Fill this form using the information in my uploaded document.”

Axis then translates the intent into executable browser actions such as:

  • Clicking
  • Typing
  • Navigating
  • Scrolling
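
As an illustration of what an "executable browser action" could look like on the Python backend, here is a minimal sketch. The `BrowserAction` schema, its field names, and the JSON message shape are assumptions made for this example and are not taken from the Axis source; the four action types mirror the list above.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    CLICK = "click"
    TYPE = "type"
    NAVIGATE = "navigate"
    SCROLL = "scroll"


@dataclass
class BrowserAction:
    """One step the agent asks the extension to perform (hypothetical schema)."""
    type: ActionType
    x: Optional[int] = None        # viewport coordinates for a click
    y: Optional[int] = None
    text: Optional[str] = None     # text to type into the focused element
    url: Optional[str] = None      # destination for a navigation
    delta_y: Optional[int] = None  # pixels to scroll; positive scrolls down

    def validate(self) -> None:
        """Raise ValueError if a field required by this action type is missing."""
        required = {
            ActionType.CLICK: ("x", "y"),
            ActionType.TYPE: ("text",),
            ActionType.NAVIGATE: ("url",),
            ActionType.SCROLL: ("delta_y",),
        }[self.type]
        missing = [name for name in required if getattr(self, name) is None]
        if missing:
            raise ValueError(f"{self.type.value} action is missing: {', '.join(missing)}")

    def to_message(self) -> str:
        """Serialize to a JSON string, dropping unset fields."""
        self.validate()
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        payload["type"] = self.type.value
        return json.dumps(payload)
```

Validating before serializing means a malformed model output (for example, a click with no coordinates) fails on the backend instead of reaching the page.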

Core Features

  • 🎙️ Live Voice Navigation — Talk to your browser. Axis sees the screen and acts — clicking, typing, scrolling, and navigating across sites through natural, interruptible voice commands.

  • 💬 AI Chat Mode — Full text-based agent with the same screenshot awareness and DOM execution as voice mode. Sessions are saved and searchable.

  • 👁️ Visual Screen Understanding — Uses real-time screenshots instead of DOM scrapers. Works on any website regardless of UI framework, shadow DOMs, or dynamic content.

  • 🖼️ AI Image Generation — Ask Axis to generate an image mid-conversation. Powered by gemini-2.5-flash-image. Download with one click.

  • 📁 File Upload — Drag and drop PDFs, images, or documents onto the session panel. Axis reads and reasons about them in context.

  • 🕘 Session History — Every session is auto-saved to Firestore with a Gemini-generated headline summary, so you can pick up where you left off.

  • ♿ Accessibility First — Built for users with motor disabilities, ALS, Parkinson's, or any condition that makes keyboard and mouse interaction difficult, and for anyone curious about AI automation beyond the chat box.

Tech Stack

AI Models: gemini-live-2.5-flash-native-audio · gemini-2.5-flash · gemini-2.5-flash-image
Agent Framework: Google Agent Development Kit (ADK) v0.3.0+
AI SDK: google-genai v0.8.0+
Backend: Python · FastAPI · WebSockets · Asyncio
Frontend: Chrome Extension (Manifest V3) · Vanilla JS · HTML · CSS
Database: Google Firestore (Session & Transcript Storage)
Hosting: Google Cloud Run (us-central1)
CI/CD: Google Cloud Build · Artifact Registry
Auth: Google OAuth 2.0 via chrome.identity
Infrastructure: Terraform · cloudbuild.yaml
Rate Limiting: Slowapi (Backend Protection)
Email: SMTP (feedback delivery)
HTTP Client: httpx (Async Tool Execution)

Learning Curve

🦾 AI & Multimodal Reasoning

  • Navigation Desync — Tracking explicit Window IDs is critical for agents operating in sidepanels to prevent the agent from getting "lost" during tab switches.
  • Latency vs. Quality — Using 16 kHz PCM audio via ADK allowed near-instant response times, bridging the gap between "interaction" and "conversation."
  • Visual Logic — Gemini excels at identifying clickable elements purely from visual context, making it far more robust than traditional automation scripts that rely on brittle CSS selectors.
  • Context Pruning — Screenshots are large. Without pruning, the context window bloats fast and hits Error 1007. Axis keeps only the last screenshot as image data, replaces older ones with a text placeholder, and caps history at 20 turns to keep Gemini sharp.
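
The context-pruning rule described above can be sketched as follows. This is a minimal illustration, not the Axis implementation: the turn structure (a dict with `role` and a list of `parts`, where a part holds either `text` or `image`), the `prune_history` name, and the placeholder wording are all assumptions. Only the policy comes from the text: keep the latest screenshot, replace older ones with text, cap history at 20 turns.

```python
from typing import Any

MAX_TURNS = 20                                  # cap stated in the text above
PLACEHOLDER = "[earlier screenshot removed]"    # hypothetical wording


def prune_history(history: list[dict[str, Any]],
                  max_turns: int = MAX_TURNS) -> list[dict[str, Any]]:
    """Return a pruned copy of the history; the input list is not modified."""
    recent = history[-max_turns:] if max_turns > 0 else []

    pruned: list[dict[str, Any]] = []
    kept_latest_image = False

    # Walk newest to oldest so the first image we meet is the one to keep.
    for turn in reversed(recent):
        new_parts = []
        for part in reversed(turn["parts"]):
            if "image" in part:
                if kept_latest_image:
                    part = {"text": PLACEHOLDER}   # swap old screenshot for text
                else:
                    kept_latest_image = True       # keep only the newest image
            new_parts.append(part)
        pruned.append({"role": turn["role"], "parts": new_parts[::-1]})

    return pruned[::-1]                            # restore chronological order
```

Calling `prune_history(history)` before each model request keeps the payload bounded: at most one image and at most `max_turns` turns, regardless of session length.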

☁️ Cloud & Production Stability

  • Gemini Live API — Getting the API to work reliably required navigating undocumented model aliasing bugs, SDK version dependencies, and regional availability constraints that only surfaced at runtime.
  • Cloud Run Entrypoints — Deploying for the first time revealed how much implicit setup a local environment handles invisibly — credentials, permissions, environment variables, and the shape of the entrypoint file all had to be explicitly configured from scratch.
  • Security Hygiene — Implemented proper .gcloudignore rules after discovering sensitive credential files were being accidentally bundled into source uploads. Manually purged all previously uploaded source zips from the GCS staging bucket.
  • Rate Limiting on Cloud Run — All traffic arrives from the same load balancer IP, so an IP-keyed rate limiter treats every user as a single client and throttles them all at once. Fixed by reading the real client IP from X-Forwarded-For. Added exponential backoff with 3 retries and live status updates to the UI on quota hits.
  • Graceful Degradation — Built a noise filter to ignore single-character mic artifacts, output sanitisation to strip internal agent monologue before it reaches the user, and a one-click connection recovery screen instead of a full page crash.
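
The rate-limiting fix and the retry policy above can be sketched with the standard library alone. The names `client_ip`, `with_backoff`, and `QuotaExceeded` are hypothetical, and the 1-second base delay is an assumption; the text only specifies X-Forwarded-For, exponential backoff, 3 retries, and status updates. Taking the leftmost X-Forwarded-For entry is also an assumption: which entry is trustworthy depends on the proxy chain in front of the service, and the header can be spoofed by clients.

```python
import asyncio
from typing import Awaitable, Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical stand-in for whatever error signals a quota hit."""


def client_ip(headers: Mapping[str, str], fallback: str) -> str:
    """Pick the first X-Forwarded-For entry, else the socket peer address.

    Expects lowercase header keys (Starlette headers are case-insensitive,
    a plain dict is not).
    """
    forwarded = headers.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    return first or fallback


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    on_status: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call`, retrying on QuotaExceeded with delays of 1x, 2x, 4x base."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except QuotaExceeded:
            if attempt == retries:
                raise                      # retries exhausted, surface the error
            delay = base_delay * 2 ** attempt
            if on_status:
                on_status(f"Quota hit, retrying in {delay:.0f}s "
                          f"(retry {attempt + 1} of {retries})")
            await sleep(delay)
    raise RuntimeError("unreachable")
```

With slowapi, a function like `client_ip` would typically be wrapped and passed as the `key_func` argument of `Limiter`, so that each user gets an individual bucket. The `on_status` callback is where a message could be pushed to the UI over the existing WebSocket.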

The biggest learning was that multimodal agent development is still a frontier. The tooling is powerful but young — model strings change, SDK versions matter more than expected, and the gap between "it works locally" and "it works in production" is wider than in traditional web development. Shipping Axis under a 4-day deadline made every one of these lessons land hard and fast.

🔭 What's Next for Axis

  • Usage-based pricing via Razorpay
  • Google Calendar API integration (add events directly from Axis)
  • Google Meet API — join calls, mute, unmute, and raise hand by voice
  • Gmail API — compose, reply, archive, and label emails hands-free
  • iframe support for embedded forms
  • Chrome Web Store launch

Developer: Sharon Varghese
Built for the Gemini Live Agent Challenge 2026.

Built With

  • fastapi
  • firestore
  • gemini-2.5-flash
  • gemini-2.5-flash-image
  • gemini-live-2.5-flash-native-audio
  • google-agent-development-kit
  • google-cloud-build
  • google-cloud-run
  • google-genai
  • google-oauth2
  • httpx
  • manifest
  • slowapi
  • smtp
  • terraform
  • websockets