Architectural Diagram.
Axis running on Google Cloud Run (describing what it's seeing)
Image generation based on screen context (Both Live and chat session)
Get back to where you left off...
Drag and Drop files to Axis to give context
Don't forget to drop your Feedback!

🧭 Axis: Voice-Driven Browser Agent

Hackathon Category: UI Navigator ☸️
Agent Framework: Google Agent Development Kit (ADK) v0.3.0+
Google Cloud Services: Vertex AI · Cloud Run · Cloud Build · Firestore · Artifact Registry · Google OAuth 2.0

Gemini Models Used:

gemini-live-2.5-flash-native-audio · gemini-2.5-flash · gemini-2.5-flash-image

Backend: Hosted on Google Cloud Run (us-central1) — always on, no local server required.

Live Endpoint Health Check

Inspiration

Traditional automation tools and accessibility software struggle with the modern web. They depend on brittle selectors, rigid macros, and interfaces that break the moment a UI layout changes.

At the same time, millions of people with motor disabilities—including individuals living with ALS, Parkinson’s disease, or spinal cord injuries—face significant barriers when interacting with computers. Keyboard and mouse interaction can be exhausting, painful, or sometimes impossible.

The idea behind Axis was simple but powerful:

What if your browser could see the screen the way a human assistant does, understand your intent through natural conversation, and execute actions on your behalf?

By combining Gemini’s multimodal reasoning with real-time voice interaction, Axis turns the browser into a voice-driven agent that listens, observes the screen, understands UI context, and performs actions autonomously.

Instead of rigid automation scripts, the browser becomes a context-aware AI assistant.

What Axis Does

Axis is a voice-driven browser agent that acts as the user’s hands on the screen.

It observes the browser interface through screenshots and uses Gemini multimodal reasoning to interpret visual UI elements and determine the correct actions to perform.

Users can interact naturally with the browser:

“Scroll down and open the first article.”
“Search YouTube for AI tutorials.”
“Fill this form using the information in my uploaded document.”

Axis then translates the intent into executable browser actions such as:

Clicking
Typing
Navigating
Scrolling
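
As a rough illustration of that translation step, the sketch below validates a model-proposed action before it would be handed to the browser. The action vocabulary mirrors the four types listed above, but the field names (`action`, `x`, `y`, `text`, `url`, `direction`) and the `parse_action` helper are assumptions made for this example, not the schema Axis actually uses.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical action vocabulary: the four action types listed above,
# each mapped to the parameters this sketch assumes it needs.
ALLOWED_ACTIONS = {
    "click": {"x", "y"},
    "type": {"text"},
    "navigate": {"url"},
    "scroll": {"direction"},
}


@dataclass(frozen=True)
class BrowserAction:
    kind: str
    params: dict[str, Any]


def parse_action(raw: dict[str, Any]) -> BrowserAction:
    """Validate a model-proposed action dict and wrap it in a BrowserAction."""
    kind = raw.get("action")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {kind!r}")
    params = {k: v for k, v in raw.items() if k != "action"}
    missing = ALLOWED_ACTIONS[kind] - set(params)
    if missing:
        raise ValueError(f"{kind} is missing parameters: {sorted(missing)}")
    return BrowserAction(kind=kind, params=params)
```

Rejecting unknown or incomplete actions up front means a malformed model response fails loudly on the backend instead of producing a stray click in the user's tab.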

Core Features

🎙️ Live Voice Navigation — Talk to your browser. Axis sees the screen and acts — clicking, typing, scrolling, and navigating across sites through natural, interruptible voice commands.
💬 AI Chat Mode — Full text-based agent with the same screenshot awareness and DOM execution as voice mode. Sessions are saved and searchable.
👁️ Visual Screen Understanding — Uses real-time screenshots instead of DOM scrapers. Works on any website regardless of UI framework, shadow DOMs, or dynamic content.
🖼️ AI Image Generation — Ask Axis to generate an image mid-conversation. Powered by gemini-2.5-flash-image. Download with one click.
📁 File Upload — Drag and drop PDFs, images, or documents onto the session panel. Axis reads and reasons about them in context.
🕘 Session History — Every session is auto-saved to Firestore with a Gemini-generated headline summary, so you can pick up where you left off.
♿ Accessibility First — Built for users with motor disabilities, ALS, Parkinson's, or any other condition that makes keyboard and mouse interaction difficult, as well as for anyone interested in AI automation beyond the chat box.

Tech Stack

Layer	Technology
AI Models	`gemini-live-2.5-flash-native-audio` · `gemini-2.5-flash` · `gemini-2.5-flash-image`
Agent Framework	Google Agent Development Kit (ADK) v0.3.0+
AI SDK	`google-genai` v0.8.0+
Backend	Python · FastAPI · WebSockets · Asyncio
Frontend	Chrome Extension (Manifest V3) · Vanilla JS · HTML · CSS
Database	Google Firestore (Session & Transcript Storage)
Hosting	Google Cloud Run (us-central1)
CI/CD	Google Cloud Build · Artifact Registry
Auth	Google OAuth 2.0 via `chrome.identity`
Infrastructure	Terraform · `cloudbuild.yaml`
Rate Limiting	Slowapi (Backend Protection)
Email	SMTP (feedback delivery)
HTTP Client	httpx (Async Tool Execution)

Learning Curve

🦾 AI & Multimodal Reasoning

Navigation Desync — An agent running in a side panel must track explicit window IDs; otherwise it gets "lost" during tab switches.
Latency vs. Quality — Using 16 kHz PCM audio via ADK gave near-instant response times, bridging the gap between "interaction" and "conversation."
Visual Logic — Gemini excels at identifying clickable elements purely from visual context, making it far more robust than traditional automation scripts that rely on brittle CSS selectors.
Context Pruning — Screenshots are large. Without pruning, the context window bloats fast and hits Error 1007. Axis keeps only the last screenshot as image data, replaces older ones with a text placeholder, and caps history at 20 turns to keep Gemini sharp.
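
The pruning rule described in the last bullet can be sketched in a few lines. This is a minimal illustration, assuming each turn is a dict with optional `text` and `image` keys; that turn shape, the `prune_history` name, and the placeholder wording are all hypothetical. Only the 20-turn cap and the keep-the-latest-screenshot rule come from the description above.

```python
from typing import Any

MAX_TURNS = 20  # history cap stated in the bullet above
PLACEHOLDER = "[earlier screenshot omitted]"  # hypothetical placeholder text


def prune_history(turns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the last MAX_TURNS turns and only the newest screenshot as image data."""
    recent = turns[-MAX_TURNS:]
    # Index of the most recent turn that still carries image bytes, if any.
    last_image = max(
        (i for i, t in enumerate(recent) if t.get("image") is not None),
        default=None,
    )
    pruned = []
    for i, turn in enumerate(recent):
        if turn.get("image") is not None and i != last_image:
            # Swap the heavy image payload for a short text marker.
            text = f"{turn.get('text', '')} {PLACEHOLDER}".strip()
            turn = {**turn, "image": None, "text": text}
        pruned.append(turn)
    return pruned
```

The function builds new dicts rather than mutating its input, so the full history can still be written to storage while only the pruned copy is sent to the model.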

☁️ Cloud & Production Stability

Gemini Live API — Getting the API to work reliably required navigating undocumented model aliasing bugs, SDK version dependencies, and regional availability constraints that only surfaced at runtime.
Cloud Run Entrypoints — Deploying for the first time revealed how much implicit setup a local environment handles invisibly — credentials, permissions, environment variables, and the shape of the entrypoint file all had to be explicitly configured from scratch.
Security Hygiene — Implemented proper .gcloudignore rules after discovering sensitive credential files were being accidentally bundled into source uploads. Manually purged all previously uploaded source zips from the GCS staging bucket.
Rate Limiting on Cloud Run — All traffic arrives from the same load balancer IP, causing the rate limiter to throttle all users simultaneously. Fixed by reading the real client IP from X-Forwarded-For. Added exponential backoff with 3 retries and live status updates to the UI on quota hits.
Graceful Degradation — Built a noise filter to ignore single-character mic artifacts, output sanitisation to strip internal agent monologue before it reaches the user, and a one-click connection recovery screen instead of a full page crash.
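
The rate-limiting fix above has two parts that are easy to show in isolation: picking the client IP out of `X-Forwarded-For`, and retrying quota errors with exponential backoff. The sketch below uses only the standard library. `client_ip`, `with_backoff`, `QuotaError`, the delay values, and the `on_retry` callback are illustrative assumptions rather than the Axis implementation; only the header name and the 3-retry count come from the text.

```python
import asyncio
import random
from typing import Awaitable, Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def client_ip(headers: Mapping[str, str], peer_ip: str) -> str:
    """Return the left-most X-Forwarded-For entry, else the socket peer address.

    Assumes header names are already lowercased (a plain dict is case-sensitive).
    """
    forwarded = headers.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    return first or peer_ip


class QuotaError(Exception):
    """Hypothetical stand-in for whatever quota error the model client raises."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    on_retry: Optional[Callable[[int, float], None]] = None,
) -> T:
    """Await call(), retrying up to `retries` times with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except QuotaError:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            # Delay doubles each attempt; a little jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            if on_retry:
                on_retry(attempt + 1, delay)  # e.g. push a status update to the UI
            await asyncio.sleep(delay)
    raise AssertionError("unreachable")
```

A function like `client_ip` could serve as the key function of a rate limiter, so that each user gets an individual bucket instead of sharing the load balancer's. One caveat: clients can set `X-Forwarded-For` themselves, so which entry is trustworthy depends on the proxy chain in front of the service.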

The biggest learning was that multimodal agent development is still a frontier. The tooling is powerful but young — model strings change, SDK versions matter more than expected, and the gap between "it works locally" and "it works in production" is wider than in traditional web development. Shipping Axis under a 4-day deadline made every one of these lessons land hard and fast.

🔭 What's Next for Axis

Usage-based pricing via Razorpay
Google Calendar API integration (add events directly from Axis)
Google Meet API — join calls, mute, unmute, and raise hand by voice
Gmail API — compose, reply, archive, and label emails hands-free
iframe support for embedded forms
Chrome Web Store Launch

Developer: Sharon Varghese
Built for the Gemini Live Agent Challenge 2026.

Built With

fastapi
firestore
gemini-2.5-flash
gemini-2.5-flash-image
gemini-live-2.5-flash-native-audio
google-agent-development-kit
google-cloud-build
google-cloud-run
google-genai
google-oauth2
httpx
manifest
slowapi
smtp
terraform
websockets

Updates

Sharon Varghese started this project — Mar 15, 2026 04:40 PM EDT
