Inspiration 💡
The modern web is fundamentally broken for visually impaired users.
Traditional screen readers rely on parsing the Document Object Model (DOM). But today's web is filled with complex Single Page Applications (SPAs), dynamically injected <div> tags masquerading as buttons, and missing ARIA labels. When a screen reader hits a cluttered, notoriously clunky real-world website, like a local grocery checkout or a DMV portal, it often reads out a wall of meaningless markup, leaving the user stranded.
Focusing heavily on the human side of tech, I asked myself: What if an AI didn't parse the code, but simply looked at the screen exactly like a human does? Inspired by the Google Gemini Live Agent Challenge, I set out to build IAN (Intelligent Accessibility Navigator). IAN acts as a digital equalizer for the visually impaired. It is a multimodal Next-Gen Agent that bypasses the DOM entirely. You speak to it naturally, and it uses vision-based reasoning to physically navigate, click, and type through the web on your behalf.
What it does
IAN is a voice-controlled, fully autonomous web navigator tailored for real-world accessibility.
- You Speak: Using a high-contrast, Neo-Brutalist React dashboard designed specifically for low-vision users, the user holds a button and speaks a natural command (e.g., "Go to Amazon and search for running shoes").
- IAN Listens: The audio is streamed in real time to the gemini-2.5-flash-native-audio model to instantly extract the user's intent.
- IAN Sees & Acts: A headless Playwright browser spins up on the backend. IAN takes a screenshot of the page, feeds it to gemini-2.5-flash (Vision), calculates the exact $(X, Y)$ pixel coordinates of the target element, and physically clicks or types.
- IAN Reports: The live browser view and the AI's actions are streamed back to the user's dashboard in real time.
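The "Sees & Acts" step depends on turning the vision model's answer into a pixel position that the browser can click. The write-up does not say what output format IAN's prompt requests, so the sketch below is only an illustration: it assumes the model returns a bounding box as [ymin, xmin, ymax, xmax] normalized to a 0-1000 scale, and the function name `box_to_click_point` and the viewport sizes are hypothetical.

```python
from typing import Sequence, Tuple


def box_to_click_point(
    box: Sequence[float],
    viewport_width: int,
    viewport_height: int,
    scale: float = 1000.0,
) -> Tuple[int, int]:
    """Map a normalized [ymin, xmin, ymax, xmax] box to the pixel at its center.

    The box format and the 0-1000 scale are assumptions, not confirmed by the
    project write-up.
    """
    if len(box) != 4:
        raise ValueError("expected [ymin, xmin, ymax, xmax]")
    ymin, xmin, ymax, xmax = box
    if not (0 <= xmin <= xmax <= scale and 0 <= ymin <= ymax <= scale):
        raise ValueError(f"box out of range: {box!r}")

    # Center of the box in normalized units, scaled to the viewport.
    x = (xmin + xmax) / 2 / scale * viewport_width
    y = (ymin + ymax) / 2 / scale * viewport_height

    # Clamp so a box touching the edge never yields an off-screen click.
    x_px = min(viewport_width - 1, max(0, round(x)))
    y_px = min(viewport_height - 1, max(0, round(y)))
    return x_px, y_px
```

With Playwright's Python API, the returned pair could then be passed to `page.mouse.click(x, y)`. Clicking the center of the box rather than a corner gives some tolerance for boxes that are slightly off.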
How we built it
To move at startup speed while maintaining enterprise-grade stability, we combined rapid prototyping with robust cloud infrastructure:
- "Vibe Coding" the UI: To rapidly prototype the frontend, I utilized Google Antigravity and the Google Stitch MCP skills. This allowed me to "vibe code" the high-contrast Neo-Brutalist React dashboard at lightning speed, wiring up the WebSocket connections to the backend in record time.
- The Google ADK (Agent Development Kit): I utilized the ADK's InMemorySessionService to orchestrate this multi-agent system, managing the complex WebSocket streaming between the frontend React app and the Gemini Live API.
- Dual-Model Architecture: To prevent heavy visual processing from blocking the audio WebSocket loop, I decoupled the agent into two brains: an Audio Orchestrator (VAD and intent parsing) and a Visual Navigator (Playwright automation).
- Enterprise-Grade Infrastructure: The clean FastAPI backend and Dockerized headless Chromium instances are deployed on Google Cloud Run configured with min-instances 0 for strict cost control. Furthermore, I integrated Google Cloud Secret Manager to securely inject my Gemini API keys at runtime, ensuring zero credential exposure in the codebase.
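One common way to decouple two "brains" like the Audio Orchestrator and the Visual Navigator is to pass goals between them through a queue, so the audio side never waits on browser work. The sketch below illustrates that pattern with `asyncio.Queue`; it is an assumption about how such a split could look, not IAN's actual code, and every function name in it is hypothetical.

```python
import asyncio
from typing import List, Optional


async def audio_orchestrator(commands: List[str], goals: "asyncio.Queue[Optional[str]]") -> None:
    """Stand-in for the audio loop: hand each parsed intent off and move on."""
    for command in commands:
        await goals.put(command)  # cheap hand-off; no browser work happens here
    await goals.put(None)         # sentinel: no more goals


async def visual_navigator(goals: "asyncio.Queue[Optional[str]]", done: List[str]) -> None:
    """Stand-in for the navigator: consume goals one at a time."""
    while True:
        goal = await goals.get()
        if goal is None:
            break
        await asyncio.sleep(0.01)  # placeholder for screenshot + vision + click
        done.append(goal)


async def run_pipeline(commands: List[str]) -> List[str]:
    """Run both sides concurrently and return the goals the navigator finished."""
    goals: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    done: List[str] = []
    await asyncio.gather(
        audio_orchestrator(commands, goals),
        visual_navigator(goals, done),
    )
    return done
```

Calling `asyncio.run(run_pipeline(["open amazon.com", "search for running shoes"]))` processes the goals in order while the producer finishes immediately, which is the property the dual-model split is after.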
Challenges we ran into
Building a multimodal Live Agent is incredibly complex. Here are the hurdles we overcame:
1. Playwright Thread Conflicts & Error Handling:
Managing a synchronous headless browser inside a highly concurrent FastAPI WebSocket server caused severe blocking of the event loop. I added robust error handling, moved the blocking Playwright calls off the loop with asyncio.to_thread, and guarded them with an asyncio.Lock() so that browser actions run one at a time instead of colliding across threads.
2. The Web Audio API to Gemini Pipeline:
Browsers natively capture audio as Float32 samples at the device's sample rate, typically 44.1kHz or 48kHz. However, the Gemini Native Audio model strictly requires 16kHz, 16-bit PCM. If you send the wrong format, the LLM hallucinates static. I built a custom JavaScript audio processor to manually downsample and convert the raw audio buffers on the fly using this transformation:
$$PCM_{16}[i] = \max(-32768, \min(32767, Float_{32}[i] \times 32768))$$
3. The ADK Tool Execution Crash:
Passing complex browser-automation tools directly into the gemini-2.5-flash-native-audio root agent caused persistent disconnects with WebSocket close code 1008 (policy violation). To fix this, I stripped the tools from the ADK agent and implemented a custom "Hackathon Interceptor." I prompted the audio model to silently output a strict [NAVIGATE: goal] tag, intercepted it in the FastAPI loop, and manually triggered the background Playwright thread.
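For the thread-conflict challenge (item 1), the combination of asyncio.to_thread and asyncio.Lock() can be sketched roughly as follows. This is a minimal illustration under assumptions: the `BrowserWorker` class is hypothetical, and `time.sleep` stands in for the synchronous Playwright work. Playwright's sync API is also not thread-safe, so a real version would likely need to keep all browser calls on one dedicated thread, which this sketch does not show.

```python
import asyncio
import time


class BrowserWorker:
    """Serializes blocking browser calls and runs them off the event loop."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.active = 0       # blocking calls currently in flight
        self.max_active = 0   # highest concurrency ever observed

    def _blocking_action(self, goal: str) -> str:
        # Placeholder for synchronous browser automation (hypothetical).
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.05)
        self.active -= 1
        return f"done: {goal}"

    async def run(self, goal: str) -> str:
        async with self._lock:  # only one browser action at a time
            # The blocking call runs in a worker thread, so the event loop
            # stays free to service WebSocket traffic in the meantime.
            return await asyncio.to_thread(self._blocking_action, goal)
```

Because the lock is held across the `to_thread` call, two overlapping requests queue up rather than driving the browser simultaneously, while other coroutines keep running.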
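For the audio pipeline (item 2), the original processor was written in JavaScript and is not shown in the write-up. The Python sketch below only illustrates the same arithmetic: a naive linear-interpolation downsampler followed by the clamp-and-scale formula. The function names are hypothetical, little-endian byte order is an assumption, and truncation toward zero is one possible reading of the formula, which does not specify rounding.

```python
import struct
from typing import List, Sequence


def downsample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (no low-pass filter, so some aliasing)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    out_len = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate
    last = len(samples) - 1
    out: List[float] = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two nearest input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Apply PCM16[i] = max(-32768, min(32767, x * 32768)) and pack as int16."""
    ints = [max(-32768, min(32767, int(x * 32768))) for x in samples]
    return struct.pack(f"<{len(ints)}h", *ints)  # "<h" = little-endian int16
```

A 10 ms buffer of 441 samples at 44.1kHz comes out as 160 samples at 16kHz, or 320 bytes once packed. A production resampler would normally low-pass filter before decimating.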
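For the interceptor (item 3), pulling a [NAVIGATE: goal] tag out of the model's text could be as simple as a regular expression. The sketch below is a guess at the shape of that step, not the author's implementation; the function name and the whitespace handling are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches "[NAVIGATE: some goal]" and captures the goal text.
NAVIGATE_TAG = re.compile(r"\[NAVIGATE:\s*(?P<goal>[^\]]+?)\s*\]")


def extract_navigation_goal(model_text: str) -> Tuple[Optional[str], str]:
    """Return (goal, text_without_tag); goal is None when no usable tag exists."""
    match = NAVIGATE_TAG.search(model_text)
    if match is None:
        return None, model_text
    goal = match.group("goal").strip() or None
    # Remove the tag and collapse the whitespace it leaves behind.
    remaining = model_text[: match.start()] + " " + model_text[match.end():]
    return goal, " ".join(remaining.split())
```

For example, `extract_navigation_goal("Sure. [NAVIGATE: open the checkout page] One moment.")` yields the goal `"open the checkout page"` plus the leftover text, which the server loop could then hand to the background browser task.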
Accomplishments that we're proud of 🏆
- Bypassing the DOM: Proving that we can navigate complex, heavily obfuscated real-world websites using purely visual spatial reasoning instead of brittle HTML selectors.
- Thread-Safe Multimodality: Successfully keeping a live, full-duplex WebSocket connection open while simultaneously running a heavy, synchronous headless browser in the background without dropping frames.
- Secure Cloud-Native Deployment: Successfully containerizing a massive Playwright environment and deploying it securely on Google Cloud Run with Secret Manager and strict IAM policies.
What we learned 🧠
- I learned how to move at incredible speeds by "vibe coding" with Google Antigravity and Stitch.
- I gained a deep, low-level understanding of the Web Audio API, AudioContext, and raw PCM byte manipulation.
- I learned how to effectively utilize the new Google Agent Development Kit (ADK) to manage session state across a complex multi-agent system.
What's next for IAN: Intelligent Accessibility Navigator 🔮
This hackathon was just the proof of concept. The next steps for IAN include:
- Multi-Tab Memory: Giving the agent a contextual memory graph so it can open new tabs, compare products, and synthesize research for the user.
- Chrome Extension: Migrating the Cloud Run backend logic into a local Chrome Extension so IAN can drive the user's actual local browser instead of a proxy headless instance.
- Action Verification: Implementing a validation loop where IAN double-checks the DOM state after a visual click to ensure the intended action actually occurred.
The era of struggling with inaccessible HTML is over. With Gemini's multimodal capabilities, if you can see it on the screen, IAN can click it for you.
Built With
- fastapi
- gemini-2.5-flash
- gemini-native-audio
- google-agent-development-kit-adk
- google-antigravity
- google-cloud-run
- google-cloud-secret-manager
- google-stitch-mcp
- next.js
- playwright
- python
- react
- tailwind-css
- vertex-ai
- web-audio-api
- websockets