Inspiration

1.3 billion people live with impairments that lock them out of a digital-first world. The tools meant to help them are brittle: screen readers fail on complex layouts, and voice commands break when UIs update. The WebAIM Million report shows 96.3% of homepages fail basic accessibility compliance.

We asked: What if instead of waiting for the web to be rebuilt, we built an AI that used it exactly like a human? No APIs. No DOM parsing. Just vision, intelligence, and voice.

What it does

Phantom is a "Navigator for All" — a multimodal, Zero-DOM Live Agent. It transforms any graphical interface into an accessible, conversational experience.

Unlike tools that rely on hidden HTML tags, Phantom "looks" at the screen using Gemini 2.0 Flash Vision, mapping interactive elements to exact [X, Y] coordinates.
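As a sketch of how that mapping can work: Gemini reports bounding boxes on a normalized 0-1000 grid, which then gets scaled to the real screen resolution. The response shape and element labels below are illustrative assumptions, not Phantom's actual schema:

```python
import json

# Hypothetical shape of a vision response: a JSON list of on-screen elements,
# each with a label and a normalized [ymin, xmin, ymax, xmax] box on a
# 0-1000 grid (the convention Gemini uses for bounding boxes).
SAMPLE_RESPONSE = """[
  {"label": "Search flights", "box_2d": [880, 400, 940, 600]},
  {"label": "Departure city", "box_2d": [200, 100, 260, 480]}
]"""

def element_centers(response_text, screen_w, screen_h):
    """Map each detected element to the pixel [x, y] of its center."""
    elements = json.loads(response_text)
    centers = {}
    for el in elements:
        ymin, xmin, ymax, xmax = el["box_2d"]
        # Convert the 0-1000 normalized box to absolute pixel coordinates.
        cx = (xmin + xmax) / 2 / 1000 * screen_w
        cy = (ymin + ymax) / 2 / 1000 * screen_h
        centers[el["label"]] = (round(cx), round(cy))
    return centers

print(element_centers(SAMPLE_RESPONSE, 1280, 720))
# -> {'Search flights': (640, 655), 'Departure city': (371, 166)}
```

The resulting pixel centers are what the action layer clicks, so the agent never needs to know what the underlying HTML looks like.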

When a user says, "Find a direct flight from Paris to Dubai," Phantom:

  1. Sees: Analyzes the visual context.
  2. Hears: Understands intent.
  3. Acts: Types and clicks using relative coordinates.
  4. Speaks: Narrates results naturally using Google Cloud TTS.

If a human eye can see a button, Phantom can click it.
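The four-step loop above can be sketched as a single conversational turn. The agent callables here are illustrative stubs standing in for the real components, not Phantom's actual ADK interfaces:

```python
# A minimal sketch of the sees -> hears -> acts -> speaks loop.
# Every agent is passed in as a plain callable so the flow is easy to follow.

def run_turn(command, screenshot_agent, analyzer_agent, action_agent, tts):
    """One conversational turn: capture, analyze, act, narrate."""
    screen = screenshot_agent()              # Sees: capture the current screen
    plan = analyzer_agent(screen, command)   # Hears: map intent to concrete steps
    for step in plan["steps"]:               # Acts: click/type at [x, y] coords
        action_agent(step)
    return tts(plan["summary"])              # Speaks: narrate the result

# Usage with stand-in agents (real ones would call Gemini, Playwright, TTS):
result = run_turn(
    "Find a direct flight from Paris to Dubai",
    screenshot_agent=lambda: "<base64 png>",
    analyzer_agent=lambda screen, cmd: {
        "steps": [{"action": "click", "x": 640, "y": 655}],
        "summary": "Found 3 direct flights.",
    },
    action_agent=lambda step: None,
    tts=lambda text: "[audio] " + text,
)
print(result)  # -> [audio] Found 3 direct flights.
```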

How we built it

Phantom is an enterprise-grade application orchestrated via the Google Agent Development Kit (ADK):

  • Zero-DOM Coordination: Three ADK-orchestrated agents (Screenshot, Analyzer, Action) collaborate to process visual data and translate intent into physical mouse clicks.
  • Real-Time Streaming: We replaced REST with FastAPI WebSockets for fluid, bidirectional streaming of Base64 screens and high-def Neural2-D audio.
  • GCP Native: Containerized on Cloud Run, managing continuous state via Cloud Storage and Firestore, with components decoupled through Pub/Sub.
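For illustration, a screenshot frame on the WebSocket might be enveloped like this. The field names are assumptions for this sketch, not our actual wire format:

```python
import base64
import json

def screen_frame(png_bytes, frame_id):
    """Wrap one captured screenshot in a JSON envelope for the WebSocket.

    Binary PNG data is Base64-encoded so it can travel inside a text frame
    alongside audio and status messages on the same bidirectional socket.
    """
    return json.dumps({
        "type": "screen",          # distinguishes screen vs. audio frames
        "id": frame_id,            # monotonically increasing frame counter
        "data": base64.b64encode(png_bytes).decode("ascii"),
    })

# The receiving side simply reverses the encoding:
frame = json.loads(screen_frame(b"\x89PNG...", 1))
png = base64.b64decode(frame["data"])
```

In the real app, a FastAPI WebSocket endpoint would `await websocket.send_text(...)` these envelopes continuously instead of answering one REST request per screen.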

Challenges we ran into

  • Bot Detection: Playwright was immediately flagged by Google Flights. We built a custom "stealth" browser context (spoofing locales/timezones) to slip past its bot detection.
  • Latency: REST was too slow for a "Live Agent" feel. Switching to WebSockets cut reaction times down to seconds.
  • Robotic Narration: Early versions sounded like a script. We rewrote the agent to act silently, introducing a final "Summarizer" step where Gemini conversationally describes the result.
  • The DOM Addiction: We forced ourselves to remain strictly "Zero-DOM," requiring intense prompt engineering so Gemini returned highly accurate geometric coordinates without hallucinating.
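On the bot-detection point, here is the kind of options dictionary one might pass to Playwright's `browser.new_context()` so the automated session presents a consistent fingerprint. The specific locale, timezone, user-agent string, and viewport are illustrative assumptions, not our exact production values:

```python
def stealth_context_options(locale="fr-FR", timezone_id="Europe/Paris"):
    """Build kwargs for Playwright's browser.new_context().

    The goal is consistency: locale, timezone, user agent, and viewport
    should all tell the same story, since mismatches between them are a
    common bot-detection signal.
    """
    return {
        "locale": locale,
        "timezone_id": timezone_id,
        # A mainstream desktop UA; the default headless UA advertises
        # "HeadlessChrome", which sites flag immediately.
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "viewport": {"width": 1280, "height": 720},
    }

# Usage (inside an async Playwright session):
#   context = await browser.new_context(**stealth_context_options())
#   page = await context.new_page()
```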

Accomplishments that we're proud of

  • Flawless Orchestration: Deploying containerized Playwright, WebSockets, and three ADK agents on serverless Cloud Run without timeouts.
  • Zero-DOM Navigation: Successfully navigating dynamic sites like Google Flights and Reddit entirely visually—no querySelector required.
  • Cyberpunk HUD: A frontend that renders Gemini's bounding boxes live on-screen, proving to the user exactly what the AI is "seeing".

What we learned

  • The Constraint is the Feature: Abandoning the DOM makes agents infinitely more resilient to tech debt, visual overhauls, and obfuscated code.
  • Modularity Prevents Hallucinations: Using the ADK to separate the "Eyes" from the "Brain" drastically reduced hallucinations. Small, JSON-structured payloads between agents beat a massive monolithic prompt.
  • Conversational Design is Hard: An agent that clicks fast is a tool. An agent that explains its clicks conversationally is a partner.

What's next for PhantomUi

  • Chrome Extension: Moving out of a sandbox into a native extension to utilize authenticated user sessions.
  • Barge-In Capabilities: Deeply integrating the Gemini Live API audio WebSocket so users can interrupt Phantom mid-navigation by voice with minimal latency.
  • Enterprise RPA: Deploying Phantom to navigate 20-year-old proprietary legacy systems where API integrations are impossible.
