Project Story: Visual QA Navigator
Inspiration: The Vision Gap in Modern QA
Traditional UI testing is often a battle against the DOM. We were inspired by the frustration of "brittle selectors" and "flaky tests" that break with the slightest code change, even when the visual user experience remains intact. We asked: What if an agent could simply "see" the application like a human QA engineer does?
The Visual QA Navigator was born from the desire to move beyond the text box and leverage the multimodal power of Gemini 2.0 Flash to interpret pixels, not just code.
How We Built It: Orchestrating a Multimodal Loop
Building the "Eyes & Hands" of an AI agent required a distributed, cloud-native architecture:
The Eyes (Chrome Extension): We built a Manifest V3 extension to capture a live screenshot stream. To optimize token usage and performance, we implemented Perceptual Hashing (pHash) to detect significant visual changes, comparing consecutive frames by the Hamming distance between their $n$-bit hashes: $$d(h_1, h_2) = \sum_{i=0}^{n-1} \lvert h_{1,i} - h_{2,i} \rvert$$ Only when this distance exceeds a calibrated threshold do we trigger the high-level multimodal reasoning loop.
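As a minimal sketch of that gating check, assuming the open-source imagehash and Pillow libraries (the threshold value here is illustrative, not our calibrated setting):

```python
import imagehash
from PIL import Image

# Illustrative threshold: bit differences below this are treated as noise.
PHASH_THRESHOLD = 10

def should_trigger_reasoning(prev_frame_path: str, new_frame_path: str) -> bool:
    """Return True when two screenshots differ enough to re-run the
    multimodal loop. Subtracting two ImageHash objects yields the
    Hamming distance between the 64-bit perceptual hashes."""
    prev_hash = imagehash.phash(Image.open(prev_frame_path))
    new_hash = imagehash.phash(Image.open(new_frame_path))
    return (prev_hash - new_hash) > PHASH_THRESHOLD
```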
The Brain (FastAPI & Gemini): The backend, hosted on Cloud Run, orchestrates the session. It uses a Sticky Context Buffer to ensure that even in long-running scenarios, the core persona and user goals are protected from context purging.
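A minimal sketch of the buffer's shape (the class and method names are illustrative, not our exact implementation):

```python
from collections import deque

class StickyContextBuffer:
    """Keeps the persona and test goal pinned while older UI frames are
    evicted, so long sessions never purge the core instructions."""

    def __init__(self, persona: str, goal: str, max_frames: int = 20):
        self.pinned = [persona, goal]            # never evicted
        self.frames = deque(maxlen=max_frames)   # oldest frame drops first

    def add_frame(self, frame_summary: str) -> None:
        self.frames.append(frame_summary)

    def build_prompt(self) -> list[str]:
        # Pinned context always leads; recent frames follow in order.
        return self.pinned + list(self.frames)
```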
The Hands (Action Runner): Gemini generates structured tool calls (click, type, scroll). If a DOM selector fails, our Autonomous Self-Healing logic takes over—Gemini identifies the element visually, and we execute the action via precise screen coordinates.
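A sketch of the fallback logic, written against Playwright's sync API purely for illustration (our runner actually drives the browser through the extension, and `ask_gemini_for_coords` is a placeholder for our multimodal lookup):

```python
from playwright.sync_api import Page, TimeoutError as PlaywrightTimeout

def run_click(page: Page, selector: str, ask_gemini_for_coords) -> None:
    """Try the structured DOM selector first; if it fails, fall back to
    the coordinates Gemini reads off the current screenshot."""
    try:
        page.click(selector, timeout=2_000)
    except PlaywrightTimeout:
        # Self-healing path: Gemini locates the element visually and
        # returns (x, y) in page pixels.
        x, y = ask_gemini_for_coords(page.screenshot(), selector)
        page.mouse.click(x, y)
```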
The Command Center (React Dashboard): A modern, glassmorphism-styled dashboard provides a real-time reactive interface (via Firestore) for managing test suites and performing high-fidelity visual audits with our custom overlay slider.
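On the backend side, a minimal sketch of how run state might be published to Firestore for the dashboard's real-time listeners (collection and field names are illustrative):

```python
from google.cloud import firestore

db = firestore.Client()

def publish_step(run_id: str, step: int, status: str, screenshot_url: str) -> None:
    """Write the latest step result; the React dashboard's snapshot
    listener picks this up in real time."""
    db.collection("test_runs").document(run_id).set(
        {
            "current_step": step,
            "status": status,
            "screenshot_url": screenshot_url,
            "updated_at": firestore.SERVER_TIMESTAMP,
        },
        merge=True,
    )
```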
Challenges Faced: Bridging the Interaction Gap
The road to a functional multimodal agent was not without its hurdles:
- Latency & Sync: Synchronizing a high-frequency screenshot stream with AI inference required careful handling of WebSockets and stateful session management (see the sketch after this list).
- Context Retention: Protecting the agent's long-term memory about the test goal while processing hundreds of UI frames was a delicate balancing act of history management.
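To make the first hurdle concrete, here is a minimal FastAPI sketch of the pattern we converged on: keep only the newest frame per session so slow inference never queues behind the screenshot stream (the endpoint path and state shape are illustrative):

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
sessions: dict[str, dict] = {}  # per-session state keyed by session id

@app.websocket("/ws/{session_id}")
async def frame_stream(websocket: WebSocket, session_id: str):
    """Receive the extension's screenshot stream; newer frames overwrite
    older ones instead of piling up behind inference."""
    await websocket.accept()
    state = sessions.setdefault(session_id, {"latest_frame": None})
    try:
        while True:
            # Each message is one PNG screenshot from the extension.
            state["latest_frame"] = await websocket.receive_bytes()
    except WebSocketDisconnect:
        pass
    finally:
        sessions.pop(session_id, None)
```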
What We Learned: The Power of Seeing
In this journey, we learned that multimodal vision isn't just a new "feature"—it's a paradigm shift for automation. By teaching the agent to reason about the visual structure of a page, we created a system that is inherently resilient. We discovered that when an AI can "see," it can "heal," turning a broken test into a minor visual correction rather than a critical failure.
Built for the Gemini Live Agent Challenge
Built With
- Chrome Extension
- Docker
- FastAPI
- Firebase
- Gemini 2.0 Flash
- GenAI SDK
- Google Cloud Run
- Python 3.11
- React 18
- Tailwind CSS
- Vertex AI
- Vite
- WebSockets