Project Story: Visual QA Navigator
Inspiration: The Vision Gap in Modern QA
Traditional UI testing is often a battle against the DOM. We were inspired by the frustration of "brittle selectors" and "flaky tests" that break with the slightest code change, even when the visual user experience remains intact. We asked: What if an agent could simply "see" the application like a human QA engineer does?
The Visual QA Navigator was born from the desire to move beyond the text box and leverage the multimodal power of Gemini 2.0 Flash to interpret pixels, not just code.
How We Built It: Orchestrating a Multimodal Loop
Building the "Eyes & Hands" of an AI agent required a distributed, cloud-native architecture:
The Eyes (Chrome Extension): We built a Manifest V3 extension to capture a live screenshot stream. To optimize token usage and performance, we implemented Perceptual Hashing (pHash) to detect significant visual changes, comparing consecutive frames by the Hamming distance between their $n$-bit hashes: $$d(h_1, h_2) = \sum_{i=0}^{n-1} \lvert h_{1,i} - h_{2,i} \rvert$$ Only when this distance exceeds a calibrated threshold do we trigger the high-level multimodal reasoning loop.
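As a minimal sketch of that gating check, assuming the open-source imagehash and Pillow libraries (the threshold value here is illustrative, not our calibrated setting):

```python
import imagehash
from PIL import Image

# Illustrative threshold: bit differences below this are treated as noise.
PHASH_THRESHOLD = 10

def should_trigger_reasoning(prev_frame_path: str, new_frame_path: str) -> bool:
    """Return True when two screenshots differ enough to re-run the
    multimodal loop. Subtracting two ImageHash objects yields the
    Hamming distance between the 64-bit perceptual hashes."""
    prev_hash = imagehash.phash(Image.open(prev_frame_path))
    new_hash = imagehash.phash(Image.open(new_frame_path))
    return (prev_hash - new_hash) > PHASH_THRESHOLD
```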
The Brain (FastAPI & Gemini): The backend, hosted on Cloud Run, orchestrates the session. It uses a Sticky Context Buffer to ensure that even in long-running scenarios, the core persona and user goals are protected from context purging.
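A minimal sketch of the buffer's shape (the class and method names are illustrative, not our exact implementation):

```python
from collections import deque

class StickyContextBuffer:
    """Keeps the persona and test goal pinned while older UI frames are
    evicted, so long sessions never purge the core instructions."""

    def __init__(self, persona: str, goal: str, max_frames: int = 20):
        self.pinned = [persona, goal]            # never evicted
        self.frames = deque(maxlen=max_frames)   # oldest frame drops first

    def add_frame(self, frame_summary: str) -> None:
        self.frames.append(frame_summary)

    def build_prompt(self) -> list[str]:
        # Pinned context always leads; recent frames follow in order.
        return self.pinned + list(self.frames)
```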
The Hands (Action Runner): Gemini generates structured tool calls (click, type, scroll). If a DOM selector fails, our Autonomous Self-Healing logic takes over—Gemini identifies the element visually, and we execute the action via precise screen coordinates.
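A sketch of the fallback logic, written against Playwright's sync API purely for illustration (our runner actually drives the browser through the extension, and `ask_gemini_for_coords` is a placeholder for our multimodal lookup):

```python
from playwright.sync_api import Page, TimeoutError as PlaywrightTimeout

def run_click(page: Page, selector: str, ask_gemini_for_coords) -> None:
    """Try the structured DOM selector first; if it fails, fall back to
    the coordinates Gemini reads off the current screenshot."""
    try:
        page.click(selector, timeout=2_000)
    except PlaywrightTimeout:
        # Self-healing path: Gemini locates the element visually and
        # returns (x, y) in page pixels.
        x, y = ask_gemini_for_coords(page.screenshot(), selector)
        page.mouse.click(x, y)
```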
The Command Center (React Dashboard): A modern, glassmorphism-styled dashboard provides a real-time reactive interface (via Firestore) for managing test suites and performing high-fidelity visual audits with our custom overlay slider.
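On the backend side, a minimal sketch of how run state might be published to Firestore for the dashboard's real-time listeners (collection and field names are illustrative):

```python
from google.cloud import firestore

db = firestore.Client()

def publish_step(run_id: str, step: int, status: str, screenshot_url: str) -> None:
    """Write the latest step result; the React dashboard's snapshot
    listener picks this up in real time."""
    db.collection("test_runs").document(run_id).set(
        {
            "current_step": step,
            "status": status,
            "screenshot_url": screenshot_url,
            "updated_at": firestore.SERVER_TIMESTAMP,
        },
        merge=True,
    )
```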
Challenges Faced: Bridging the Interaction Gap
The road to a functional multimodal agent was not without its hurdles:
- Latency & Sync: Synchronizing a high-frequency screenshot stream with AI inference required careful handling of WebSockets and stateful session management (see the sketch after this list).
- Context Retention: Protecting the agent's long-term memory about the test goal while processing hundreds of UI frames was a delicate balancing act of history management.
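To make the first hurdle concrete, here is a minimal FastAPI sketch of the pattern we converged on: keep only the newest frame per session so slow inference never queues behind the screenshot stream (the endpoint path and state shape are illustrative):

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
sessions: dict[str, dict] = {}  # per-session state keyed by session id

@app.websocket("/ws/{session_id}")
async def frame_stream(websocket: WebSocket, session_id: str):
    """Receive the extension's screenshot stream; newer frames overwrite
    older ones instead of piling up behind inference."""
    await websocket.accept()
    state = sessions.setdefault(session_id, {"latest_frame": None})
    try:
        while True:
            # Each message is one PNG screenshot from the extension.
            state["latest_frame"] = await websocket.receive_bytes()
    except WebSocketDisconnect:
        pass
    finally:
        sessions.pop(session_id, None)
```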
What We Learned: The Power of Seeing
In this journey, we learned that multimodal vision isn't just a new "feature"—it's a paradigm shift for automation. By teaching the agent to reason about the visual structure of a page, we created a system that is inherently resilient. We discovered that when an AI can "see," it can "heal," turning a broken test into a minor visual correction rather than a critical failure.
Built for the Gemini Live Agent Challenge
Built With
- Chrome Extension
- Docker
- FastAPI
- Firebase
- Gemini 2.0 Flash
- GenAI SDK
- Google Cloud Run
- Python 3.11
- React 18
- Tailwind CSS
- Vertex AI
- Vite
- WebSockets