Inspiration The "Action Era" of AI is about moving from chatbots to agents that actually solve friction.1 Every developer has faced the nightmare of a "5-second bug report" video where "it doesn't work," but the exact state is impossible to reproduce. I wanted to build a tool that doesn't just see pixels, but understands the causal logic behind a crash .
How it was Built The project was "vibe-coded" entirely within the AI Studio Build Mode, utilizing a React frontend architecture.1 The core engine uses a multi-model pipeline:
Vision Analysis:
gemini-3-flash-previewscans the uploaded.mp4at a high frame rate (up to 10 FPS) to identify UI components and user actions .State Preservation: I implemented a custom state manager using the browser’s
localStorageto capture and circulate Thought Signatures . This ensures that when the agent transitions from "watching" to "coding," it retains the exact internal reasoning state .Autonomous Reasoning:
gemini-3-pro-previewis invoked withthinking_level: "high"to perform multi-step planning.2 It compares the visual "Point of Failure" against grounded documentation retrieved via the Google Search Tool.4
Challenges The primary hurdle was temporal blindness—the tendency for models to lose the chronological order of events in long contexts . I solved this by forcing the model to produce a Verification Artifact 6: a timestamped log that maps every visual action to a DOM state.
Learnings I learned that in the Action Era, a prompt is an architecture.7 To avoid the 400 validation error common in multi-turn tool use, you must treat Thought Signatures as mandatory cryptographic save-states.8 The efficiency of the video tokenization can be represented by the following cost estimation for a 60-second clip:
$$C_{total} = \sum_{t=1}^{T} (F_{tokens} \cdot s) + A_{tokens} + M_{tokens}$$
Where $s$ is the sampling rate (FPS), $F$ represents frame tokens, $A$ is audio, and $M$ is metadata.
Built With
- adk
- cypress
- gemini-3-pro-preview
- geminiapi
- playwright
- react
Log in or sign up for Devpost to join the conversation.