Inspiration
Traditional browser automation requires writing brittle CSS selectors and Selenium scripts that break whenever a website updates. We wanted to create a system where anyone could automate web tasks by simply showing the computer what to do—just like training a new employee by having them watch over your shoulder.
What it does
Dooley watches a short screen recording of you performing a browser task (like logging into AWS or downloading a report), then replicates that workflow autonomously. It uses AI vision to understand your intent, generates an editable execution plan, and controls a live browser using Set-of-Mark visual grounding—no selectors or scripts required.
How we built it
We combined Gemini 3.0 Pro's 2M context window for video understanding with Gemini Pro for real-time visual grounding. The backend uses FastAPI with Playwright for browser control, while our Set-of-Mark (SoM) injection system overlays numbered badges on interactive elements, enabling the AI to achieve ~99% click accuracy by simply saying "click badge 42."
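To make the badge idea concrete, here is a minimal sketch of the bookkeeping behind Set-of-Mark: number the visible interactive elements, then turn a model reply such as "click badge 2" into a click point. It is an illustration under assumptions, not Dooley's actual code. The `Element` shape and the `assign_badges`, `parse_badge`, and `click_point` helpers are hypothetical names, and in the real system the bounding boxes would come from the live page rather than from hard-coded data.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Bounding box of one interactive element (hypothetical shape)."""
    label: str
    x: float
    y: float
    width: float
    height: float


def assign_badges(elements):
    """Give every visible element a badge number, starting at 1."""
    visible = [e for e in elements if e.width > 0 and e.height > 0]
    return {number: e for number, e in enumerate(visible, start=1)}


def parse_badge(reply):
    """Extract the badge number from a reply such as 'click badge 42'."""
    match = re.search(r"badge\s+(\d+)", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def click_point(badges, reply):
    """Resolve a model reply to the centre of the badged element, or None."""
    element = badges.get(parse_badge(reply))
    if element is None:
        return None
    return (element.x + element.width / 2, element.y + element.height / 2)


badges = assign_badges([
    Element("Sign in", x=100, y=40, width=80, height=30),
    Element("hidden tracker", x=0, y=0, width=0, height=0),  # skipped: zero size
    Element("Download report", x=300, y=200, width=120, height=40),
])
print(click_point(badges, "click badge 2"))
```

Because the model only ever names a badge number, the mapping from number to pixel position stays on the automation side, which is why the approach does not depend on the model estimating coordinates.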
Challenges we ran into
The biggest challenge was reliable element targeting—early attempts using raw coordinates or AI-generated selectors failed on dynamic pages. We solved this with our hybrid approach: fast-path text matching for speed, falling back to SoM+vision when needed. Handling page transitions and detecting when clicks hit the wrong element also required careful retry logic.
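The hybrid strategy and retry loop described above could be sketched roughly as follows. This is a simplified illustration, not the production logic: the rule "use the text fast path only when exactly one label matches" is an assumption, and `locate`, `click_with_retry`, and the callback parameters (`get_labels`, `click`, `landed_ok`, `vision_fallback`) are hypothetical names standing in for the browser and vision-model calls.

```python
def locate(target_text, labels, vision_fallback):
    """Return (badge, path); path is 'text' for the fast path, else 'vision'."""
    wanted = target_text.strip().lower()
    matches = [b for b, label in labels.items() if label.strip().lower() == wanted]
    if len(matches) == 1:
        # Assumed rule: a single exact match is unambiguous, so skip the model.
        return matches[0], "text"
    # Zero or several matches: defer to the (stubbed) SoM + vision step.
    return vision_fallback(target_text, labels), "vision"


def click_with_retry(target_text, get_labels, click, landed_ok,
                     vision_fallback, max_attempts=3):
    """Click the target, re-reading the page and retrying after a bad click.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        labels = get_labels()  # badges may change after a page transition
        badge, _path = locate(target_text, labels, vision_fallback)
        if badge is None:
            continue
        click(badge)
        if landed_ok():  # caller-supplied check that the click hit the target
            return attempt
    return None


# Stubbed usage: two "Download" buttons force the vision fallback.
page = {1: "Sign in", 2: "Download", 3: "Download"}
pick_last = lambda text, labels: max(labels)  # stand-in for the vision model
clicked = []
outcomes = iter([False, True])  # first click misses, second lands
attempts = click_with_retry("download", lambda: page, clicked.append,
                            lambda: next(outcomes), pick_last)
print(attempts, clicked)
```

Keeping the verification step as a separate callback makes it possible to swap in whatever "did the click land" signal fits a given page, such as a URL change or the appearance of an expected element.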
Accomplishments that we're proud of
Our Set-of-Mark system achieves near-perfect (~99%) click accuracy regardless of screen resolution or page layout. We also built an adaptive recovery system: when an expected element isn't found, the agent analyzes the page and autonomously finds an alternative, making it resilient to UI changes.
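As a rough illustration of the recovery idea, the sketch below looks for the closest surviving label when the expected one has disappeared. Note the substitution: Dooley's agent analyzes the page with a vision model, whereas this example uses plain fuzzy string matching from Python's standard `difflib` module as a stand-in. The `find_alternative` helper and the 0.6 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches


def find_alternative(expected_label, visible_labels, cutoff=0.6):
    """Return the visible label most similar to the expected one, or None.

    Fuzzy matching is a stand-in here for the model-driven page analysis.
    """
    lowered = {label.lower(): label for label in visible_labels}
    close = get_close_matches(expected_label.lower(), list(lowered),
                              n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


# The button was renamed in a UI update; the closest label is still found.
print(find_alternative("Download report",
                       ["Settings", "Download reports (CSV)", "Help"]))
```

String similarity only covers renames that keep most of the wording. A redesign that replaces "Download report" with an icon or a differently worded menu entry is the case where page-level analysis by the model is needed.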
What we learned
We learned that pure vision AI isn't enough for reliable automation—you need structured grounding like SoM badges to eliminate ambiguity. We also discovered that providing semantic intent ("what the user is trying to achieve") alongside action descriptions dramatically improves the AI's ability to recover from unexpected situations.
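One way to picture "intent alongside action" is a plan step that carries both, plus a recovery prompt that shows the model the intent when the planned target is missing. This is a sketch under assumptions: the `PlanStep` fields, the `build_recovery_prompt` helper, and the prompt wording are all hypothetical and not taken from Dooley's code.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """One step of an execution plan (hypothetical field names)."""
    action: str   # what to do, e.g. "click"
    target: str   # label the recording showed, e.g. "Download report"
    intent: str   # what the user is trying to achieve


def build_recovery_prompt(step, visible_labels):
    """Build a prompt asking the model to pick a badge that serves the intent."""
    options = "\n".join(
        f"[{number}] {label}" for number, label in sorted(visible_labels.items())
    )
    return (
        f"Planned action: {step.action} '{step.target}'\n"
        f"User intent: {step.intent}\n"
        "The planned target was not found. Visible badges:\n"
        f"{options}\n"
        "Reply with 'click badge N' for the badge that best serves the intent, "
        "or 'none' if no badge fits."
    )


step = PlanStep(action="click", target="Download report",
                intent="save this month's usage report locally")
prompt = build_recovery_prompt(step, {2: "Export as CSV", 1: "Settings"})
print(prompt)
```

With only the action description, "Export as CSV" has no textual overlap with "Download report"; the intent line is what gives the model a reason to choose it.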
What's next for Dooley
We plan to add support for multi-step workflows with branching logic, credential management for secure logins, and a library of shareable "recipes" that users can run with one click. Long-term, we envision Dooley becoming an AI teammate that can handle entire business processes by watching just a few examples.