Inspiration
Every dev team has been there: a test fails in CI, you re-run it, it passes, and you spend the next hour wondering if you broke something or if the test is just unreliable. Flaky tests cost real time. The problem is that figuring out whether a failure is a true regression or just an intermittent glitch usually means manually digging through Sentry, re-running tests by hand, and comparing results across runs. We wanted to automate that entire process with an agent that does the detective work for you.
What it does
Flaky Test Hunter takes a test failure report from Sentry and automatically diagnoses whether it is a real bug or a flaky test. The agent parses the failure into a step-by-step browser repro plan, executes it live in a real browser (using Browserbase or local Playwright), and then checks historical run data to spot patterns. If recent runs show a mix of passes and failures, it calls the test flaky. If every recent run fails, it flags it as a likely regression. The result is a clear diagnosis with a confidence score and a recommended action, delivered in seconds.
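The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Diagnosis` type, the `classify` function, the `min_runs` threshold, the extra `insufficient_data` and `no_recent_failures` verdicts, and the confidence formulas are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Diagnosis:
    verdict: str       # "flaky", "likely_regression", or a fallback verdict
    confidence: float  # heuristic score between 0.0 and 1.0 (assumed scale)
    action: str        # recommended next step


def classify(recent_passed: Sequence[bool], min_runs: int = 3) -> Diagnosis:
    """Classify a test from its recent run history (True means the run passed)."""
    total = len(recent_passed)
    if total < min_runs:
        # Assumption: too little history to say anything useful.
        return Diagnosis("insufficient_data", 0.0, "Collect more runs first.")

    failures = sum(1 for passed in recent_passed if not passed)

    if failures == total:
        # Every recent run failed: consistent with a real regression.
        # Assumed heuristic: confidence grows with history length, capped at 0.95.
        confidence = min(0.95, 0.5 + 0.1 * total)
        return Diagnosis("likely_regression", confidence, "Bisect recent commits.")

    if failures == 0:
        # Assumption: the reported failure does not show up in stored history.
        return Diagnosis("no_recent_failures", 0.5, "Re-run and keep monitoring.")

    # Mixed passes and failures: the test is flaky.
    # Assumed heuristic: an even pass/fail split is the strongest flakiness signal.
    fail_rate = failures / total
    confidence = 0.5 + 0.4 * (1.0 - 2.0 * abs(0.5 - fail_rate))
    return Diagnosis("flaky", confidence, "Quarantine the test and check timing or shared state.")


if __name__ == "__main__":
    print(classify([True, False, True, False]))
    print(classify([False, False, False, False, False]))
```

Because the rule is deterministic, the same history always yields the same verdict, which makes the diagnosis easy to unit test.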
How we built it
We built the agent using Fetch.ai's uAgents framework, which is compatible with ASI:One and handles the agent identity, messaging, and protocol layers. The core workflow runs in Python and coordinates three main steps: browser-based reproduction, Redis-backed history analysis, and a deterministic classification algorithm. For browser execution, we integrated Browserbase for cloud runs with a local Playwright fallback. We also built a multi-agent mode where a Diagnostician Agent delegates browser execution to a separate Reproducer Agent over the Chat Protocol, then aggregates the results. A lightweight web UI lets users run pre-seeded demo scenarios without needing any credentials.
Challenges we ran into
The hardest part was keeping the system reliable when individual components fail. If Browserbase is unavailable, the agent falls back to local Playwright execution. If the remote Reproducer Agent times out, the Diagnostician Agent falls back to a local mock runner. If Redis is not connected, history lookups use an in-memory store. Designing those fallback layers so that no single failure breaks the whole diagnosis pipeline took a lot of iteration. We also had to define a strict contract between the repro plan and the repro result so that the mock and live runners are interchangeable, which means the demo never misrepresents what the agent actually does.
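One way to express these fallback layers is a small helper that tries each option in order and returns the first one that works. The sketch below is an illustration under assumptions: `first_available` and the two stand-in runner functions are hypothetical names, and the real Browserbase and Playwright calls are replaced by stubs so the example runs without credentials.

```python
import logging
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger("fallback")


def first_available(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each provider in order and return (name, result) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider()
        except Exception as exc:  # broad on purpose: any failure moves to the next option
            logger.warning("%s unavailable (%s); trying next option", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the real runners.
def run_on_browserbase() -> str:
    raise ConnectionError("Browserbase unreachable")  # simulate an outage


def run_on_local_playwright() -> str:
    return "repro completed locally"


if __name__ == "__main__":
    used, outcome = first_available([
        ("browserbase", run_on_browserbase),
        ("local_playwright", run_on_local_playwright),
    ])
    print(used, outcome)
```

The same helper shape could cover the Redis-to-in-memory and remote-to-mock fallbacks, since each is an ordered list of alternatives that return the same kind of result.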
Accomplishments that we're proud of
We are proud that the agent delivers a real diagnosis, not a fake one. The mock runner follows the exact same contract as the live Browserbase runner, so when we demo it without credentials, the logic is identical to what runs in production. We are also proud of the multi-agent architecture: two agents communicating over the ASI:One Chat Protocol, with one delegating work to the other and falling back gracefully if the other is not available.
What we learned
Building reliable agents is mostly about handling failure gracefully. A system that crashes when one dependency is missing is not useful in a real dev environment. We also learned that strict data contracts between components (using Pydantic models for the repro plan and result) make the whole system easier to test and reason about, and they force you to be honest about what your agent is actually doing.
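To make the idea of a strict contract concrete, here is a minimal sketch of a plan/result pair and a runner interface. The project uses Pydantic models; this example substitutes standard-library dataclasses so it runs with no dependencies. All field names, the `Runner` protocol, and `MockRunner` are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class ReproStep:
    action: str      # hypothetical action names, e.g. "goto", "click", "fill"
    target: str      # URL or selector the action applies to
    value: str = ""  # optional input, e.g. text to type


@dataclass(frozen=True)
class ReproPlan:
    test_name: str
    steps: List[ReproStep]

    def __post_init__(self) -> None:
        # Reject empty plans up front so every runner sees valid input.
        if not self.steps:
            raise ValueError("a repro plan needs at least one step")


@dataclass(frozen=True)
class ReproResult:
    test_name: str
    reproduced: bool     # True if the failure was observed again
    steps_executed: int
    runner: str          # which runner produced this result
    error: str = ""


class Runner(Protocol):
    """Anything with this method can serve as a runner, mock or live."""

    def run(self, plan: ReproPlan) -> ReproResult: ...


class MockRunner:
    """Deterministic runner for demos; returns the same type a live runner would."""

    def run(self, plan: ReproPlan) -> ReproResult:
        return ReproResult(
            test_name=plan.test_name,
            reproduced=True,
            steps_executed=len(plan.steps),
            runner="mock",
        )


def reproduce(runner: Runner, plan: ReproPlan) -> ReproResult:
    # Downstream code depends only on the contract, not on the runner type.
    return runner.run(plan)
```

Since `reproduce` only depends on the `Runner` interface, swapping the mock for a live browser runner would not change any downstream classification code.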
What's next for Flaky Test Hunter
We want to add support for authenticated staging environments, expand the set of supported browser actions, and integrate directly with CI providers like GitHub Actions so the agent can comment on pull requests with a diagnosis automatically.