Triage

Triage - Paste a GitHub issue. Watch a real browser reproduce the bug.

Inspiration

Every developer knows the feeling. A user reports a bug, you read the ticket, and you have no idea how to make it happen again. The issue sits in the backlog for days, not because it's hard to fix, but because nobody can reproduce it. You can't fix what you can't see.

The standard tools tell you what broke after the fact. Log aggregators, crash reporters, stack traces — they all require the bug to be instrumented in advance, and they still leave the developer doing the manual work of recreating the exact sequence of actions that triggered it. That manual step is where bugs go to die.

Triage is built around one insight: the only way to truly reproduce a bug is to use the app the way the user did. Not read logs. Not search the codebase. Actually open the app and click through it until it breaks.

What it does:

Triage takes the URL of a GitHub issue written in vague, human prose and turns it into a confirmed bug reproduction, automatically.

A developer pastes a link. Three coordinating agents take over from there:

ParserAgent reads the issue and infers structured reproduction steps, including the preconditions the user never mentioned.

ReproAgent opens a real cloud browser and clicks through the live app, typing, clicking, and navigating exactly like a person would. When the bug fires, it captures the screenshot, the console error, and the full session.

HypothesisAgent reads the evidence and diagnoses the root cause from behavior alone, with no source code required. If the first attempt missed something, it redirects the system to try again with corrected steps.
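As a rough illustration of what "structured reproduction steps" could look like, here is a minimal Python sketch. The `ReproStep` and `ReproPlan` names, their fields, the placeholder issue URL, and the example steps are all assumptions made for illustration, not Triage's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ReproStep:
    """One natural-language browser instruction (hypothetical shape)."""
    instruction: str
    inferred: bool = False  # True when the issue never stated this step


@dataclass
class ReproPlan:
    """Structured output a parser agent could hand to a repro agent."""
    issue_url: str
    preconditions: list[str] = field(default_factory=list)
    steps: list[ReproStep] = field(default_factory=list)

    def inferred_steps(self) -> list[ReproStep]:
        # Steps the parser added on its own; natural candidates to revisit on a retry.
        return [s for s in self.steps if s.inferred]


plan = ReproPlan(
    issue_url="https://github.com/example/todo-app/issues/1",  # placeholder URL
    preconditions=["the task list is empty"],
    steps=[
        ReproStep("focus the input"),
        ReproStep("type a task name"),
        ReproStep("press Enter to add the task", inferred=True),
        ReproStep("click the delete button"),
    ],
)
```

Marking which steps were inferred rather than stated gives a later diagnosis pass an obvious place to look when an attempt fails.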

What comes out at the end is a structured report: confirmed reproduction steps, a root cause hypothesis with confidence level, per-step pass/crash breakdown, and embedded session replays for both attempts. Everything a developer needs to go straight to the fix.

How we built it

Claude — the reasoning layer across all three agents: Claude runs inside every agent. ParserAgent uses it to read vague issue prose and infer structured steps with unstated preconditions. HypothesisAgent uses it to reason from a console error and crash behavior to a root cause — diagnosing from symptoms, not source code. The final synthesis pass uses it to turn raw evidence into the structured report. Claude isn't a single call at the end — it's the reasoning engine that makes each agent intelligent.

Browserbase + Stagehand — the hands: Browserbase is not bolted on. It is the reproduction. Without a real cloud browser executing real actions against a real URL, Triage is just a text parser. Browserbase is what makes the claim "it actually uses your app" true rather than metaphorical.

Every repro attempt spins a fresh Browserbase session — no reused state, clean slate — and drives it with Stagehand act() calls. Each step is a natural language instruction: "focus the input," "type a task name," "click the delete button." Stagehand translates those into actual browser events. Screenshots are captured after every step. The console error stream is monitored as the unambiguous bug-detection signal.
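The per-step loop described above can be sketched roughly as follows. This is not the Stagehand or Browserbase API: `act` and `read_console_errors` are hypothetical callables standing in for the real browser calls, `StepResult` is an invented record type, and screenshot capture is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    instruction: str
    outcome: str               # "pass" or "crash"
    console_errors: list[str]


def run_attempt(
    steps: list[str],
    act: Callable[[str], None],                    # stand-in for a Stagehand act() call
    read_console_errors: Callable[[], list[str]],  # stand-in for the console error stream
) -> list[StepResult]:
    """Run steps in order and stop at the first step that surfaces a console error."""
    results: list[StepResult] = []
    for instruction in steps:
        act(instruction)
        errors = read_console_errors()
        results.append(StepResult(instruction, "crash" if errors else "pass", errors))
        if errors:
            break  # the bug fired; later steps would run against a broken page
    return results


# Fake browser for demonstration: the delete click emits a simulated console error.
pending: list[str] = []


def fake_act(instruction: str) -> None:
    if "delete" in instruction:
        pending.append("TypeError: (simulated)")


def fake_console() -> list[str]:
    drained = list(pending)
    pending.clear()
    return drained


results = run_attempt(
    ["focus the input", "type a task name", "click the delete button"],
    fake_act,
    fake_console,
)
```

The list of `StepResult` records is the kind of data a per-step pass/crash breakdown could be built from.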

The Browserbase Live View is what the demo shows: a real browser, in the cloud, clicking through a real app in real time. The session replay links go directly into the final report so anyone can watch the full run back, frame by frame. The Session Inspector gives the full log of every network request and action taken — used during development to debug the agent's own behavior.

If you pulled Browserbase out of Triage, there is no product. It is the execution layer the entire system exists to drive.

Band — the coordination room: Three agents. One shared room. Every message routed by @mention.

ParserAgent, ReproAgent, and HypothesisAgent each have their own identity, their own role, and their own logic. They are not functions called in sequence. They communicate exclusively through the Band room, and they only act when directly @mentioned — no agent responds to noise addressed to someone else.

The room transcript is what makes the coordination visible and verifiable. A judge can read the Band room and watch the agents actually talk: ParserAgent hands off steps, ReproAgent reports a failure, HypothesisAgent diagnoses and redirects, ParserAgent revises, ReproAgent retries. That is a real conversation between real agent identities, not a log of function calls with labels applied after the fact.

Band is load-bearing in a specific way: the retry loop — the mechanism that makes Triage more than a one-shot script — is entirely driven by agent-to-agent coordination through the room. HypothesisAgent's redirect message to ParserAgent is what triggers the re-parse and the second attempt. Pull Band out and the retry loop collapses. The system can no longer self-correct.
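To make the @mention rule concrete, here is a toy in-memory room in Python. It is only a sketch of the routing idea and does not use Band's real API: the `Room` class, its `post` method, and the message texts are assumptions.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")


class Room:
    """Toy shared room: a message is delivered only to the members it @mentions."""

    def __init__(self, members: set[str]) -> None:
        self.members = members
        self.transcript: list[tuple[str, str]] = []
        self.inbox: dict[str, list[str]] = defaultdict(list)

    def post(self, sender: str, text: str) -> list[str]:
        """Record the message, deliver it to mentioned members, return who received it."""
        self.transcript.append((sender, text))
        targets = [m for m in MENTION.findall(text) if m in self.members and m != sender]
        for name in targets:
            self.inbox[name].append(text)
        return targets  # an empty list means the message reached nobody


room = Room({"ParserAgent", "ReproAgent", "HypothesisAgent"})
room.post("ReproAgent", "@HypothesisAgent attempt 1 did not trigger the bug")
room.post("HypothesisAgent", "@ParserAgent retry with corrected steps")
lost = room.post("ParserAgent", "revised steps ready")  # no @mention, so nobody acts
```

The last call shows the failure mode the retry loop has to avoid: the message is in the transcript, but no agent receives it, so the second attempt never starts.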

Arize — observability and memory:

Arize does two things in Triage, and the second one is what makes it genuinely different from a logging tool.

Observability. Every run produces a triage_run trace with child spans per attempt. Each repro_attempt span breaks down into stagehand_action spans — one per browser step — tagged with outcome, screenshot URL, and any console errors captured. The full decision trail from issue input to final report is in one place, navigable and shareable.
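The span hierarchy can be pictured with a small Python sketch. The span names (`triage_run`, `repro_attempt`, `stagehand_action`) come from the description above; the `Span` class, the attribute keys, and the example values are assumptions, and no real tracing SDK is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Minimal stand-in for a trace span with attributes and child spans."""
    name: str
    attributes: dict[str, object] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, **attributes: object) -> "Span":
        span = Span(name, dict(attributes))
        self.children.append(span)
        return span


def outline(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(outline(c, depth + 1))
    return lines


run = Span("triage_run")
attempt = run.child("repro_attempt", attempt=1)
attempt.child("stagehand_action", step="focus the input", outcome="pass")
attempt.child(
    "stagehand_action",
    step="click the delete button",
    outcome="crash",
    console_errors=["TypeError: (simulated)"],
)
```

Calling `outline(run)` yields one root line, one attempt line, and one line per browser step, which mirrors how the trace would be navigated.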

Evaluation. After each attempt, an LLM judge scores it on three dimensions:

| Metric | What it measures |
| --- | --- |
| repro_fidelity | Did it actually reproduce the bug? |
| root_cause_correctness | Was the diagnosis right? |
| honesty | Did the system accurately report what happened? |

In the demo, attempt one scores $\text{repro\_fidelity} = 0$: it didn't reproduce the bug, and the system said so. Attempt two scores $\text{repro\_fidelity} = 1$. That $0 \to 1$ shift is visible directly in the trace and is the evidence that the improvement is real, not claimed.

The outer loop. This is where Arize becomes the memory. The system doesn't just write traces — it reads them back. Before each new run on the same bug, Triage queries Arize for prior scored history, distills a lesson from what failed and what worked, and injects that context into the Band room as the starting point. In the demo, this shows up as a 🧠 Prior-run memory message in the Band room — derived from real Arize traces, shaping the first attempt before a single browser action fires. The result: the same bug that took a retry last time gets reproduced on the first attempt.
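One way to picture the "distill a lesson" step is the sketch below. It swaps in a simple rule-based summary where the real system may well use a model, and everything in it is an assumption: the `ScoredRun` record, the `distill_lesson` function, the wording of the memory line, and the example history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredRun:
    steps: list[str]
    repro_fidelity: int  # 0 or 1, as assigned by the judge
    diagnosis: str


def distill_lesson(history: list[ScoredRun]) -> Optional[str]:
    """Summarize prior scored runs into one memory line, or None with no history."""
    if not history:
        return None
    failed = sum(1 for run in history if run.repro_fidelity == 0)
    worked = [run for run in history if run.repro_fidelity == 1]
    if not worked:
        return f"Prior-run memory: {failed} earlier attempt(s) failed; no working steps yet."
    best = worked[-1]  # most recent run that reproduced the bug
    return (
        f"Prior-run memory: {failed} earlier attempt(s) failed. "
        f"Steps that worked: {'; '.join(best.steps)}. "
        f"Diagnosis: {best.diagnosis}"
    )


# Made-up history for demonstration only.
history = [
    ScoredRun(["type a task name", "click the delete button"], 0, "task was never added"),
    ScoredRun(
        ["type a task name", "press Enter", "click the delete button"],
        1,
        "deleting the last task crashes the list view",
    ),
]
lesson = distill_lesson(history)
```

The returned string is the kind of message that could be posted into the room before the first attempt, so the parser starts from steps that have already worked.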

Arize isn't recording what happens. It's the feedback loop that makes the system learn.

The two loops

┌─────────────────────────────────────────────────────┐
│                     INNER LOOP                      │
│                  (within one run)                   │
│                                                     │
│  Attempt fails → ReproAgent reports to Band room    │
│    → HypothesisAgent diagnoses + redirects          │
│    → ParserAgent re-derives steps                   │
│    → Fresh Browserbase session → Attempt 2          │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│                     OUTER LOOP                      │
│                    (across runs)                    │
│                                                     │
│  Arize stores eval scores + diagnoses per run       │
│    → Next run queries that scored history           │
│    → Lesson injected into Band room as memory       │
│    → Attempt 1 starts smarter                       │
└─────────────────────────────────────────────────────┘

The inner loop makes Triage self-correcting within a run. The outer loop makes it self-improving across runs.
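The control flow of the two loops can be sketched in a few lines of Python. The `triage_run` function and its callback parameters are hypothetical, and the toy `parse`, `attempt`, and `diagnose` functions below only simulate agents; in the real system these hand-offs happen as messages between agents rather than as direct calls.

```python
from typing import Callable, Optional


def triage_run(
    parse: Callable[[Optional[str]], list[str]],  # hint -> reproduction steps
    attempt: Callable[[list[str]], bool],          # steps -> did the bug fire?
    diagnose: Callable[[list[str]], str],          # failed steps -> redirect hint
    memory: Optional[str] = None,                  # outer-loop lesson, if any
    max_attempts: int = 2,
) -> dict:
    hint = memory
    steps: list[str] = []
    for n in range(1, max_attempts + 1):
        steps = parse(hint)
        if attempt(steps):
            return {"reproduced": True, "attempts": n, "steps": steps}
        hint = diagnose(steps)  # inner loop: redirect, then re-parse on the next pass
    return {"reproduced": False, "attempts": max_attempts, "steps": steps}


# Toy agents: the bug only fires if the task is actually added before deletion.
def parse(hint: Optional[str]) -> list[str]:
    steps = ["type a task name", "click the delete button"]
    if hint and "press Enter" in hint:
        steps.insert(1, "press Enter")
    return steps


def attempt(steps: list[str]) -> bool:
    return "press Enter" in steps


def diagnose(steps: list[str]) -> str:
    return "task was never added; press Enter after typing"


cold = triage_run(parse, attempt, diagnose)
warm = triage_run(parse, attempt, diagnose, memory="press Enter after typing")
```

In this toy setup the cold run needs the inner loop's retry, while the warm run, seeded with a lesson from an earlier run, succeeds on its first attempt.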

Architecture

[GitHub Issue URL]
        │
        ▼
┌─────────────────────┐
│     ParserAgent     │ ── Claude: prose → structured steps
│     (Band room)     │
└─────────────────────┘
        │ @ReproAgent
        ▼
┌─────────────────────┐     ┌─────────────────────┐
│     ReproAgent      │ ──► │     Browserbase     │
│     (Band room)     │     │   Stagehand act()   │
└─────────────────────┘     │ Live View + Replay  │
        │ @HypothesisAgent  └─────────────────────┘
        ▼
┌─────────────────────┐
│   HypothesisAgent   │ ── Claude: evidence → root cause
│     (Band room)     │ ── redirect → @ParserAgent (if retry)
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│        Arize        │ ── spans, evals, outer-loop memory
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│     Report Card     │ ── verdict, steps, root cause, replays
└─────────────────────┘

Challenges we ran into

Browserbase — reliable bug detection. The hardest problem was knowing whether the bug actually fired versus the app just being slow. We couldn't rely on URL changes or element absence. The solution was monitoring the browser's console error stream: the specific TypeError became our unambiguous detection signal, combined with a blank-screen screenshot as a second confirmation. Getting Stagehand to sequence actions reliably without timing races — especially across a fail-then-retry loop with a fresh session — took significantly more iteration than expected.
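Telling "the bug fired" apart from "the app is just slow" comes down to polling for the error signature up to a deadline and then checking the second signal. A minimal sketch, assuming hypothetical `read_console_errors` and `is_blank_screen` callables in place of the real browser hooks, with invented timeout values and result labels:

```python
import time
from typing import Callable


def wait_for_bug(
    read_console_errors: Callable[[], list[str]],
    is_blank_screen: Callable[[], bool],
    signature: str = "TypeError",
    timeout_s: float = 2.0,
    poll_s: float = 0.05,
) -> str:
    """Return "confirmed", "error_only", or "not_detected"."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if any(signature in e for e in read_console_errors()):
            # The blank screen acts as a second, independent confirmation.
            return "confirmed" if is_blank_screen() else "error_only"
        time.sleep(poll_s)
    return "not_detected"  # nothing within the deadline: slow or simply not broken


# Simulated console: the error only shows up on the third poll.
calls = {"n": 0}


def late_error() -> list[str]:
    calls["n"] += 1
    return ["TypeError: (simulated)"] if calls["n"] >= 3 else []


result = wait_for_bug(late_error, lambda: True, timeout_s=1.0, poll_s=0.01)
quiet = wait_for_bug(lambda: [], lambda: True, timeout_s=0.05, poll_s=0.01)
```

Using a monotonic deadline rather than a fixed sleep keeps a slow page from being misread as a pass, and keeps a fast crash from costing the full timeout.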

Band — three live connections, deterministic routing. Keeping three WebSocket connections alive simultaneously at a demo booth is a real reliability concern. The bigger challenge was enforcing strict @mention discipline: any message without a correct @mention reaches nobody, so a routing mistake silently breaks the coordination loop. We built the agent message schema around this constraint from the start rather than retrofitting it, which meant locking the Band module interface before any agent logic was written.

Arize — the outer loop indexing lag. The Arize trace list view is backed by an index that lags behind the primary store. A trace exists and is fully queryable by trace_id within seconds of a run completing, but doesn't appear in the time-range list for several minutes. For the outer loop memory read, we query by trace_id directly rather than scanning the list, which makes the memory load reliably fast (~3.7s from 4 prior runs) regardless of index lag.
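The workaround depends on knowing the trace IDs up front. How Triage stores them is not described here, so the sketch below simply assumes a small local ledger keyed by issue URL; `TraceLedger`, `fetch_trace`, and the fake store are hypothetical and do not represent the Arize client API.

```python
from typing import Callable, Optional


class TraceLedger:
    """Remembers which trace IDs belong to which issue (hypothetical helper)."""

    def __init__(self) -> None:
        self._ids: dict[str, list[str]] = {}

    def record(self, issue_url: str, trace_id: str) -> None:
        self._ids.setdefault(issue_url, []).append(trace_id)

    def load_history(
        self,
        issue_url: str,
        fetch_trace: Callable[[str], Optional[dict]],  # direct lookup by trace_id
    ) -> list[dict]:
        """Fetch each known trace by ID instead of scanning a lagging list view."""
        traces = []
        for trace_id in self._ids.get(issue_url, []):
            trace = fetch_trace(trace_id)
            if trace is not None:
                traces.append(trace)
        return traces


# Fake primary store for demonstration; "t3" was recorded but cannot be fetched.
store = {"t1": {"repro_fidelity": 0}, "t2": {"repro_fidelity": 1}}
issue = "https://github.com/example/todo-app/issues/1"  # placeholder URL

ledger = TraceLedger()
for trace_id in ("t1", "t2", "t3"):
    ledger.record(issue, trace_id)

prior = ledger.load_history(issue, store.get)
```

Because every lookup is keyed by ID, the result does not depend on whether the time-range index has caught up yet.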

Accomplishments we're proud of

The outer loop memory is the one we're most proud of — Arize going from passive observer to active memory that shapes the next run is a meaningful architectural idea, and watching the 🧠 line appear in the Band room with a real lesson derived from real scored traces is the moment the whole system feels like it's actually learning.

The other is the honesty of the evaluation. bug.detected flips for real. The LLM judge scores $\text{repro\_fidelity} = 0$ on a genuine failure and $\text{repro\_fidelity} = 1$ on a genuine success, not because we tuned it to look good, but because the browser actually went blank on attempt two and not on attempt one. The scores mean something.

What we learned

Depth beats breadth. Every integration in Triage is load-bearing — Browserbase is the execution layer, Band is the coordination layer, Arize is the memory layer. None of them could be removed without the system breaking in a fundamental way. That constraint forced better architecture decisions than we would have made if the integrations were optional add-ons.

We also learned that the hard part of browser automation isn't the browser. Stagehand makes the individual actions straightforward. The hard part is sequencing — knowing when state is ready, detecting whether a step actually succeeded, and deciding what failure means for the overall run. That's where the agent reasoning earns its place.

What's next for Triage

The bug in the demo was planted deliberately, but the pipeline itself is general. It works against any web app with a GitHub issue. The next step is broader testing across a wider class of frontend bugs, and adding one more output to the report: after the hypothesis is formed, use GitHub's API to search the repo and point at the exact file and line. The root cause hypothesis already exists. Pointing at the code is one API call away.
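As a sketch of that future step, the request below targets GitHub's REST code search endpoint with a `repo:` qualifier. The function name, the example repository, and the `removeTask` symbol are made up, and the request is only built here, not sent. Code search returns matching files, so narrowing to an exact line would likely need further processing of the results.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_code_search_request(repo: str, symbol: str, token: str) -> Request:
    """Build (but do not send) a GitHub code search request scoped to one repo."""
    query = f"{symbol} repo:{repo}"
    url = "https://api.github.com/search/code?" + urlencode({"q": query})
    return Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # code search requires authentication
        },
    )


# Hypothetical inputs: a symbol pulled from the console error and a placeholder repo.
req = build_code_search_request("example/todo-app", "removeTask", "TOKEN")
```

Passing `req` to `urllib.request.urlopen` would perform the search; the symbol to search for would come from the hypothesis or the captured stack trace.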

The outer loop is the longer-term moat. The more runs Triage does on a codebase, the richer the scored history in Arize becomes, and the smarter the starting position on the next run. That compound improvement is what turns a hackathon demo into something a team would actually run every morning.