Agent SRE: autonomous debugging for AI agents

Inspiration

We kept hitting the same wall: AI agents fail silently in production. A triage agent calls the wrong tool, reads an empty log, and confidently declares an outage "healthy" — and nobody notices until a customer does. Today, debugging that means a human engineer scrolling through traces, guessing which span went wrong, and hand-editing a prompt.

We asked: what if the on-call engineer for your AI agents was itself an agent? Site Reliability Engineers keep distributed systems healthy with a tight observe → diagnose → fix → verify loop. We wanted to give agents the same thing — an autonomous SRE that reads another agent's traces and repairs it — built natively on Arize Phoenix observability.

## What it does

Agent SRE is a reliability agent that debugs other AI agents. It watches a target agent (a DevOps incident-triage agent) through its Arize Phoenix traces and runs a full six-step reliability loop, all autonomously, in one click:

Diagnose — reads the failing agent's OpenInference spans via the Phoenix MCP server, and reasons across them to find the causal mistake (the agent called get_pod_logs("payment"), got an empty result, and declared the service healthy).
Measure — scores the agent's output against a rubric: 0/1 FAIL.
Fix — patches the agent's prompt and writes the new version straight into Phoenix's managed prompt store via MCP upsert-prompt.
Verify — re-runs the exact same incident against the live fixed prompt: the verdict flips to 1/1 PASS and the agent now correctly pages the on-call team.
Guard — replays a golden set of incidents as a Phoenix experiment to prove the fix didn't break other cases (4/4).
Prevent — saves the original failing case as a permanent regression example in a Phoenix dataset so the bug can never silently return.

A live cockpit streams the whole loop step-by-step, and every claim deep-links into the real Phoenix object that backs it — the trace, the prompt version history, the experiment, and the dataset. There's also a drift watch that continuously triages a second, intermittent bug from real traces.

## How we built it

Target agent & SRE agent: both built on Google's Agent Development Kit (ADK) with LlmAgent running Gemini 2.5 Flash via Vertex AI (ADC auth — no keys in the image).
Observability: Arize Phoenix with OpenInference auto-instrumentation (phoenix.otel.register(auto_instrument=True)). Every agent run emits spans Phoenix can store and serve.
The agent-to-agent bridge: the SRE agent reaches into Phoenix through the Phoenix MCP server (@arizeai/phoenix-mcp), using list-traces, get-spans, upsert-prompt, and add-dataset-examples as tools. This is the core idea — one agent operating on another agent's telemetry through a standard tool protocol.
Live, not canned: the fix is real because the target reads its prompt from Phoenix at runtime and we rebuild the agent per run. Patch the prompt in Phoenix → the next run genuinely behaves differently.
Backend: FastAPI streaming the loop over SSE; a React + Vite + TypeScript cockpit consuming it live.
Ship: a multi-stage Docker image (Python 3.12 and Node 20, because the app spawns the MCP server via npx) deployed to Google Cloud Run, with the Phoenix key in Secret Manager and a concurrency lock + hourly cap to bound abuse on a public URL.

## Challenges we ran into

The fix wouldn't take. Verify kept failing even after we patched the prompt. The cause: Phoenix's upsert-prompt strips separators from prompt names (incident-triage-agent → incidenttriageagent), so the SRE was patching a different prompt than the agent was reading. We aligned both on the normalized identifier.
Traces routed to the wrong project. Importing the target package eagerly ran setup_tracing(), locking the global OpenTelemetry tracer to the wrong Phoenix project — so 14 drift-watch traces landed in the wrong place. We fixed it by making the package expose its agent lazily via __getattr__.
Guard exposed a lazy fix. Our first prompt fix passed Verify but failed Guard (3/4) — because the original prompt hardcoded the payments service, so it "worked" only for one incident. Guard forced a precise fix: we generalized the prompt to triage whatever service was actually reported, and made the rule contradiction-aware ("empty logs + a high error rate ≠ healthy").
A container that needs two runtimes. adk deploy couldn't help us — our app spawns a Node MCP server and runs Python. We hand-built a multi-stage image with both, pre-warming the pinned MCP package so runtime spawns never hit the registry.
Security pass: caught the Phoenix API key leaking into npx argv (visible in ps) and moved it to the subprocess environment instead.

## Accomplishments that we're proud of

A genuinely autonomous, end-to-end repair — diagnose to prevent — with zero human prompt-editing, where the re-run is provably live rather than scripted.
Agent-debugging-agent through MCP: the SRE never imports the target's internals; it works entirely through Phoenix traces and tools, the way a real SRE works through observability.
Every claim is verifiable. Each step in the cockpit deep-links to the actual Phoenix trace, prompt version, experiment, and dataset — no hand-waving.
Guard caught us cheating. The regression gate rejected our first "fix" and pushed us to a correct, general one — exactly the safety property the loop is supposed to give agents.
Shipped public and live on Cloud Run, cold-start to working demo.

## What we learned

Observability is the right interface for agent autonomy. Traces are a clean, standard contract; an agent that reasons over them can fix problems it was never explicitly programmed for.
MCP turns a monitoring platform into an action surface. Reading a trace and writing a corrected prompt back through the same protocol is what makes the loop closed rather than advisory.
"It passed" is not "it's correct." Verify and Guard measure different things — a single green run can hide a brittle, overfit fix. You need the regression gate.
The unglamorous details (name normalization, tracer singletons, eager imports) are where agent systems quietly break.

## What's next for Agent SRE: autonomous debugging for AI agents

More fix types beyond prompts: tool-schema corrections, retrieval/grounding fixes, and routing changes — all proposed and verified the same way.
Trigger on real drift: wire the drift watch to open an incident and run the loop automatically when live traces cross an eval threshold, instead of on click.
Human-in-the-loop approvals for production-grade safety: propose the fix, show the Verify + Guard evidence, and require a one-click sign-off before shipping.
Multi-agent fleets: point Agent SRE at many target agents and let it triage the whole fleet from one Phoenix space.
Richer guardrails via Phoenix experiments — larger golden sets, automatic regression-case mining from every new failure.

Built With

arize-phoenix
fastapi
gemini
gemini-2.5-flash
google-adk
google-cloud-run
mcp
model-context-protocol
node.js
opentelemetry
python
react
secret-manager
server-sent-events
typescript
vertex-ai

Updates

P J started this project — Jun 11, 2026 07:05 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.