Inspiration

I kept coming back to something that bothers me. We instrument our agents with great observability tools like Phoenix, but then nobody actually watches the dashboards. Quality scores can drift for hours, and nobody notices until a customer complains.

So the obvious question: what if a second agent watched the dashboard for you? One that reads the traces, spots the regression, figures out the cause, drafts the fix, and pings a human in Slack with one button to apply it.

That's Mender.

What it does

Every fifteen minutes Mender wakes up on Cloud Run and connects to Phoenix through their MCP server. He scans the last hour of traces from a target agent (in the demo it's a fake fintech support bot called FinPay), clusters the failures, and tries to name the pattern.
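
If it helps to see the shape of it, each cycle boils down to something like this. A simplified sketch; the helper names are placeholders I made up for illustration, not the real module layout:

```python
# Simplified shape of one Mender cycle; every callable is passed in,
# since the real module layout isn't shown here.
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable

INTROSPECTION_WINDOW = timedelta(hours=1)  # scan the last hour of traces

def run_cycle(
    fetch_traces: Callable[[datetime], list],      # reads spans via the Phoenix MCP server
    cluster_failures: Callable[[list], Iterable],  # groups similar failing spans
    handle_incident: Callable[[object], None],     # eval -> patch -> re-eval -> Slack
) -> None:
    since = datetime.now(timezone.utc) - INTROSPECTION_WINDOW
    for pattern in cluster_failures(fetch_traces(since)):
        handle_incident(pattern)
```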

When he finds something, he does the thing a human SRE would do: generates a focused eval set targeting the regression, runs it against the broken agent to confirm the failure, drafts a prompt patch, runs the same evals against the patched version, and only escalates if the patch measurably improves things.
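
The escalation gate is basically a before/after comparison on the same eval set. A hedged sketch, with a made-up threshold:

```python
# Hypothetical escalation gate: only ping a human when the patch
# measurably beats the broken baseline on the same eval set.
from typing import Callable, Sequence

MIN_LIFT = 0.2  # illustrative threshold, not the real config value

def pass_rate(results: Sequence[bool]) -> float:
    return sum(results) / len(results) if results else 0.0

def should_escalate(
    eval_cases: Sequence[str],
    run_agent: Callable[[str, str], bool],  # (prompt_version, case) -> passed?
) -> bool:
    baseline = pass_rate([run_agent("broken", c) for c in eval_cases])
    patched = pass_rate([run_agent("patched", c) for c in eval_cases])
    return (patched - baseline) >= MIN_LIFT
```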

The escalation goes to Slack as an interactive incident card. One click to apply, one click to discard. If you approve, the patch goes live on Cloud Run and Mender goes back to sleep.
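
The card itself is ordinary Slack Block Kit. Something in this shape, where the action IDs and incident fields are illustrative:

```python
# Hedged sketch of the incident card as a Slack Block Kit payload.
def incident_blocks(incident_id: str, summary: str,
                    before: int, after: int, total: int) -> list[dict]:
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*Mender incident* `{incident_id}`\n{summary}\n"
                          f"Evals: {before}/{total} -> {after}/{total} with patch"}},
        {"type": "actions",
         "elements": [
             {"type": "button", "action_id": "apply_patch", "style": "primary",
              "text": {"type": "plain_text", "text": "Apply"}, "value": incident_id},
             {"type": "button", "action_id": "discard_patch", "style": "danger",
              "text": {"type": "plain_text", "text": "Discard"}, "value": incident_id},
         ]},
    ]
```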

The kicker: Mender also reads his own traces every cycle. If his eval generation is too noisy or his confidence threshold is off, he tunes himself for the next run.
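
The tuning is a simple heuristic, not magic. Roughly this kind of thing, where the signals and bounds are invented for illustration:

```python
# Illustrative self-tuning heuristic; the real signals come from
# Mender's own Phoenix traces, and these numbers are made up.
def tune_confidence_threshold(current: float, false_positive_rate: float) -> float:
    if false_positive_rate > 0.3:      # too many noisy incidents last cycle
        return min(current + 0.05, 0.95)
    if false_positive_rate < 0.05:     # maybe too conservative
        return max(current - 0.05, 0.5)
    return current
```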

How I built it

The agent runtime is Google's Agent Development Kit in Python. Reasoning runs on Gemini 3 via Vertex AI. The whole thing is deployed as two Cloud Run services (mender plus the FinPay target), triggered by Cloud Scheduler every fifteen minutes.

For observability I'm using Arize Phoenix Cloud, and Mender talks to it through the official Phoenix MCP server. That MCP integration is the heart of the project. It's how Mender introspects operational data at runtime, both for the target agent and for himself.
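
For the curious, wiring a Python process to the Phoenix MCP server looks roughly like this with the official mcp SDK. The npx flag names follow the @arizeai/phoenix-mcp README as I remember it, so double-check them against the current version:

```python
# Hedged sketch: connect to the Phoenix MCP server over stdio and list
# its tools. Flag names for @arizeai/phoenix-mcp may differ by version.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="npx",
    args=["-y", "@arizeai/phoenix-mcp@latest",
          "--baseUrl", os.environ["PHOENIX_BASE_URL"],
          "--apiKey", os.environ["PHOENIX_API_KEY"]],
)

async def list_phoenix_tools() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

asyncio.run(list_phoenix_tools())
```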

State lives in Firestore. Slack actions go through an HMAC-verified webhook that hits a FastAPI endpoint on the Mender service. Secrets are in Google Secret Manager.
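
The Firestore side is plain document writes. A minimal sketch, assuming an incidents/{id} layout; the collection and field names are illustrative:

```python
# Hypothetical incident state in Firestore; field names are made up.
from google.cloud import firestore

db = firestore.Client()

def record_incident(incident_id: str, payload: dict) -> None:
    db.collection("incidents").document(incident_id).set(
        {**payload, "status": "pending",
         "created_at": firestore.SERVER_TIMESTAMP})

def mark_applied(incident_id: str) -> None:
    db.collection("incidents").document(incident_id).update(
        {"status": "applied", "resolved_at": firestore.SERVER_TIMESTAMP})
```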

I also built a small web UI (also on Cloud Run) so you can see eval pass/fail badges and incident history without opening Slack.

Stuff that broke

A lot, honestly.

The Phoenix MCP server kept timing out on cold starts during heartbeats. I bumped the timeout from 60 to 180 seconds and tightened the introspection window so Mender wasn't pulling 4 hours of traces every cycle.
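
The actual fix was two knobs. The values are the ones mentioned above; where they live in the real config isn't shown here:

```python
# The two knobs that fixed the cold-start timeouts.
MCP_TIMEOUT_SECONDS = 180  # was 60; Phoenix MCP cold starts need headroom
TRACE_WINDOW_HOURS = 1     # was 4; still enough to catch a fresh regression
```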

I blew through Gemini's million-token context window once because the agent loop was accumulating too much trace data. Fixed by capping snapshot loads at 20 and trimming the prompt aggressively before it goes to the model.
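
The guardrails are blunt but effective. A sketch, where the snapshot cap is the real number and the character budget is illustrative:

```python
# Hedged sketch of the context-budget guardrails.
MAX_SNAPSHOTS = 20            # hard cap on trace snapshots per cycle
PROMPT_CHAR_BUDGET = 400_000  # illustrative, well under the context window

def build_context(snapshots: list[str]) -> str:
    kept = snapshots[-MAX_SNAPSHOTS:]      # assume chronological order: keep newest
    text = "\n\n".join(kept)
    if len(text) > PROMPT_CHAR_BUDGET:
        text = text[-PROMPT_CHAR_BUDGET:]  # drop the oldest overflow
    return text
```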

Annotation Configs in Phoenix were tricky. There's a real distinction between span-level and trace-level annotations, and I had to redo the schema once after picking the wrong scope. Worth it, though: the annotation chips in the Phoenix UI populate live now.

Slack interactivity needed a real signing-secret check, which sent me down a fun rabbit hole because Cloud Run's request body handling clashed with Slack's HMAC verification on raw bytes. Got it working with a custom middleware.
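
For anyone hitting the same wall: the key is to verify Slack's signature over the untouched raw bytes before anything parses the body. Slack signs v0:{timestamp}:{body} with your signing secret. A simplified sketch of the check, shown as a route-level check rather than the middleware I actually used:

```python
# Sketch of Slack request verification over the raw body, per Slack's
# signing recipe. FastAPI wiring is simplified for illustration.
import hashlib
import hmac
import os
import time

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
SIGNING_SECRET = os.environ["SLACK_SIGNING_SECRET"].encode()

def verify_slack(raw_body: bytes, timestamp: str, signature: str) -> bool:
    if abs(time.time() - int(timestamp)) > 60 * 5:  # reject replays
        return False
    base = b"v0:" + timestamp.encode() + b":" + raw_body
    expected = "v0=" + hmac.new(SIGNING_SECRET, base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

@app.post("/slack/actions")
async def slack_actions(request: Request):
    raw = await request.body()  # must be the untouched bytes Slack signed
    if not verify_slack(raw,
                        request.headers.get("X-Slack-Request-Timestamp", "0"),
                        request.headers.get("X-Slack-Signature", "")):
        raise HTTPException(status_code=401)
    ...  # parse the interaction payload and apply/discard the patch
    return {"ok": True}
```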

What I'm proud of

The end-to-end loop actually runs and produces a real, measurable improvement. In the demo, a regressed FinPay v2 (which silently defaults ambiguous source currencies to USD) goes from 4 out of 10 passing on the eval set to 10 out of 10 after Mender's patch. That's a 60-point jump in pass rate, and it's reproducible.

The self-introspection part also works for real. Mender reads his own Phoenix traces and adjusts his cycle parameters. Felt good when I saw him do that the first time without me prompting him.

What I learned

MCP is the right interface for this category of agent. Giving Mender a structured way to read operational data felt cleaner than building yet another integration. Phoenix being on board with MCP made the whole thing possible.

Also: Gemini 3 via Vertex AI is fast enough that this kind of "agent watching another agent" loop is actually viable on a 15-minute cadence without burning through quota.

And finally, evals are infrastructure. Having LLM-as-judge scoring wired up from day one made every other decision easier because I always had a number to compare against.
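
If you haven't wired this up before, an LLM-as-judge is shorter than it sounds. A minimal sketch with the google-genai SDK; the rubric and model ID are placeholders, not Mender's actual config:

```python
# Minimal LLM-as-judge sketch against Vertex AI via the google-genai SDK.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")
MODEL = "gemini-2.5-flash"  # placeholder; the post uses Gemini 3

def judge(question: str, answer: str) -> bool:
    prompt = (
        "You are grading a support-bot answer. Reply PASS or FAIL only.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "FAIL if the answer assumes a currency the user never specified."
    )
    result = client.models.generate_content(model=MODEL, contents=prompt)
    return result.text.strip().upper().startswith("PASS")
```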

What's next

Three things I'd want to do next:

First, support more target agents. Right now FinPay is a fake bot built for the demo. The pipeline is generic but I haven't pointed it at anything else yet.

Second, broader patch types. Mender currently only patches the system prompt. Tool descriptions, model selection, and few-shot examples are the obvious next targets.

Third, learning across cycles. Right now Mender's self-tuning is per-cycle. I'd love for him to remember which kinds of regressions he's good at fixing and which ones he should escalate immediately without trying.

Built With

Python, Google Agent Development Kit, Gemini via Vertex AI, Cloud Run, Cloud Scheduler, Firestore, Arize Phoenix, MCP, FastAPI, Slack API, Google Secret Manager