Agent Reliability Guard

Inspiration

Every team shipping a Gemini agent has the same quiet fear: it worked in staging, but in production it's getting slower, more expensive, and subtly worse. Prompt edits, tool-schema changes, and model swaps cause regressions that don't crash anything — they just burn tokens and trust. You find out from a user complaint, a slow dashboard, or a surprise bill at the end of the month.

We wanted to build the safety net we wished we had: a meta-agent whose only job is to watch other agents and call out when one is drifting. The Dynatrace track was the natural home — its MCP server gives DQL over live OpenTelemetry data, Davis analyzers for changepoint detection and forecasting, and notebooks as a real action surface. Nothing about this story works without real runtime data, and Dynatrace was the only partner that gave us all of it from one integration.

What it does

Agent Reliability Guard is a Gemini-powered investigation agent. When a regression is suspected — on demand, on a schedule, or on a release marker — it:

Queries Dynatrace via the MCP gateway for the affected app's runtime signals: p95 latency, tokens per request, and tool error rate, sliced by release.
Runs the Davis changepoint analyzer to locate the exact boundary where the regression began.
Runs the Davis forecasting analyzer to project the cost and latency impact if it stays live for a week.
Writes a real Dynatrace notebook with the evidence, the probable root cause, and the recommended fix.
Notifies the owner and returns a structured RCA to a live dashboard.

The whole investigation streams to a Next.js dashboard over Server-Sent Events, so the operator watches the agent think — every thought, every tool call, every result.

The killer demo: a deliberately broken refund-assistant whose prompt v12 retries refund_check on ambiguous queries. One click, and the Guard catches the regression, explains it ("prompt v12 created a tool retry loop"), and reports it will waste ~$3,400 in tokens and 18 hours of cumulative user wait time over the next week if ignored.

How we built it

Three services, all demoable from one laptop.

Guard agent (agent/): a FastAPI /investigate SSE service. The loop is a ReAct controller around Gemini 2.5 Flash on Vertex AI with forced function-calling. The key design move: the model emits multiple tool calls in one turn, and we fan them out concurrently with asyncio.gather — the three read tools run in parallel inside a single Gemini turn, which is where most of the latency lives.
Demo app (demo-app/): a real Gemini refund-assistant with full OpenTelemetry instrumentation and a healthy/bad mode switch that produces the actual regression in the telemetry.
Frontend (frontend/): a Next.js 14 dashboard — live thinking panel, tool timeline, metrics strip, investigation card — streaming the agent over SSE.
Replay mode: a deterministic, paced run of the real SSE pipeline (only the model and Dynatrace calls stubbed) so the demo is identical every time and cannot fail on stage.

Challenges we ran into

The MCP tool surface wasn't what the docs implied. We assumed execute_dql and a generic execute_davis_analyzer. The live platform gateway actually exposes hyphenated, purpose-built tools: execute-dql, timeseries-novelty-detection (changepoint), and timeseries-forecast. We discovered this by calling tools/list against our own tenant and rewrote the tool layer to the real surface, with graceful fallbacks across builds.
Tool errors hide inside successful responses. The gateway returns failures (missing scope, unknown tool) as isError: true inside a 200 response, not as a JSON-RPC error — so a permission gap looked like an empty success. We now surface those and fall back.
The gateway has no notebook-write tool. So we create a real Dynatrace notebook by posting directly to the Documents REST API with the platform token's document:documents:write scope — a genuine, shareable, multi-section artifact, assembled from the live investigation evidence. Verified end-to-end against our tenant.
Keeping the demo bulletproof. Live Vertex + Dynatrace during a recording is a risk multiplier, so the replay path was non-negotiable.

Accomplishments we're proud of

The Guard runs the investigation we'd actually want to run by hand at 3am — not a chatbot wrapper.
Parallel tool calls inside one Gemini turn cut investigation latency by the depth of the read fan-out.
The notebook is a real Dynatrace artifact, generated from real evidence — not a mockup.
Replay mode means the demo cannot fail on stage: same SSE pipeline, deterministic data.

What we learned

Forced function calling + parallel tool execution in one Gemini turn is the single biggest agent-latency win we found.
Verify the partner's tool surface against your own tenant, not the docs — names and shapes drift.
Dynatrace as an action surface (notebooks + workflows) is what makes the agent feel autonomous instead of advisory.

What's next

Auto-trigger on Dynatrace release events, so every deploy of a tagged Gemini app is investigated automatically.
A compare-two-releases mode and a library of tenant-specific anomaly detectors.
Auto-remediation: open the rollback PR and capture it as a release marker so the next investigation confirms recovery.