Inspiration
On-call engineers don't lack dashboards — they lack time. The hard, risky part of an incident isn't seeing that latency spiked; it's diagnosing why and safely acting at 2 AM. Most "AIOps" tools stop at the alert — or, worse, act fully autonomously and scare everyone off. We wanted an agent that runs the whole loop, including the fix, while keeping a human exactly where the stakes are: the moment before it touches production.
What it does
Remediator is an AIOps action agent. It runs the full incident loop:
- Detects a Cloud Run service degradation through the official Dynatrace MCP server (`list_problems`, plus a live `execute_dql` query over response time as a resilient fallback).
- Diagnoses the root cause with Gemini, correlating the latency spike with a recent deployment event to identify a bad-deploy regression.
- Proposes one concrete, reversible remediation (roll back to the last healthy revision).
- Stops at a hard human-approval gate — nothing executes until a person approves in the oversight console.
- Executes a real Cloud Run traffic rollback from the faulty revision to the healthy one.
- Verifies recovery back in Dynatrace and closes the loop.
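The loop above can be sketched as a small state machine. This is a minimal illustration, not Remediator's actual code: `Stage`, `Incident`, `run_loop`, and the three injected callables are hypothetical names chosen for the example. It shows the one property that matters, namely that the execute step is unreachable unless the approval callable returns an explicit `True`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    DETECTED = "detected"
    DIAGNOSED = "diagnosed"
    PROPOSED = "proposed"
    AWAITING_APPROVAL = "awaiting_approval"
    REJECTED = "rejected"
    EXECUTED = "executed"
    VERIFIED = "verified"


@dataclass
class Incident:
    service: str
    timeline: List[Stage] = field(default_factory=list)


def run_loop(
    incident: Incident,
    approve: Callable[[Incident], bool],   # hypothetical: blocks until a human decides
    execute: Callable[[Incident], None],   # hypothetical: performs the rollback
    verify: Callable[[Incident], bool],    # hypothetical: re-checks observability data
) -> Stage:
    """Walk an incident through the loop; execute only after approval."""
    for stage in (Stage.DETECTED, Stage.DIAGNOSED, Stage.PROPOSED,
                  Stage.AWAITING_APPROVAL):
        incident.timeline.append(stage)

    # Hard gate: anything other than an explicit True stops the loop here.
    if approve(incident) is not True:
        incident.timeline.append(Stage.REJECTED)
        return Stage.REJECTED

    execute(incident)
    incident.timeline.append(Stage.EXECUTED)

    # Only mark the incident verified if recovery is actually observed.
    if verify(incident):
        incident.timeline.append(Stage.VERIFIED)
        return Stage.VERIFIED
    return Stage.EXECUTED
```

Passing the approval, execution, and verification steps in as callables keeps the gate logic testable without touching any real service.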
How we built it
- Google Agent Development Kit (Agent Builder) orchestrates the agent and its tools, with Gemini 3.5 Flash as the reasoning model, routed via the ADK `LiteLlm` adapter so the provider is swappable (Vertex / AI-Studio paths included).
- The Dynatrace MCP server is the agent's eyes and hands into observability: `list_problems`, `execute_dql` over Grail, and deployment events.
- A FastAPI + Server-Sent-Events oversight console renders the live timeline and enforces the approval gate.
- A CloudRunExecutor performs the real remediation via the Cloud Run Admin API (traffic split between revisions).
- A sample checkout service instrumented with OpenTelemetry → Dynatrace is the real, breakable target, with healthy and faulty revisions.
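To make the "traffic split between revisions" step concrete, here is a hedged sketch of how a rollback's traffic block might be built. The dictionary shape follows the Cloud Run Admin API v2 `TrafficTarget` JSON representation (`type`, `revision`, `percent`). The function name `build_rollback_traffic` and the revision names in the usage note are assumptions for illustration, not the project's `CloudRunExecutor`.

```python
from typing import Dict, List, Union

# Enum value from the Cloud Run Admin API v2 TrafficTargetAllocationType.
REVISION_TYPE = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"

TrafficTarget = Dict[str, Union[str, int]]


def build_rollback_traffic(faulty_revision: str,
                           healthy_revision: str) -> List[TrafficTarget]:
    """Return a traffic list that pins 100% of requests to the healthy revision."""
    if not healthy_revision:
        raise ValueError("a healthy revision name is required")
    if healthy_revision == faulty_revision:
        raise ValueError("healthy and faulty revisions must differ")

    # A single target at 100% implicitly drops the faulty revision to 0%.
    return [{
        "type": REVISION_TYPE,
        "revision": healthy_revision,
        "percent": 100,
    }]
```

For example, `build_rollback_traffic("checkout-00002", "checkout-00001")` yields one target for `checkout-00001`. The resulting list would be set as the service's `traffic` field in an update call. Because only traffic routing changes and the faulty revision is left in place, the action stays reversible.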
Challenges we ran into
- Model availability: Gemini 3 wasn't on Vertex for our project, and the AI-Studio free tier capped at 5 RPM, so we abstracted the model provider behind `LiteLlm` and routed through an OpenAI-compatible gateway (OpenRouter). The model id stays the same; the provider is a one-line config switch.
- Detection on a fresh trial: on an OTLP-only Dynatrace trial, the classic metric-event alerting engine wouldn't fire (a Mint-vs-Grail maturity gap), so we made detection resilient: the agent falls back to a live DQL query on response time instead of depending on a pre-correlated Davis problem.
- Lost telemetry: Cloud Run CPU throttling froze the OTLP background export thread between requests; `--no-cpu-throttling` fixed it.
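The detection fallback described above might look roughly like the following. This is a sketch under stated assumptions: the two callables stand in for the MCP tool calls (`list_problems` and a response-time query issued through `execute_dql`), and their signatures, the `threshold_ms` parameter, and the returned dictionary shape are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def detect(
    list_problems: Callable[[], Sequence[dict]],            # hypothetical wrapper
    query_response_time_ms: Callable[[], Optional[float]],  # hypothetical wrapper
    threshold_ms: float,
) -> Optional[dict]:
    """Prefer a pre-correlated problem; fall back to a live latency query."""
    try:
        problems = list_problems()
    except Exception:
        # Treat a failing problems call like an empty result and keep going.
        problems = []

    if problems:
        return {"source": "problem", "detail": problems[0]}

    # Fallback: no problem was raised, so check response time directly.
    latency = query_response_time_ms()
    if latency is not None and latency > threshold_ms:
        return {"source": "dql", "detail": {"response_time_ms": latency}}

    return None  # Nothing detected by either path.
```

Recording which path fired (`source`) lets the oversight timeline show whether an incident came from a problem record or from the raw query.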
Accomplishments that we're proud of
- The entire loop runs live on real infrastructure — real Dynatrace MCP, real Gemini reasoning, a real Cloud Run rollback — not a simulation.
- Two safety properties baked into the design, not bolted on: a hard human-approval gate and a scoped mandate (the agent can only ever invoke a tiny, fixed set of reversible actions on one declared service).
- A deterministic, credential-free hosted demo so anyone can watch the full loop in a browser.
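The scoped mandate mentioned above can be illustrated with a small allowlist check. This is a minimal sketch, not the project's implementation: `Mandate`, `MandateViolation`, `authorize`, and the action name `rollback_traffic` are hypothetical names used only for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet


class MandateViolation(Exception):
    """Raised when a requested action falls outside the declared scope."""


@dataclass(frozen=True)
class Mandate:
    service: str                     # the single service the agent may touch
    allowed_actions: FrozenSet[str]  # the fixed set of reversible actions

    def permits(self, action: str, service: str) -> bool:
        return service == self.service and action in self.allowed_actions


def authorize(mandate: Mandate, action: str, service: str) -> None:
    """Raise unless the action and target service are both in scope."""
    if not mandate.permits(action, service):
        raise MandateViolation(
            f"{action!r} on {service!r} is outside the declared mandate"
        )


# Example mandate: one service, one reversible action.
CHECKOUT_MANDATE = Mandate("checkout", frozenset({"rollback_traffic"}))
```

Making the mandate a frozen dataclass with a `frozenset` means the scope cannot be widened at runtime by the agent's own tool calls.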
What we learned
Trust — not capability — is the bottleneck for autonomous operations. Constraining what an agent is allowed to do, and forcing a human checkpoint at the irreversible step, is what makes "an agent that fixes prod" something a team would actually deploy.
What's next for Remediator
More remediation types (scale-out, config rollback, feature-flag disable), multi-service correlation, and a policy layer so teams can declare which actions are auto-approvable vs. gated.