Inspiration

On-call engineers don't lack dashboards — they lack time. The hard, risky part of an incident isn't seeing that latency spiked; it's diagnosing why and safely acting at 2 AM. Most "AIOps" tools stop at the alert — or, worse, act fully autonomously and scare everyone off. We wanted an agent that runs the whole loop, including the fix, while keeping a human exactly where the stakes are: the moment before it touches production.

What it does

Remediator is an AIOps action agent. It runs the full incident loop:

  • Detects a Cloud Run service degradation through the official Dynatrace MCP server (list_problems, plus a live execute_dql query over response time as a resilient fallback).
  • Diagnoses the root cause with Gemini, correlating the latency spike with a recent deployment event to identify a bad-deploy regression.
  • Proposes one concrete, reversible remediation (roll back to the last healthy revision).
  • Stops at a hard human-approval gate — nothing executes until a person approves in the oversight console.
  • Executes a real Cloud Run traffic rollback from the faulty revision to the healthy one.
  • Verifies recovery back in Dynatrace and closes the loop.
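
The loop above can be sketched as a small state machine in which execution is unreachable without approval. This is a minimal illustration, not Remediator's actual code: the `Stage`, `Incident`, and `run_loop` names and the injected `approve`, `execute`, and `verify` callables are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    DETECTED = "detected"
    DIAGNOSED = "diagnosed"
    PROPOSED = "proposed"
    AWAITING_APPROVAL = "awaiting_approval"
    EXECUTED = "executed"
    VERIFIED = "verified"
    REJECTED = "rejected"


@dataclass
class Incident:
    service: str
    timeline: List[Stage] = field(default_factory=list)


def run_loop(
    incident: Incident,
    approve: Callable[[Incident], bool],
    execute: Callable[[Incident], None],
    verify: Callable[[Incident], bool],
) -> Incident:
    """Walk the incident through the loop; execute() runs only after approve()."""
    # Read-only stages: nothing here touches production.
    for stage in (
        Stage.DETECTED,
        Stage.DIAGNOSED,
        Stage.PROPOSED,
        Stage.AWAITING_APPROVAL,
    ):
        incident.timeline.append(stage)

    # Hard gate: a rejection ends the loop before any action is taken.
    if not approve(incident):
        incident.timeline.append(Stage.REJECTED)
        return incident

    execute(incident)
    incident.timeline.append(Stage.EXECUTED)

    # Recovery is recorded only if the verification check passes.
    if verify(incident):
        incident.timeline.append(Stage.VERIFIED)
    return incident
```

In this sketch the gate is a plain function call, so the only path to `execute` runs through a truthy `approve`. A real console would presumably block on a human decision at that point instead.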

How we built it

  • Google Agent Development Kit (Agent Builder) orchestrates the agent and its tools, with Gemini 3.5 Flash as the reasoning model — routed via the ADK LiteLlm adapter so the provider is swappable (Vertex / AI-Studio paths included).
  • The Dynatrace MCP server is the agent's eyes and hands into observability — list_problems, execute_dql over Grail, and deployment events.
  • A FastAPI + Server-Sent-Events oversight console renders the live timeline and enforces the approval gate.
  • A CloudRunExecutor performs the real remediation via the Cloud Run Admin API (traffic split between revisions).
  • A sample checkout service instrumented with OpenTelemetry → Dynatrace is the real, breakable target, with healthy and faulty revisions.
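
As a rough illustration of the traffic rollback, the sketch below builds the traffic list that would route all requests to the healthy revision. The `rollback_traffic` helper and the revision names are hypothetical. The dictionary fields are modeled on the traffic-target shape of the Cloud Run Admin API v2, which is an assumption about what the CloudRunExecutor sends, and no API call is made here.

```python
from typing import Dict, List, Sequence

# Allocation type for pinning traffic to a named revision
# (assumed to match the Cloud Run Admin API v2 enum value).
REVISION_TYPE = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"


def rollback_traffic(
    faulty_revision: str,
    healthy_revision: str,
    available_revisions: Sequence[str],
) -> List[Dict[str, object]]:
    """Return a traffic list that sends 100% of requests to the healthy revision."""
    if healthy_revision == faulty_revision:
        raise ValueError("healthy and faulty revisions must differ")
    if healthy_revision not in available_revisions:
        raise ValueError(f"unknown revision: {healthy_revision}")
    # Omitting the faulty revision from the list leaves it with no traffic.
    return [
        {
            "type": REVISION_TYPE,
            "revision": healthy_revision,
            "percent": 100,
        }
    ]
```

Because the change is only a traffic assignment, the faulty revision still exists afterwards, which is what keeps the action reversible.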

Challenges we ran into

  • Model availability: Gemini 3 wasn't on Vertex for our project, and the AI-Studio free tier capped at 5 RPM, so we abstracted the model provider behind LiteLlm and routed through an OpenAI-compatible gateway (OpenRouter). The model id stays the same; the provider is a one-line config switch.
  • Detection on a fresh trial: on an OTLP-only Dynatrace trial, the classic metric-event alerting engine wouldn't fire (a Mint-vs-Grail maturity gap), so we made detection resilient — the agent falls back to a live DQL query on response time instead of depending on a pre-correlated Davis problem.
  • Lost telemetry: Cloud Run CPU throttling froze the OTLP background export thread between requests; --no-cpu-throttling fixed it.
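
The detection fallback described above might look roughly like the following. This is a sketch under stated assumptions: `detect`, the two injected callables (stand-ins for the MCP `list_problems` and `execute_dql` tools), the p95 metric, and the 500 ms threshold are all hypothetical and not taken from the project.

```python
from typing import Callable, Optional, Sequence


def detect(
    list_problems: Callable[[], Sequence[dict]],
    query_p95_ms: Callable[[], Optional[float]],
    threshold_ms: float = 500.0,  # hypothetical threshold
) -> Optional[dict]:
    """Prefer a pre-correlated problem; otherwise fall back to a live latency query."""
    try:
        problems = list_problems()
    except Exception:
        # A failing problems lookup should not block the fallback path.
        problems = []

    if problems:
        return {"source": "problems", "detail": problems[0]}

    # Fallback: compare the queried response time against the threshold.
    latency = query_p95_ms()
    if latency is not None and latency > threshold_ms:
        return {
            "source": "dql",
            "detail": {"p95_ms": latency, "threshold_ms": threshold_ms},
        }
    return None
```

Injecting both lookups as callables keeps the decision logic testable without a Dynatrace tenant.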

Accomplishments that we're proud of

  • The entire loop runs live on real infrastructure — real Dynatrace MCP, real Gemini reasoning, a real Cloud Run rollback — not a simulation.
  • Two safety properties baked into the design, not bolted on: a hard human-approval gate and a scoped mandate (the agent can only ever invoke a tiny, fixed set of reversible actions on one declared service).
  • A deterministic, credential-free hosted demo so anyone can watch the full loop in a browser.
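
One way to express the scoped mandate is an immutable allowlist that is checked before any action runs. The `Mandate` and `MandateViolation` names and the `rollback_traffic` action id below are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import FrozenSet


class MandateViolation(Exception):
    """Raised when an action falls outside the declared mandate."""


@dataclass(frozen=True)
class Mandate:
    service: str
    allowed_actions: FrozenSet[str]

    def check(self, action: str, service: str) -> None:
        """Raise unless the action is allowlisted for the declared service."""
        if service != self.service:
            raise MandateViolation(f"service not in mandate: {service}")
        if action not in self.allowed_actions:
            raise MandateViolation(f"action not in mandate: {action}")


# Example mandate: one service, one reversible action.
CHECKOUT_MANDATE = Mandate(
    service="checkout",
    allowed_actions=frozenset({"rollback_traffic"}),
)
```

Making the dataclass frozen and the action set a `frozenset` means the mandate cannot be widened at runtime through ordinary attribute or set mutation.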

What we learned

Trust — not capability — is the bottleneck for autonomous operations. Constraining what an agent is allowed to do, and forcing a human checkpoint at the irreversible step, is what makes "an agent that fixes prod" something a team would actually deploy.

What's next for Remediator

More remediation types (scale-out, config rollback, feature-flag disable), multi-service correlation, and a policy layer so teams can declare which actions are auto-approvable vs. gated.

Built With

  • agent-builder
  • dynatrace
  • fastapi
  • gemini
  • google-adk
  • google-cloud-run
  • litellm
  • mcp
  • model-context-protocol
  • opentelemetry
  • python
  • server-sent-events