Remediator

Inspiration

On-call engineers don't lack dashboards — they lack time. The hard, risky part of an incident isn't seeing that latency spiked; it's diagnosing why and safely acting at 2 AM. Most "AIOps" tools stop at the alert — or, worse, act fully autonomously and scare everyone off. We wanted an agent that runs the whole loop, including the fix, while keeping a human exactly where the stakes are: the moment before it touches production.

What it does

Remediator is an AIOps action agent. It runs the full incident loop:

Detects a Cloud Run service degradation through the official Dynatrace MCP server (list_problems, plus a live execute_dql query over response time as a resilient fallback).
Diagnoses the root cause with Gemini, correlating the latency spike with a recent deployment event to identify a bad-deploy regression.
Proposes one concrete, reversible remediation (roll back to the last healthy revision).
Stops at a hard human-approval gate — nothing executes until a person approves in the oversight console.
Executes a real Cloud Run traffic rollback from the faulty revision to the healthy one.
Verifies recovery back in Dynatrace and closes the loop.
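
The loop above can be sketched as a small state machine in which execution is unreachable without approval. This is a minimal illustration, not the project's actual code: the names Stage, IncidentRun, approve, and execute are hypothetical, and detection, diagnosis, and verification are reduced to timeline entries.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    DETECT = "detect"
    DIAGNOSE = "diagnose"
    PROPOSE = "propose"
    AWAIT_APPROVAL = "await_approval"
    EXECUTE = "execute"
    VERIFY = "verify"


@dataclass
class IncidentRun:
    # Hypothetical callbacks: `approve` stands in for the oversight console,
    # `execute` stands in for the real remediation call.
    approve: Callable[[str], bool]
    execute: Callable[[str], None]
    timeline: List[Stage] = field(default_factory=list)

    def run(self, proposal: str) -> bool:
        # Everything up to the gate is read-only from production's point of view.
        for stage in (Stage.DETECT, Stage.DIAGNOSE, Stage.PROPOSE, Stage.AWAIT_APPROVAL):
            self.timeline.append(stage)

        # Hard gate: without an explicit approval the run ends here.
        if not self.approve(proposal):
            return False

        self.timeline.append(Stage.EXECUTE)
        self.execute(proposal)
        self.timeline.append(Stage.VERIFY)
        return True
```

The point of the sketch is that the only call to execute sits behind the approve check, so a rejected or unanswered proposal never reaches production.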

How we built it

Google Agent Development Kit (Agent Builder) orchestrates the agent and its tools, with Gemini 3.5 Flash as the reasoning model, routed via the ADK LiteLlm adapter so the provider is swappable (Vertex AI and AI Studio paths included).
The Dynatrace MCP server is the agent's eyes and hands into observability — list_problems, execute_dql over Grail, and deployment events.
A FastAPI + Server-Sent-Events oversight console renders the live timeline and enforces the approval gate.
A CloudRunExecutor performs the real remediation via the Cloud Run Admin API (traffic split between revisions).
A sample checkout service instrumented with OpenTelemetry → Dynatrace is the real, breakable target, with healthy and faulty revisions.
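
To make the traffic-split remediation concrete, here is a rough sketch of how the rollback target list might be built. It assumes the REST shape of a Cloud Run Admin API v2 traffic target (type, revision, percent). The helper rollback_traffic and the revision names in the usage note are hypothetical, and the actual CloudRunExecutor may look quite different.

```python
from typing import Dict, List

# Allocation type for pinning traffic to a named revision, as named in the
# Cloud Run Admin API v2 TrafficTarget resource.
REVISION_TYPE = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"


def rollback_traffic(
    healthy_revision: str,
    faulty_revision: str,
    healthy_percent: int = 100,
) -> List[Dict[str, object]]:
    """Build a traffic list that shifts load from the faulty to the healthy revision.

    Hypothetical helper: it only builds the payload and does not call any API.
    """
    if not 0 <= healthy_percent <= 100:
        raise ValueError("healthy_percent must be between 0 and 100")

    targets: List[Dict[str, object]] = [
        {"type": REVISION_TYPE, "revision": healthy_revision, "percent": healthy_percent}
    ]
    remainder = 100 - healthy_percent
    if remainder:
        # Keep the percentages summing to 100 when the rollback is partial.
        targets.append(
            {"type": REVISION_TYPE, "revision": faulty_revision, "percent": remainder}
        )
    return targets
```

For example, rollback_traffic("checkout-healthy", "checkout-faulty") yields a single target sending all traffic to the healthy revision. Because the change is only a traffic assignment, the faulty revision still exists and the action can be reversed.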

Challenges we ran into

Model availability: Gemini 3 wasn't available on Vertex AI for our project, and the AI Studio free tier was capped at 5 requests per minute (RPM), so we abstracted the model provider behind LiteLlm and routed through an OpenAI-compatible gateway (OpenRouter). The model id stays the same; the provider is a one-line config switch.
Detection on a fresh trial: on an OTLP-only Dynatrace trial, the classic metric-event alerting engine wouldn't fire (a Mint-vs-Grail maturity gap), so we made detection resilient. The agent falls back to a live DQL query on response time instead of depending on a pre-correlated Davis problem.
Lost telemetry: Cloud Run CPU throttling froze the OTLP background export thread between requests, so telemetry was dropped; deploying with --no-cpu-throttling fixed it.
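
The detection fallback described above can be sketched as follows. This is an illustration under stated assumptions: list_problems and query_response_times_ms are hypothetical callables standing in for the MCP tools list_problems and execute_dql, the DQL text itself is left to the caller, and the mean-over-threshold rule is a placeholder for whatever check the agent actually applies.

```python
from typing import Callable, Iterable, Optional


def detect_degradation(
    list_problems: Callable[[], Iterable[str]],
    query_response_times_ms: Callable[[], Iterable[float]],
    threshold_ms: float,
) -> Optional[str]:
    """Return a short description of the degradation, or None if healthy."""
    # Preferred path: a pre-correlated problem already exists.
    problems = list(list_problems())
    if problems:
        return f"problem: {problems[0]}"

    # Fallback path: no problem was raised, so inspect raw response times.
    samples = list(query_response_times_ms())
    if samples:
        mean = sum(samples) / len(samples)
        if mean > threshold_ms:
            return f"latency: mean {mean:.0f} ms exceeds {threshold_ms:.0f} ms"
    return None
```

With this shape, an empty problem list does not mean "healthy"; it only means the agent has to look at the raw response-time data before deciding.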

Accomplishments that we're proud of

The entire loop runs live on real infrastructure — real Dynatrace MCP, real Gemini reasoning, a real Cloud Run rollback — not a simulation.
Two safety properties baked into the design, not bolted on: a hard human-approval gate and a scoped mandate (the agent can only ever invoke a tiny, fixed set of reversible actions on one declared service).
A deterministic, credential-free hosted demo so anyone can watch the full loop in a browser.
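
The scoped mandate mentioned above could be enforced with a check along these lines. This is a minimal sketch, not the project's implementation: Mandate, MandateViolation, and the action name used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


class MandateViolation(Exception):
    """Raised when the agent asks for something outside its mandate."""


@dataclass(frozen=True)
class Mandate:
    # The single declared service and the fixed set of reversible actions.
    service: str
    allowed_actions: FrozenSet[str]

    def check(self, action: str, service: str) -> None:
        # Reject any target other than the one declared service.
        if service != self.service:
            raise MandateViolation(f"service {service!r} is outside the mandate")
        # Reject any action that is not on the fixed allowlist.
        if action not in self.allowed_actions:
            raise MandateViolation(f"action {action!r} is not allowed")
```

A mandate such as Mandate("checkout", frozenset({"rollback_traffic"})) would let check("rollback_traffic", "checkout") pass and reject everything else. Making the dataclass frozen means the allowlist cannot be widened at runtime by the agent's own code paths.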

What we learned

Trust — not capability — is the bottleneck for autonomous operations. Constraining what an agent is allowed to do, and forcing a human checkpoint at the irreversible step, is what makes "an agent that fixes prod" something a team would actually deploy.

What's next for Remediator

More remediation types (scale-out, config rollback, feature-flag disable), multi-service correlation, and a policy layer so teams can declare which actions are auto-approvable vs. gated.

Built With

agent-builder
dynatrace
fastapi
gemini
google-adk
google-cloud-run
litellm
mcp
model-context-protocol
opentelemetry
python
server-sent-events

Updates

Kateryna Ivashchenko started this project — Jun 09, 2026 06:07 AM EDT
