PreventX — Autonomous AI SRE 🛡️

Stop firefighting. Start preventing. PreventX is a multi-agent AI Site Reliability Engineer that watches your stack through Dynatrace Grail, predicts failures, and self-heals them — in seconds, before a single user is impacted and before a single alert fires.

Inspiration

Every SRE knows the 3 AM page. By the time an alert fires, users are already hurting — abandoned checkouts, failed payments, churned customers. Industry surveys put unplanned downtime at up to $9,000 per minute, yet the entire monitoring industry is fundamentally reactive: it tells you something broke after it broke.

We asked a different question: what if an autonomous agent could act on the rising signal — before the threshold is crossed, before Davis even opens a Problem, before anyone gets paged? That's the leap from monitoring to prevention, and it's exactly what modern agentic AI makes possible.

What it does

PreventX runs a continuous, fully autonomous SRE loop with two modes:

  • 🟢 Prevention — it scans Grail with natural-language-generated DQL, spots anomalies that are rising but haven't tripped an alert yet, predicts the blast radius, and runs a remediation runbook (cache flush, rolling restart, autoscale, connection drain, rollback…) before the outage materializes.
  • 🔴 Resolution — when a real Dynatrace Problem does open, PreventX detects it, asks Davis Copilot for root cause, executes the fix, and closes the loop back in Dynatrace so the Problem moves to Closed — no human in the loop.
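
As a rough illustration of the prevention mode, the sketch below fits a least-squares line to recent samples of a metric and estimates how many samples remain before the alert threshold would be crossed. It is not PreventX's actual detector: the function name, the `TrendVerdict` type, and the threshold and horizon values are all hypothetical, and a real system would read the series from Grail rather than from a list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrendVerdict:
    slope_per_sample: float              # least-squares slope of the series
    samples_to_breach: Optional[float]   # None when the signal is flat or falling
    should_prevent: bool                 # True only for a rising, not-yet-breached signal


def assess_rising_signal(
    values: Sequence[float],
    alert_threshold: float,      # hypothetical alerting threshold for this metric
    horizon_samples: int,        # hypothetical look-ahead window, in samples
) -> TrendVerdict:
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")

    # Ordinary least-squares slope with x = 0, 1, ..., n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var

    latest = values[-1]
    if latest >= alert_threshold:
        # Already over the threshold: this is a resolution case, not prevention.
        return TrendVerdict(slope, 0.0, False)
    if slope <= 0:
        return TrendVerdict(slope, None, False)

    samples_to_breach = (alert_threshold - latest) / slope
    return TrendVerdict(slope, samples_to_breach, samples_to_breach <= horizon_samples)
```

With samples `[10, 12, 14, 16, 18]`, a threshold of 25, and a horizon of 5, the slope is 2 per sample and the projected breach is 3.5 samples away, so the verdict is to act early.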

Everything is shown on a live operator dashboard with a transparency layer: you can literally watch the agent think — every DQL query it generated, every Davis Copilot Q&A, every decision, with measured lead time, users protected, and cost avoided. In one demo cycle: 0 users impacted, $128K refund exposure avoided, 47-second autonomous MTTR.

How we built it

  • 🤖 Multi-agent core — Google Agent Development Kit (ADK): a Gemini 2.5 Flash coordinator (with the built-in planner / thinking budget) orchestrates five specialist sub-agents — anomaly detector, topology analyzer, RUM analyzer, prevention executor, and incident resolver — each a focused expert with its own toolset.
  • 🔌 Dynatrace MCP integration: every sub-agent is wired to the official Dynatrace MCP Server, using 18 tools across Grail, Davis, and Davis Copilot — execute_dql, generate_dql_from_natural_language, chat_with_davis_copilot, list_problems, create_dynatrace_notebook, and more.
  • 🧠 Grail + DQL + Davis Copilot: the agent turns natural-language intent into DQL, queries the Grail data lake, and uses Davis Copilot for root-cause reasoning — grounding every decision in real telemetry, not hallucination.
  • ☁️ Google Cloud, end to end: FastAPI + Server-Sent-Events dashboard on Cloud Run, triggered every few minutes by Cloud Scheduler, secrets in Secret Manager, Gemini served via Vertex AI — deployable with a single PowerShell command.
  • 🎬 Chaos harness: a built-in demo injector pushes synthetic logs, metrics, business events, and Problem-triggering events into Grail so the full prevention→resolution loop can be demonstrated live.
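
To make the Server-Sent-Events part of the dashboard concrete, here is a minimal sketch of how one agent trace step could be framed as an SSE message. The helper name, the event name, and the payload fields are assumptions for illustration only; the source does not describe PreventX's actual wire format.

```python
import json
from typing import Any, Dict, Optional


def format_sse(
    data: Dict[str, Any],
    event: Optional[str] = None,
    event_id: Optional[str] = None,
) -> str:
    """Frame a JSON payload as one Server-Sent-Events message."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload fits on a single data line.
    payload = json.dumps(data, separators=(",", ":"))
    lines.append(f"data: {payload}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Hypothetical trace step for the transparency layer.
    print(format_sse({"step": "dql"}, event="trace", event_id="1"), end="")
```

In a FastAPI app, strings like these would typically be yielded from a generator wrapped in a `StreamingResponse` with media type `text/event-stream`, and the browser would consume them with `EventSource`.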

Challenges we ran into

  • Long-running agentic tool calls. Davis Copilot calls can take 60–90 seconds; chaining them across the five sub-agents created MCP session timeout and cleanup races, and we had to harden session handling against them without losing results.
  • Trustworthy natural-language → DQL. Getting the agent to generate correct DQL and act only on grounded signals — not LLM guesses — required careful prompting, schema contracts, and a structured CycleReport the dashboard could parse safely (including repairing the invalid JSON escapes LLMs love to emit).
  • Closing the loop in Dynatrace. There's no "close problem" API for ingested custom alerts, so we engineered an event-timeout strategy to keep demo Problems alive and then auto-close them the instant the agent reports a successful resolution — real bi-directional write-back, not a simulation.
  • Running real LLM workloads in the cloud. Tuning Gemini concurrency and the Cloud Scheduler cadence so autonomous cycles run reliably within Vertex AI quotas.
  • Making the AI legible. The hardest UX problem wasn't showing data; it was showing reasoning. SREs won't trust an autonomous agent they can't audit, so transparency became a first-class feature.
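
The invalid-escape repair mentioned in the DQL bullet can be sketched as follows. This is one possible approach, not necessarily the one PreventX uses: `loads_lenient` is a hypothetical helper that tries a strict parse first and, only on failure, doubles any backslash that does not start a legal JSON escape.

```python
import json
import re
from typing import Any

# Matches a backslash, optionally followed by a legal JSON escape.
# Legal pairs are consumed whole, so "\\d" (escaped backslash, then d) is left alone.
_BACKSLASH = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})?')


def _fix_escape(match: "re.Match[str]") -> str:
    if match.group(1):
        return match.group(0)   # legal escape: keep as written
    return "\\\\"               # stray backslash: turn it into an escaped backslash


def loads_lenient(text: str) -> Any:
    """Parse JSON, repairing invalid backslash escapes only if strict parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(_BACKSLASH.sub(_fix_escape, text))


if __name__ == "__main__":
    # "\d" is not a legal JSON escape; "\n" is.
    raw = r'{"pattern": "\d+ errors", "note": "ok\n"}'
    print(loads_lenient(raw))
```

Trying the strict parse first means well-formed reports are never touched, and the repair pass only changes backslashes that would have been a parse error anyway.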

Accomplishments that we're proud of

  • A genuinely autonomous prevention→resolution loop running in production on Google Cloud — not a scripted demo.
  • A transparency layer that lets anyone watch the agent reason: live DQL, live Davis Copilot Q&A, and a step-by-step execution trace for every intervention.
  • Real bi-directional Dynatrace integration — the agent reads telemetry and writes back, closing Problems for true dashboard ↔ platform consistency.
  • A one-command deploy pipeline (Cloud Run + Scheduler + Secret Manager + IAM) that stands the whole system up from scratch.
  • Concrete, business-readable impact: outages prevented, users protected, cost avoided, and lead time — the metrics an executive actually cares about.

What we learned

  • Agentic design with ADK: how to decompose a hard operational problem into a coordinator + specialist sub-agents, and when thinking budgets actually pay off.
  • MCP goes deep: Model Context Protocol tools turn an LLM into an operator with real hands, but production reliability lives in the edge cases (timeouts, partial results, idempotency).
  • Observability for the agent, not just for humans: Grail + DQL gave the agent a queryable, grounded view of reality that made autonomous decisions defensible.
  • Trust is a feature. Autonomy is only adoptable if it's auditable — surfacing the why behind every action mattered as much as the action itself.
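
The edge cases named above (timeouts, partial results, idempotency) can be illustrated with a small asyncio sketch. It assumes each tool call is wrapped as a zero-argument coroutine function; `run_steps`, the step names, and the result shape are hypothetical and are not taken from PreventX or the MCP SDK.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

StepResult = Dict[str, Any]


async def run_steps(
    steps: Dict[str, Callable[[], Awaitable[Any]]],
    timeout_s: float,
    previous: Optional[Dict[str, StepResult]] = None,
) -> Dict[str, StepResult]:
    """Run steps in order, bounding each with a timeout and keeping partial results."""
    results: Dict[str, StepResult] = dict(previous or {})
    for name, step in steps.items():
        if results.get(name, {}).get("status") == "ok":
            continue  # succeeded in an earlier attempt: do not execute it again
        try:
            value = await asyncio.wait_for(step(), timeout=timeout_s)
            results[name] = {"status": "ok", "value": value}
        except asyncio.TimeoutError:
            # Record the timeout and move on, so earlier results are not lost.
            results[name] = {"status": "timeout", "value": None}
    return results
```

Passing the previous result map back in on a retry means only the timed-out steps run again, which matters when a step is a remediation action that should not be repeated.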

What's next for PreventX

  • Real remediation connectors — Kubernetes, Cloud Run, Terraform, and CI/CD rollbacks executed against live infrastructure.
  • Outcome learning — feeding resolution success/failure back into the agent so it improves which runbook it picks over time.
  • A persistent, multi-tenant store, so that history survives restarts and the system scales across teams and environments.
  • SLO-aware prioritization and a tunable cost model for accurate, per-org executive reporting.
  • ChatOps — Slack/PagerDuty approvals and "ask PreventX" natural-language incident queries.

Built with: Google ADK · Gemini 2.5 Flash · Vertex AI · Cloud Run · Cloud Scheduler · Secret Manager · Dynatrace Grail · Davis Copilot · Dynatrace MCP Server · FastAPI · Python · React

Built With

  • cloud-scheduler
  • davis-copilot
  • docker
  • dql
  • dynatrace
  • fastapi
  • gemini
  • google-adk
  • google-cloud-run
  • grail
  • jinja
  • mcp
  • python
  • react
  • secret-manager
  • server-sent-events
  • uvicorn
  • vertex-ai