PreventX — Autonomous AI SRE 🛡️
Stop firefighting. Start preventing. PreventX is a multi-agent AI Site Reliability Engineer that watches your stack through Dynatrace Grail, predicts failures, and self-heals them — in seconds, before a single user is impacted and before a single alert fires.
Inspiration
Every SRE knows the 3 AM page. By the time an alert fires, users are already hurting — abandoned checkouts, failed payments, churned customers. Industry surveys put unplanned downtime at up to $9,000 per minute, yet the entire monitoring industry is fundamentally reactive: it tells you something broke after it broke.
We asked a different question: what if an autonomous agent could act on the rising signal — before the threshold is crossed, before Davis even opens a Problem, before anyone gets paged? That's the leap from monitoring to prevention, and it's exactly what modern agentic AI makes possible.
What it does
PreventX runs a continuous, fully autonomous SRE loop with two modes:
- 🟢 Prevention — it scans Grail with natural-language-generated DQL, spots anomalies that are rising but haven't tripped an alert yet, predicts the blast radius, and runs a remediation runbook (cache flush, rolling restart, autoscale, connection drain, rollback…) before the outage materializes.
- 🔴 Resolution — when a real Dynatrace Problem does open, PreventX detects it, asks Davis Copilot for root cause, executes the fix, and closes the loop back in Dynatrace so the Problem moves to Closed — no human in the loop.
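The "rising but not yet alerting" idea in Prevention mode can be illustrated with a simple trend projection. This is a minimal sketch, not PreventX's actual detection logic, which the write-up does not spell out. It assumes evenly spaced metric samples and a fixed alert threshold; the function names and the 15-minute lead window are hypothetical.

```python
from statistics import mean


def minutes_to_threshold(samples, threshold, interval_min=1.0):
    """Estimate minutes until a rising signal crosses `threshold`.

    Fits a least-squares line to evenly spaced samples. Returns None when
    there are too few samples, the threshold is already crossed, or the
    trend is flat or falling.
    """
    n = len(samples)
    if n < 2 or samples[-1] >= threshold:
        return None
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(samples)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    # Remaining headroom divided by growth per sample, scaled to minutes.
    return (threshold - samples[-1]) / slope * interval_min


def should_prevent(samples, threshold, lead_window_min=15.0):
    """True when the projected crossing falls inside the lead window (assumed value)."""
    eta = minutes_to_threshold(samples, threshold)
    return eta is not None and eta <= lead_window_min
```

A signal of `[1.0, 2.0, 3.0, 4.0]` against a threshold of `8` projects a crossing in 4 minutes, so a prevention runbook would be considered; against a threshold of `100` the projected crossing is too far out to act on.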
Everything is shown on a live operator dashboard with a transparency layer: you can literally watch the agent think — every DQL query it generated, every Davis Copilot Q&A, every decision, with measured lead time, users protected, and cost avoided. In one demo cycle: 0 users impacted, $128K refund exposure avoided, 47-second autonomous MTTR.
How we built it
- 🤖 Multi-agent core — Google Agent Development Kit (ADK): a Gemini 2.5 Flash coordinator (with the built-in planner / thinking budget) orchestrates five specialist sub-agents — anomaly detector, topology analyzer, RUM analyzer, prevention executor, and incident resolver — each a focused expert with its own toolset.
- 🔌 Dynatrace MCP integration: every sub-agent is wired to the official Dynatrace MCP Server, using 18 tools across Grail, Davis, and Davis Copilot, including execute_dql, generate_dql_from_natural_language, chat_with_davis_copilot, list_problems, and create_dynatrace_notebook.
- 🧠 Grail + DQL + Davis Copilot: the agent turns natural-language intent into DQL, queries the Grail data lake, and uses Davis Copilot for root-cause reasoning, grounding every decision in real telemetry, not hallucination.
- ☁️ Google Cloud, end to end: FastAPI + Server-Sent-Events dashboard on Cloud Run, triggered every few minutes by Cloud Scheduler, secrets in Secret Manager, Gemini served via Vertex AI — deployable with a single PowerShell command.
- 🎬 Chaos harness: a built-in demo injector pushes synthetic logs, metrics, business events, and Problem-triggering events into Grail so the full prevention→resolution loop can be demonstrated live.
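For the Server-Sent-Events dashboard, each step of the agent's reasoning has to be serialized into the SSE wire format before it is streamed. The sketch below shows one way to do that using only the standard library. The event names and payload fields are hypothetical, and the FastAPI streaming endpoint that would send these frames is omitted.

```python
import json


def sse_frame(event, payload, event_id=None):
    """Format one Server-Sent Events frame.

    The payload is serialized as compact JSON. json.dumps escapes newlines
    inside strings, so the result always fits on a single `data:` line.
    A blank line terminates the frame, as the SSE format requires.
    """
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append("data: " + json.dumps(payload, separators=(",", ":")))
    return "\n".join(lines) + "\n\n"
```

A dashboard client could then subscribe with `EventSource` and render each `event` type (for example a generated DQL query or a Davis Copilot answer) in its own panel.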
Challenges we ran into
- Long-running agentic tool calls. Davis Copilot calls can take 60–90s; chaining them across a 5-agent graph created MCP session timeout/cleanup races we had to make resilient without losing results.
- Trustworthy natural-language → DQL. Getting the agent to generate correct DQL and act only on grounded signals, not LLM guesses, required careful prompting, schema contracts, and a structured CycleReport the dashboard could parse safely (including repairing the invalid JSON escapes LLMs love to emit).
- Closing the loop in Dynatrace. There's no "close problem" API for ingested custom alerts, so we engineered an event-timeout strategy to keep demo Problems alive and then auto-close them the instant the agent reports a successful resolution: real bi-directional write-back, not a simulation.
- Running real LLM workloads on the cloud. Tuning Gemini concurrency and the Cloud Scheduler cadence so autonomous cycles run reliably within Vertex AI quotas.
- Making the AI legible. The hardest UX problem wasn't showing data; it was showing reasoning. SREs won't trust an autonomous agent they can't audit, so transparency became a first-class feature.
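The invalid-JSON-escape problem mentioned above tends to show up when a model emits regex-like text such as `5\d\d` inside a JSON string. One possible repair strategy, sketched here, is to double any backslash that does not start a legal JSON escape and then parse again. The function names and the sample fields are hypothetical; the actual CycleReport schema is not shown in this write-up.

```python
import json
import re

# A backslash followed by either a full \uXXXX escape or any single character.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|.)", re.DOTALL)
_VALID_SINGLE = set('"\\/bfnrt')


def repair_invalid_escapes(text):
    """Double backslashes that do not begin a legal JSON escape sequence."""

    def fix(match):
        token = match.group(1)
        if len(token) == 5 or token in _VALID_SINGLE:
            return match.group(0)  # already a legal escape, keep as-is
        return "\\" + match.group(0)  # turn \d into \\d

    return _ESCAPE.sub(fix, text)


def parse_llm_json(text):
    """Parse JSON, retrying once with repaired escapes if the first attempt fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(repair_invalid_escapes(text))
```

Because the pattern consumes each backslash together with the character after it, an already-escaped `\\` pair is left alone and the character following it is never misread as an escape.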
Accomplishments that we're proud of
- A genuinely autonomous prevention→resolution loop running in production on Google Cloud — not a scripted demo.
- A transparency layer that lets anyone watch the agent reason: live DQL, live Davis Copilot Q&A, and a step-by-step execution trace for every intervention.
- Real bi-directional Dynatrace integration — the agent reads telemetry and writes back, closing Problems for true dashboard ↔ platform consistency.
- A one-command deploy pipeline (Cloud Run + Scheduler + Secret Manager + IAM) that stands the whole system up from scratch.
- Concrete, business-readable impact: outages prevented, users protected, cost avoided, and lead time — the metrics an executive actually cares about.
What we learned
- Agentic design with ADK: how to decompose a hard operational problem into a coordinator + specialist sub-agents, and when thinking budgets actually pay off.
- MCP goes deep: model-context-protocol tools turn an LLM into an operator with real hands — but production reliability lives in the edge cases (timeouts, partial results, idempotency).
- Observability for the agent, not just for humans: Grail + DQL gave the agent a queryable, grounded view of reality that made autonomous decisions defensible.
- Trust is a feature. Autonomy is only adoptable if it's auditable — surfacing the why behind every action mattered as much as the action itself.
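The edge cases named above (timeouts, partial results, idempotency) can be handled with a thin wrapper around each long-running tool call. The following is a minimal sketch using only `asyncio`, not the project's actual MCP client code. The function name, the in-memory cache, and the default timeout are assumptions chosen for illustration.

```python
import asyncio

# Idempotency key -> result of a call that has already succeeded (in-memory, assumed).
_completed = {}


async def call_tool(key, make_call, timeout_s=120.0, attempts=2):
    """Run a slow tool call with a deadline, a bounded retry, and idempotency.

    `make_call` is a zero-argument function returning a fresh coroutine.
    A timeout is reported as a status instead of an exception, so results
    from earlier steps in the same cycle are not lost.
    """
    if key in _completed:
        # Never re-run an action that already succeeded under this key.
        return {"status": "ok", "result": _completed[key], "attempts": 0}
    for attempt in range(1, attempts + 1):
        try:
            result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # deadline hit, try again if attempts remain
        _completed[key] = result
        return {"status": "ok", "result": result, "attempts": attempt}
    return {"status": "timeout", "result": None, "attempts": attempts}
```

Keying the cache on something like the Problem ID plus the runbook name would keep a retried cycle from executing the same remediation twice; a persistent store would be needed for that guarantee to survive restarts.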
What's next for PreventX
- Real remediation connectors — Kubernetes, Cloud Run, Terraform, and CI/CD rollbacks executed against live infrastructure.
- Outcome learning — feeding resolution success/failure back into the agent so it improves which runbook it picks over time.
- Persistent, multi-tenant store so history survives restarts and scales across teams and environments.
- SLO-aware prioritization and a tunable cost model for accurate, per-org executive reporting.
- ChatOps — Slack/PagerDuty approvals and "ask PreventX" natural-language incident queries.
Built with: Google ADK · Gemini 2.5 Flash · Vertex AI · Cloud Run · Cloud Scheduler · Secret Manager · Dynatrace Grail · Davis Copilot · Dynatrace MCP Server · FastAPI · Python · React