PreventX

PreventX — Autonomous AI SRE 🛡️

Stop firefighting. Start preventing. PreventX is a multi-agent AI Site Reliability Engineer that watches your stack through Dynatrace Grail, predicts failures, and self-heals them — in seconds, before a single user is impacted and before a single alert fires.

Inspiration

Every SRE knows the 3 AM page. By the time an alert fires, users are already hurting — abandoned checkouts, failed payments, churned customers. Industry surveys put unplanned downtime at up to $9,000 per minute, yet the entire monitoring industry is fundamentally reactive: it tells you something broke after it broke.

We asked a different question: what if an autonomous agent could act on the rising signal — before the threshold is crossed, before Davis even opens a Problem, before anyone gets paged? That's the leap from monitoring to prevention, and it's exactly what modern agentic AI makes possible.

What it does

PreventX runs a continuous, fully autonomous SRE loop with two modes:

🟢 Prevention — it scans Grail with natural-language-generated DQL, spots anomalies that are rising but haven't tripped an alert yet, predicts the blast radius, and runs a remediation runbook (cache flush, rolling restart, autoscale, connection drain, rollback…) before the outage materializes.
🔴 Resolution — when a real Dynatrace Problem does open, PreventX detects it, asks Davis Copilot for root cause, executes the fix, and closes the loop back in Dynatrace so the Problem moves to Closed — no human in the loop.
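
As a rough illustration of what "rising but not yet alerting" can mean in Prevention mode, the sketch below fits a straight line to recent samples of a metric and estimates how many minutes remain before it would cross the alert threshold. This is not PreventX's actual detector: the function names, the 15-minute horizon, and the choice of a linear fit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Projection:
    slope_per_min: float                   # fitted rate of change per minute
    minutes_to_threshold: Optional[float]  # None if flat, falling, or already past the threshold


def project_crossing(samples: Sequence[float], threshold: float,
                     interval_min: float = 1.0) -> Projection:
    """Least-squares line through evenly spaced samples, extrapolated to the threshold."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    latest = samples[-1]
    if slope <= 0 or latest >= threshold:
        # Not rising, or the alert would already have fired (Resolution territory).
        return Projection(slope, None)
    return Projection(slope, (threshold - latest) / slope)


def should_prevent(samples: Sequence[float], threshold: float,
                   horizon_min: float = 15.0) -> bool:
    """Hypothetical policy: act only if the projected crossing is within the horizon."""
    p = project_crossing(samples, threshold)
    return p.minutes_to_threshold is not None and p.minutes_to_threshold <= horizon_min
```

With samples of 1, 2, 3, 4 (one per minute) and a threshold of 10, the fitted slope is 1 per minute, so the projected crossing is 6 minutes away and the hypothetical policy would trigger. A real detector would likely need seasonality handling and noise filtering that a straight-line fit does not provide.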

Everything is shown on a live operator dashboard with a transparency layer: you can literally watch the agent think — every DQL query it generated, every Davis Copilot Q&A, every decision, with measured lead time, users protected, and cost avoided. In one demo cycle: 0 users impacted, $128K refund exposure avoided, 47-second autonomous MTTR.

How we built it

🤖 Multi-agent core — Google Agent Development Kit (ADK): a Gemini 2.5 Flash coordinator (with the built-in planner / thinking budget) orchestrates five specialist sub-agents — anomaly detector, topology analyzer, RUM analyzer, prevention executor, and incident resolver — each a focused expert with its own toolset.
🔌 Dynatrace MCP integration: every sub-agent is wired to the official Dynatrace MCP Server, using 18 tools across Grail, Davis, and Davis Copilot — execute_dql, generate_dql_from_natural_language, chat_with_davis_copilot, list_problems, create_dynatrace_notebook, and more.
🧠 Grail + DQL + Davis Copilot: the agent turns natural-language intent into DQL, queries the Grail data lake, and uses Davis Copilot for root-cause reasoning — grounding every decision in real telemetry, not hallucination.
☁️ Google Cloud, end to end: FastAPI + Server-Sent Events dashboard on Cloud Run, triggered every few minutes by Cloud Scheduler, with secrets in Secret Manager and Gemini served via Vertex AI. The whole stack deploys with a single PowerShell command.
🎬 Chaos harness: a built-in demo injector pushes synthetic logs, metrics, business events, and Problem-triggering events into Grail so the full prevention→resolution loop can be demonstrated live.
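
For readers unfamiliar with the Server-Sent Events transport used by the dashboard above, here is a minimal sketch of how one agent trace step could be serialized into an SSE frame. The wire format (id, event, and data fields terminated by a blank line) follows the SSE standard; the helper name, the "trace" event name, and the payload shape are assumptions for illustration, not PreventX's real schema.

```python
import json
from typing import Any, Optional


def sse_frame(data: Any, event: Optional[str] = None,
              event_id: Optional[str] = None) -> str:
    """Serialize one message as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON contains no raw newlines, but split defensively:
    # SSE requires one "data:" line per line of payload.
    payload = json.dumps(data, separators=(",", ":"))
    for chunk in payload.splitlines():
        lines.append(f"data: {chunk}")
    # A blank line marks the end of the frame.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Hypothetical trace step; the field names are illustrative only.
    print(sse_frame({"step": "execute_dql"}, event="trace", event_id="7"), end="")
```

In a FastAPI app, frames like this would typically be yielded from a generator wrapped in a StreamingResponse with media type text/event-stream, and the browser would consume them with EventSource.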

Challenges we ran into

Long-running agentic tool calls. Davis Copilot calls can take 60–90 seconds; chaining them across a five-agent graph created MCP session timeout and cleanup races, which we had to handle gracefully without losing results.
Trustworthy natural-language → DQL. Getting the agent to generate correct DQL and act only on grounded signals — not LLM guesses — required careful prompting, schema contracts, and a structured CycleReport the dashboard could parse safely (including repairing the invalid JSON escapes LLMs love to emit).
Closing the loop in Dynatrace. There's no "close problem" API for ingested custom alerts, so we engineered an event-timeout strategy to keep demo Problems alive and then auto-close them the instant the agent reports a successful resolution — real bi-directional write-back, not a simulation.
Running real LLM workloads on the cloud. Tuning Gemini concurrency and the Cloud Scheduler cadence so autonomous cycles run reliably within Vertex AI quotas.
Making the AI legible. The hardest UX problem wasn't showing data; it was showing reasoning. SREs won't trust an autonomous agent they can't audit, so transparency became a first-class feature.
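
To make the "invalid JSON escapes" problem above concrete: models often emit a backslash sequence such as \d inside a JSON string (for example in a regex), which a strict parser rejects. The sketch below shows one possible repair, doubling any backslash that does not start a valid JSON escape. It is an assumption about the general approach, not the project's actual code, and the function names are hypothetical.

```python
import json
import re

# A backslash followed by either a full \uXXXX escape or any single character.
_ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|.)')
# Characters that may legally follow a backslash in JSON (besides uXXXX).
_VALID_SINGLE = set('"\\/bfnrt')


def repair_json_escapes(text: str) -> str:
    """Double backslashes that do not begin a valid JSON escape sequence."""
    def fix(match: "re.Match[str]") -> str:
        tail = match.group(1)
        if len(tail) == 5 or tail in _VALID_SINGLE:
            return match.group(0)   # already valid, keep as-is
        return "\\\\" + tail        # e.g. \d becomes \\d

    return _ESCAPE.sub(fix, text)


def parse_cycle_report(text: str) -> dict:
    """Try a strict parse first; repair escapes only if that fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(repair_json_escapes(text))
```

Matching the backslash together with the character after it matters: an already-escaped pair such as \\d is consumed as one valid unit, so its second backslash is not doubled again by mistake.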

Accomplishments that we're proud of

A genuinely autonomous prevention→resolution loop running in production on Google Cloud — not a scripted demo.
A transparency layer that lets anyone watch the agent reason: live DQL, live Davis Copilot Q&A, and a step-by-step execution trace for every intervention.
Real bi-directional Dynatrace integration — the agent reads telemetry and writes back, closing Problems for true dashboard ↔ platform consistency.
A one-command deploy pipeline (Cloud Run + Scheduler + Secret Manager + IAM) that stands the whole system up from scratch.
Concrete, business-readable impact: outages prevented, users protected, cost avoided, and lead time — the metrics an executive actually cares about.

What we learned

Agentic design with ADK: how to decompose a hard operational problem into a coordinator + specialist sub-agents, and when thinking budgets actually pay off.
MCP goes deep: Model Context Protocol tools turn an LLM into an operator with real hands, but production reliability lives in the edge cases (timeouts, partial results, idempotency).
Observability for the agent, not just for humans: Grail + DQL gave the agent a queryable, grounded view of reality that made autonomous decisions defensible.
Trust is a feature. Autonomy is only adoptable if it's auditable — surfacing the why behind every action mattered as much as the action itself.
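
The edge cases named above (timeouts and idempotency) can be sketched with nothing but asyncio. The example bounds a slow tool call with a timeout, retries it a limited number of times, and records completed actions by key so that a retried cycle does not run the same remediation twice. Every name here, including the key format, is hypothetical, and the in-memory dictionary stands in for whatever durable store a real deployment would need.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, TypeVar

T = TypeVar("T")


async def call_with_timeout(make_call: Callable[[], Awaitable[T]],
                            timeout_s: float, retries: int = 1) -> T:
    """Await a fresh call per attempt, cancelling any attempt that exceeds timeout_s."""
    last_error: Optional[BaseException] = None
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc
    raise TimeoutError(f"tool call timed out after {retries + 1} attempts") from last_error


class IdempotentRunner:
    """Runs each keyed action at most once and replays the stored result afterwards."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    async def run(self, key: str, action: Callable[[], Awaitable[object]]) -> object:
        if key in self._results:
            return self._results[key]   # already done, do not repeat the side effect
        result = await action()
        self._results[key] = result
        return result
```

Retrying after a timeout is only safe when the underlying call is itself idempotent or read-only, because the timed-out attempt may still have taken effect on the remote side. That is one reason to pair the two helpers rather than use the retry alone.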

What's next for PreventX

Real remediation connectors — Kubernetes, Cloud Run, Terraform, and CI/CD rollbacks executed against live infrastructure.
Outcome learning — feeding resolution success/failure back into the agent so it improves which runbook it picks over time.
Persistent, multi-tenant storage, so history survives restarts and the system scales across teams and environments.
SLO-aware prioritization and a tunable cost model for accurate, per-org executive reporting.
ChatOps — Slack/PagerDuty approvals and "ask PreventX" natural-language incident queries.

Built with: Google ADK · Gemini 2.5 Flash · Vertex AI · Cloud Run · Cloud Scheduler · Secret Manager · Dynatrace Grail · Davis Copilot · Dynatrace MCP Server · FastAPI · Python · React

Built With

cloud-scheduler
davis-copilot
docker
dql
dynatrace
fastapi
gemini
google-adk
google-cloud-run
grail
jinja
mcp
python
react
secret-manager
server-sent-events
uvicorn
vertex-ai

Updates

Lamogo Junior Soro started this project — Jun 11, 2026 04:55 PM EDT
