Inspiration

On matchday, a stadium's digital nervous system (ticket scanning, payments, the fan app, venue networks) has to survive extreme, bursty load. When something degrades, the ops team is flooded with correlated alerts and cannot isolate the cause fast enough; by the time they do, fans are stranded at the gates. We wanted an agent that identifies root cause through causal analysis rather than guesswork, and that acts only under human control.

What it does

StadiumPulse Marshal is an AIOps console for tournament SRE and venue IT-ops teams. It pulls open problems and their causal root-cause analysis from Dynatrace Davis AI via the Dynatrace MCP server, correlates them against the live fixture timeline, and proposes remediation that a human approves with one click. It routes alerts to PagerDuty/OpsGenie, dispatches fixes through Cloud Workflows / Ansible AWX, and ships a full enterprise surface: runbooks, escalation policies, postmortems, an SLO catalog, a multi-venue fleet command center, a Slack ChatOps bridge, cost analytics, and a Davis AI feedback loop.
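As an illustration of what correlating a problem against the fixture timeline could look like, here is a minimal sketch that labels a problem's start time with a matchday phase. The `Fixture` type, the `matchday_phase` function, the phase names, and the window lengths are all assumptions made for this example, not the project's actual model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Fixture:
    """Hypothetical fixture record: one match at one venue."""
    venue: str
    kickoff: datetime
    # Assumed total match length including breaks.
    duration: timedelta = timedelta(minutes=120)


def matchday_phase(
    problem_start: datetime,
    venue: str,
    fixtures: List[Fixture],
    ingress_window: timedelta = timedelta(hours=2),
    egress_window: timedelta = timedelta(hours=1),
) -> Optional[str]:
    """Return the matchday phase a problem started in, or None if it falls outside any fixture."""
    for fixture in fixtures:
        if fixture.venue != venue:
            continue
        end = fixture.kickoff + fixture.duration
        if fixture.kickoff - ingress_window <= problem_start < fixture.kickoff:
            return "ingress"   # gates open, ticket scanning at peak
        if fixture.kickoff <= problem_start < end:
            return "in_play"   # concessions and fan-app traffic dominate
        if end <= problem_start < end + egress_window:
            return "egress"    # crowd leaving, transport and payments spike
    return None
```

A problem tagged `ingress` at a given venue is a stronger candidate for urgent escalation than the same problem hours before gates open, which is the kind of context the timeline adds on top of the root cause itself.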

How we built it

The backend is FastAPI/Python and the frontend is React/Vite/TypeScript. Davis AI's causal root-cause analysis (not just correlation) reaches the agent through the Dynatrace MCP server; Google Cloud Agent Builder and Gemini drive the ops agent; everything runs on Cloud Run. Persistence is SQLAlchemy + Alembic; alerting integrates PagerDuty Events v2 and OpsGenie; remediation dispatches to Cloud Workflows, Ansible AWX, or generic webhooks. Slack ChatOps is secured by verifying Slack's v0 request signatures (HMAC-SHA256). Deploy hardening uses a non-root container, Secret Manager bindings, and Workload Identity Federation.
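For readers unfamiliar with Slack's v0 signing scheme, the sketch below shows the general check: Slack sends a timestamp and a signature in the `X-Slack-Request-Timestamp` and `X-Slack-Signature` headers, and the receiver recomputes an HMAC-SHA256 over `v0:<timestamp>:<raw body>` with the app's signing secret. The function name, its parameters, and the five-minute skew limit are choices made for this example; it is not the project's own code.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
    max_skew_seconds: int = 300,
) -> bool:
    """Return True if the signature matches the raw request body and the timestamp is fresh."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale or far-future requests to limit replay attacks.
    if abs(current - sent_at) > max_skew_seconds:
        return False
    # The signed base string is "v0:<timestamp>:<raw body>".
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

In a FastAPI route this check has to run against the raw request bytes, before any form or JSON parsing, because re-serialized bodies will not reproduce the signed bytes.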

Challenges we ran into

Making remediation safe meant a strict human-in-the-loop gate on every action. Converting the nine enterprise modules from in-memory stores to SQL-backed write-through persistence, and proving durability with live process-restart tests, took real care. Alembic autogenerate on SQLite emitted spurious operations we had to strip by hand, and the async migration plumbing needed an async-aware Alembic environment script (env.py). Schedulers had to record real next-wake timestamps rather than estimates.
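One way to make the human-in-the-loop gate hard to bypass is to put the check in the dispatch path itself rather than in the UI. The sketch below is a minimal illustration under that assumption; `RemediationProposal`, `Status`, `ApprovalRequired`, and `dispatch` are hypothetical names, and the `send` callable stands in for whatever backend (Cloud Workflows, Ansible AWX, or a webhook) actually runs the fix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    DISPATCHED = "dispatched"


class ApprovalRequired(Exception):
    """Raised when a dispatch is attempted without a recorded human approval."""


@dataclass
class RemediationProposal:
    action: str
    target: str
    status: Status = Status.PROPOSED
    approved_by: Optional[str] = None

    def approve(self, operator: str) -> None:
        # Only a fresh proposal can be approved; record who approved it.
        if self.status is not Status.PROPOSED:
            raise ValueError(f"cannot approve a proposal in state {self.status.value}")
        self.status = Status.APPROVED
        self.approved_by = operator


def dispatch(proposal: RemediationProposal,
             send: Callable[[RemediationProposal], None]) -> None:
    """Run the remediation only if a human has approved it; one approval covers one dispatch."""
    if proposal.status is not Status.APPROVED:
        raise ApprovalRequired(f"{proposal.action} on {proposal.target} is not approved")
    send(proposal)
    proposal.status = Status.DISPATCHED
```

Because the status moves to `DISPATCHED` after a successful send, a second call raises `ApprovalRequired` again, so a retry needs a new proposal and a new approval.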

Accomplishments that we're proud of

Causal root-cause surfaced through MCP and fused with the fixture timeline; nine enterprise modules, each built as model → service → RBAC routes → tests; durable SQL persistence verified across restarts; CI with Playwright smoke tests; and a verify-live harness covering every environment-dependent Definition-of-Done check, each guarded to skip cleanly when prerequisites are absent.

What we learned

Causal localization beats correlation under matchday alert noise — it's what makes agentic remediation trustworthy. And the discipline of keeping a human in the loop is what turns an "auto-remediation" demo into something an ops team would actually run.

What's next for StadiumPulse Marshal

Live Dynatrace tenant integration, broader remediation playbooks, and tighter multi-venue fleet coordination for a full tournament.
