Inspiration
The 3 a.m. pager. When production breaks, an on-call engineer juggles five dashboards (metrics, logs, deploys, traces, business impact), all under pressure, while revenue bleeds by the minute. Most "AI for ops" is a chatbot you still have to babysit. I wanted to go beyond the chatbot: an agent that actually runs the investigation end-to-end, like a seasoned incident commander, while the human stays in control.
What it does
CrisisPilot is an autonomous AI incident commander. When an incident fires, it spins up a war-room of five specialised agents that work a complex, multi-step mission in parallel:
- Metrics Agent queries live observability data straight from Dynatrace
- Deployment Agent correlates the anomaly with recent deploys
- Business Impact Agent quantifies the revenue and customers at risk
- Root Cause Agent synthesises every finding into one diagnosis
- Comms Agent drafts a remediation, proposed for human approval (never auto-executed)
Everything streams live to an operations command center: agent reasoning token-by-token, a confidence score that climbs as evidence accumulates, a reasoning graph, and an incident timeline. The human stays in command: remediations require approval before anything happens.
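The flow above (evidence agents in parallel, one synthesis step, a remediation that waits for approval) can be sketched in a few lines of asyncio. This is an illustration under stated assumptions, not CrisisPilot's actual code: `run_agent`, `combine_confidence`, `investigate`, and `approve` are hypothetical names, the findings are made-up sample data, and the noisy-OR rule for combining confidence is my assumption about one way a score could climb as evidence accumulates.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    summary: str
    confidence: float  # the agent's confidence in its own finding, 0..1


async def run_agent(name: str, summary: str, confidence: float) -> Finding:
    # Hypothetical stand-in for a real agent call (LLM reasoning plus tool queries).
    await asyncio.sleep(0)
    return Finding(name, summary, confidence)


def combine_confidence(findings: list[Finding]) -> float:
    # Assumed noisy-OR rule: each finding removes a share of the remaining doubt,
    # so the combined score can only rise as more evidence arrives.
    doubt = 1.0
    for finding in findings:
        doubt *= 1.0 - finding.confidence
    return 1.0 - doubt


async def investigate() -> dict:
    # Evidence-gathering agents run concurrently (sample data, not real output).
    findings = list(await asyncio.gather(
        run_agent("metrics", "p95 latency spiked on checkout", 0.5),
        run_agent("deployment", "a deploy landed shortly before the spike", 0.5),
        run_agent("business_impact", "checkout conversions are dropping", 0.2),
    ))
    return {
        # Root-cause step: here just a join; in the real system an LLM synthesises.
        "diagnosis": "; ".join(f.summary for f in findings),
        "confidence": round(combine_confidence(findings), 2),
        # The remediation is only proposed; nothing executes at this point.
        "remediation": {"action": "roll back latest deploy", "status": "pending_approval"},
    }


def approve(remediation: dict, approved_by: str) -> dict:
    # Human-in-command gate: only an explicit approval changes the status.
    return {**remediation, "status": "approved", "approved_by": approved_by}
```

Calling `asyncio.run(investigate())` returns a report whose remediation stays at `pending_approval` until `approve(...)` is called with a named approver.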
How I built it
- Gemini generates all agent reasoning (gemini-flash-lite-latest), streamed token-by-token.
- Google Cloud Agent Builder (ADK): each of the 5 agents is a real LlmAgent executed through ADK's Runner.
- Dynatrace MCP: agents query Dynatrace Grail via the official Dynatrace MCP server (execute_dql) over stdio, running real DQL for metrics and deployment events.
- FastAPI + WebSockets backend with an async pub/sub event bus broadcasting agent cognition live.
- Next.js 15 dashboard (App Router, Tailwind, Framer Motion, React Flow).
- MongoDB Atlas persistence (Motor) with an in-memory fallback adapter.
- Deployed: backend on Render (a Docker image bundling Python + Node so the MCP server runs in-container), frontend on Vercel.
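The async pub/sub event bus mentioned above can be illustrated with a small in-process version. This is a minimal sketch, assuming one `asyncio.Queue` per subscriber; the `EventBus` class, the topic name, and the event fields are hypothetical and not taken from the project's code.

```python
import asyncio
from collections import defaultdict


class EventBus:
    """In-process pub/sub: every subscriber gets its own queue per topic."""

    def __init__(self) -> None:
        self._subscribers: dict[str, set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[topic].add(queue)
        return queue

    def unsubscribe(self, topic: str, queue: asyncio.Queue) -> None:
        self._subscribers[topic].discard(queue)

    def publish(self, topic: str, event: dict) -> None:
        # Fan out to every current subscriber. The queues are unbounded,
        # so put_nowait never blocks the publishing agent.
        for queue in self._subscribers[topic]:
            queue.put_nowait(event)


async def demo() -> list[dict]:
    bus = EventBus()
    # In a real backend there would be one queue per connected client.
    inbox = bus.subscribe("incident-1")
    for token in ["Latency", " spike", " detected"]:
        bus.publish("incident-1", {"agent": "metrics", "type": "token", "text": token})
    received = [await inbox.get() for _ in range(3)]
    bus.unsubscribe("incident-1", inbox)
    return received
```

In a setup like the one described, each WebSocket handler would presumably subscribe on connect, forward every event from its queue to the client, and unsubscribe on disconnect, so a slow client only ever delays its own queue.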
Challenges I ran into
- Real MCP, not stubs. Wiring the Dynatrace MCP server over stdio, discovering its tools at runtime, and feeding live DQL results into agent reasoning, with graceful fallback when data is empty.
- Keeping the live-streaming UX while migrating to ADK. I routed ADK's SSE token stream into the existing WebSocket event model so the "watch it think" experience survived the migration.
- Dependency gravity. google-adk pulls newer FastAPI/Starlette/pydantic; I pinned a consistent set so MCP, ADK, and FastAPI coexist cleanly.
- Two runtimes, one container. The backend needs both Python and Node (for the npx-based MCP server), which I solved with a custom Dockerfile.
- Graceful degradation everywhere. If a credential is missing or an API is rate-limited, agents fall back to scripted-but-realistic reasoning, so the demo never hard-fails.
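The graceful-degradation idea can be sketched as a wrapper around a token stream. This is an illustration only: `UpstreamUnavailable`, `scripted_reasoning`, and `stream_with_fallback` are hypothetical names, and tagging each chunk with its source is my assumption about how a UI could keep the provenance of scripted output visible.

```python
import asyncio
from typing import AsyncIterator, Callable


class UpstreamUnavailable(Exception):
    """Hypothetical error for a missing credential or a rate-limited API."""


async def scripted_reasoning() -> AsyncIterator[str]:
    # Canned chunks used when the live model cannot be reached.
    for chunk in ["Checking recent deploys. ", "Correlating with latency. "]:
        yield chunk


async def stream_with_fallback(
    live: Callable[[], AsyncIterator[str]],
) -> AsyncIterator[tuple[str, str]]:
    # Yield (source, chunk) pairs so the consumer always knows what it is showing.
    try:
        async for chunk in live():
            yield "live", chunk
    except UpstreamUnavailable:
        # Note: if the live stream fails midway, scripted chunks follow the
        # partial live output; a real system might reset the panel first.
        async for chunk in scripted_reasoning():
            yield "scripted", chunk


async def failing_live() -> AsyncIterator[str]:
    raise UpstreamUnavailable("no API key configured")
    yield ""  # unreachable, but makes this function an async generator


async def collect(live: Callable[[], AsyncIterator[str]]) -> list[tuple[str, str]]:
    return [item async for item in stream_with_fallback(live)]
```

Running `asyncio.run(collect(failing_live))` yields only chunks tagged `scripted`, while a working live stream passes through unchanged with the `live` tag.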
Accomplishments that I'm proud of
- A genuinely autonomous, multi-agent investigation, not a chatbot.
- All three required integrations are real and live in production (/healthz shows them green): Gemini, ADK/Agent Builder, and the Dynatrace MCP server.
- A human-in-command safety model where agents do the work, humans approve the fix.
- A polished, real-time war-room UX that makes agent cognition legible at a glance.
What I learned
- MCP makes partner integrations composable: an agent names a tool and the data shows up, so adding or swapping a partner is a small change.
- ADK brings structure to multi-agent orchestration without sacrificing live streaming.
- The hard part of agentic ops isn't the model; it's trustworthy autonomy: provenance, confidence, and human oversight.
What's next
- Real remediation actions behind approval (roll back a deploy, scale a service) via additional partner MCPs.
- Learning from resolved incidents to pre-empt recurring failures.
- Slack / PagerDuty integration so the war-room comes to the responder.
- Fleet-scale, multi-tenant deployment.
Built With
- docker
- dynatrace
- dynatrace-mcp
- fastapi
- framer-motion
- gemini
- google-adk
- google-cloud-agent-builder
- model-context-protocol
- mongodb
- mongodb-atlas
- motor
- next.js
- python
- react
- render
- tailwindcss
- typescript
- uvicorn
- vercel
- websockets

