Inspiration

Every SRE knows the pain of 3 AM alerts: scrambling to check logs, metrics, and deployments just to figure out what broke. In our experience, roughly 80% of incident response is evidence gathering, and that's exactly what AI can automate. When we saw Dynatrace, Gemini, and MongoDB as sponsor technologies, the idea clicked: Dynatrace has the signals, Gemini can reason over them, and MongoDB can remember every past incident. Morpheus was born, named after the god of dreams because the goal is to let the on-call engineer sleep.

What it does

Morpheus is a fully autonomous SRE agent. It polls Dynatrace every 60 seconds, and when it detects an anomaly, it investigates automatically, pulling error logs, metrics, deployment events, and service topology via DQL. All evidence is sent to Gemini, which generates ranked root-cause hypotheses with confidence scores. MongoDB stores every incident as institutional memory, so similar future incidents are resolved faster. Based on confidence, Morpheus either auto-remediates (posting to Slack, creating GitHub issues) or escalates with all evidence pre-gathered, cutting investigation time from 45 minutes to under 3 minutes.
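
The confidence gate described above can be sketched roughly as follows. This is a minimal illustration, not the actual Morpheus code: the `Hypothesis` type, the `decide_action` function, and the 0.8 threshold are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cutoff: the write-up does not state the real threshold.
AUTO_REMEDIATE_THRESHOLD = 0.8


@dataclass
class Hypothesis:
    cause: str
    confidence: float  # expected to lie in [0.0, 1.0]


def decide_action(hypotheses: list[Hypothesis],
                  threshold: float = AUTO_REMEDIATE_THRESHOLD) -> str:
    """Return "auto_remediate" or "escalate" based on the top hypothesis."""
    if not hypotheses:
        # No hypothesis at all: hand the incident to a human.
        return "escalate"
    top = max(hypotheses, key=lambda h: h.confidence)
    return "auto_remediate" if top.confidence >= threshold else "escalate"
```

Defaulting to escalation when there is no usable hypothesis keeps the failure mode conservative: an uncertain agent pages a human instead of acting.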

How we built it

The backend is Python + FastAPI with a 7-step orchestrator pipeline. Its core stages: detect via the Problems API → enrich topology → gather logs/metrics/deployments via DQL → search MongoDB for similar incidents → send everything to Gemini for reasoning. WebSockets stream the AI's reasoning to the frontend in real time. The frontend is Next.js 14 with a canvas-animated AI orb, a force-directed topology graph, and a live reasoning stream terminal. We deployed the backend on Google Cloud Run and the frontend on Vercel, with MongoDB Atlas as the database.
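
To make the orchestrator shape concrete, here is a small sketch of a sequential async pipeline with a per-step callback, which is where a WebSocket push to the frontend could go. The step names mirror the stages listed above, but the stub bodies, the `Context` dictionary, and the `on_step` hook are assumptions for illustration; the real steps would call Dynatrace, MongoDB, and Gemini.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Context = Dict[str, Any]
Step = Callable[[Context], Awaitable[Context]]
StepHook = Callable[[str, Context], Awaitable[None]]


def _stub(name: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    async def step(ctx: Context) -> Context:
        # A real step would query an external service here.
        return {**ctx, "completed": [*ctx.get("completed", []), name]}
    step.__name__ = name
    return step


# Stage names follow the write-up; the implementations are stubs.
PIPELINE: List[Step] = [
    _stub("detect"),
    _stub("enrich_topology"),
    _stub("gather_evidence"),
    _stub("search_memory"),
    _stub("reason"),
]


async def run_pipeline(ctx: Context,
                       on_step: Optional[StepHook] = None) -> Context:
    """Run each step in order, reporting progress after every step."""
    for step in PIPELINE:
        ctx = await step(ctx)
        if on_step is not None:
            # In a FastAPI app this could forward progress over a WebSocket.
            await on_step(step.__name__, ctx)
    return ctx
```

Calling `asyncio.run(run_pipeline({"service": "checkout"}))` walks the stub stages in order and returns the accumulated context.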

Challenges we ran into

DQL's learning curve was steep: getting the right queries for log correlation and metric aggregation took many iterations. Getting Gemini to produce structured, parseable output with consistent confidence scores required 15+ prompt engineering iterations. Streaming over WebSockets in real time while the agent is mid-investigation needed careful async architecture. And MongoDB Atlas auth issues taught us to always URL-encode passwords and double-check the network access list.
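
The password lesson is easy to reproduce: characters such as `@`, `:`, and `/` in a password break a MongoDB connection string unless they are percent-encoded. A minimal sketch using the standard library follows; the `build_mongo_uri` helper and the host name are hypothetical and used only for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    # quote_plus escapes reserved characters such as "@", ":" and "/".
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"


# Example with a placeholder host and a password containing reserved characters.
uri = build_mongo_uri("morpheus", "p@ss:w/rd", "cluster0.example.mongodb.net")
```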

Accomplishments that we're proud of

Morpheus achieves true autonomy: for high-confidence incidents, it needs zero human input from detection to resolution. Our MTTR target was 5 minutes; we consistently hit under 3. The live reasoning stream lets you watch the AI think step by step, making it transparent and auditable rather than a black box. The institutional memory system improves over time: the second time a pattern appears, Morpheus recognizes it instantly.

What we learned

DQL is incredibly powerful once you get past the syntax: querying logs, metrics, and events in one language is a game-changer for automation. Prompt engineering is real engineering that requires schemas, few-shot examples, and extensive testing. Autonomous agents need guardrails (confidence thresholds and escalation paths) just as much as capabilities. And mock data fallbacks are non-negotiable for any demo that depends on external APIs.
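
As an illustration of why schemas matter, the sketch below validates a model response before anything acts on it. The JSON shape (a `hypotheses` list of objects with `cause` and `confidence`) is an assumed format for this example, not the actual schema Morpheus uses, and `parse_hypotheses` is a hypothetical helper.

```python
from __future__ import annotations

import json


def parse_hypotheses(raw: str) -> list[dict]:
    """Parse model output and return hypotheses sorted by confidence.

    Raises ValueError when the payload does not match the assumed shape.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    items = data.get("hypotheses") if isinstance(data, dict) else None
    if not isinstance(items, list) or not items:
        raise ValueError("expected a non-empty 'hypotheses' list")

    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each hypothesis must be an object")
        cause = item.get("cause")
        conf = item.get("confidence")
        if not isinstance(cause, str) or not cause:
            raise ValueError("hypothesis is missing a 'cause' string")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError("'confidence' must be a number")
        if not 0.0 <= conf <= 1.0:
            raise ValueError("'confidence' must be between 0 and 1")

    return sorted(items, key=lambda h: h["confidence"], reverse=True)
```

Rejecting malformed output with an exception gives the caller one clear place to retry the prompt or escalate instead of acting on a bad score.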

What's next for Morpheus

Next steps: deeper automated remediation (Kubernetes rollbacks, config changes via GitOps), Gemini function calling so the AI decides which queries to run, runbook ingestion for team-specific procedures, and auto-generated post-incident reviews. The long-term vision is to make 3 AM pages a thing of the past, not by silencing alerts but by resolving incidents before a human ever needs to wake up.
