SentinelOps — Project Story
The Problem
Modern engineering teams run dozens of microservices. When something breaks at 2 AM, the on-call engineer faces a wall of noise: thousands of Splunk alerts, no indication of which service is the root cause, and no safe way to fix it without risking a cascade. Mean-time-to-resolution (MTTR) stretches from minutes into hours, and every minute costs real money and user trust.
Traditional monitoring tells you that something is wrong. It doesn't tell you why, and it certainly doesn't fix it.
The Idea
What if your observability stack could think?
SentinelOps is an AI-powered incident intelligence platform built on top of Splunk. It doesn't just detect anomalies — it debates root causes, maps service dependencies, forecasts SLO breaches before they happen, and generates safe, validated remediation playbooks — all autonomously.
The core insight: a single AI agent reasoning about an incident is as unreliable as a single junior engineer guessing in the dark. So we built a multi-agent consensus system. Three specialist AI agents with different cognitive styles (a conservative evidence analyst, a lateral systems thinker, and a security specialist) each analyze the root cause independently; a synthesis judge then weighs their hypotheses and produces a single verdict with a confidence score.
How It Was Built
Phase 1 — Anomaly Detection Ensemble
We started with the signal problem. Most monitoring tools alert on raw thresholds — this produces massive alert fatigue. We built a 3-detector ensemble (ADWIN for concept drift, IQR Fence for outliers, STL Residual for seasonality removal) that votes by majority before raising an anomaly. A noise suppressor with exponential back-off silences repeat offenders. The result: fewer alerts, higher signal.
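The voting and suppression logic can be sketched as follows. This is a minimal illustration and not the project's code: only the IQR fence is written out, `zscore_stand_in` is a placeholder for the ADWIN and STL detectors, and the names `majority_vote` and `BackoffSuppressor`, along with the 60-second base window, are assumptions.

```python
import statistics
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence, Tuple

# A detector looks at recent history plus the newest value and votes True/False.
Detector = Callable[[Sequence[float], float], bool]


def iqr_fence(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Vote 'anomaly' when value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def zscore_stand_in(history: Sequence[float], value: float, limit: float = 3.0) -> bool:
    """Placeholder detector (assumption); the real ensemble uses ADWIN and STL."""
    spread = statistics.pstdev(history)
    return spread > 0 and abs(value - statistics.mean(history)) / spread > limit


def majority_vote(detectors: Sequence[Detector], history: Sequence[float], value: float) -> bool:
    """Raise an anomaly only when more than half of the detectors agree."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes * 2 > len(detectors)


@dataclass
class BackoffSuppressor:
    """Silence repeat alerts for a key; the quiet window doubles on each alert."""
    base_seconds: float = 60.0
    max_seconds: float = 3600.0
    _state: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def should_alert(self, key: str, now: float) -> bool:
        next_allowed, repeats = self._state.get(key, (0.0, 0))
        if now < next_allowed:
            return False  # still inside the quiet window
        window = min(self.base_seconds * 2 ** repeats, self.max_seconds)
        self._state[key] = (now + window, repeats + 1)
        return True
```

In this sketch the repeat counter never resets; a real suppressor would likely decay it after a service stays quiet for some time.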
Phase 2 — Consensus RCA Engine
A single LLM call for root cause analysis was brittle. We designed three Gemini 2.5 Flash agents — Alpha (conservative), Beta (lateral/systems-thinking), Gamma (security specialist) — that each independently analyze the same incident context and produce structured JSON hypotheses. A Judge agent synthesizes their outputs, resolves conflicts, and emits a final root cause with a calibrated confidence score.
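To make the data flow concrete, here is a rough sketch of how three structured hypotheses might be combined. The real Judge is an LLM agent; the confidence-weighted tally below is a deterministic stand-in, and the JSON field names (`agent`, `root_cause`, `confidence`) are assumptions about the hypothesis schema.

```python
import json
from collections import defaultdict
from typing import Dict, List


def synthesize(raw_hypotheses: List[str]) -> Dict[str, object]:
    """Pick the root cause with the most total confidence behind it.

    Stand-in for the Judge agent: each hypothesis is a JSON string with
    hypothetical fields 'agent', 'root_cause', and 'confidence'.
    """
    totals: Dict[str, float] = defaultdict(float)
    for raw in raw_hypotheses:
        hypothesis = json.loads(raw)
        totals[hypothesis["root_cause"]] += float(hypothesis["confidence"])

    winner = max(totals, key=totals.get)
    # Average support across all agents, so a lone dissenter lowers the score.
    support = totals[winner] / len(raw_hypotheses)
    return {"root_cause": winner, "confidence": round(support, 3)}


hypotheses = [
    json.dumps({"agent": "alpha", "root_cause": "db-pool-exhaustion", "confidence": 0.7}),
    json.dumps({"agent": "beta", "root_cause": "db-pool-exhaustion", "confidence": 0.6}),
    json.dumps({"agent": "gamma", "root_cause": "credential-rotation", "confidence": 0.5}),
]
verdict = synthesize(hypotheses)
```

The score produced here is a simple average, not a calibrated probability; calibration would need to be checked against resolved incidents.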
Phase 3 — Topology & Blast Radius
Root cause means nothing without knowing what else is affected. The Topology Agent mines Splunk for OpenTelemetry traces, HTTP access logs, and error propagation chains to build a live service dependency graph. It exposes RED metrics (Rate, Errors, Duration) per service and calculates blast radius — how many downstream services are at risk if a given service degrades.
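A blast-radius calculation of this kind can be sketched as a breadth-first walk over the dependency graph. The sketch assumes the graph is a plain mapping from each caller to the services it calls, and that "at risk" means every service that transitively calls the degraded one; the service names are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(calls: Dict[str, List[str]], service: str) -> Set[str]:
    """Return every service that directly or transitively calls `service`."""
    # Invert the call graph: callee -> set of callers that depend on it.
    dependents: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)

    at_risk: Set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != service and dependent not in at_risk:
                at_risk.add(dependent)
                queue.append(dependent)
    return at_risk


# Hypothetical topology: gateway -> checkout -> payments/inventory -> db
calls = {
    "gateway": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
}
affected = blast_radius(calls, "db")
```

The size of the returned set gives the count of services at risk; a degraded `db` here puts four services at risk, while a degraded `search` affects only `gateway`.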
Phase 4 — Vector Memory & Knowledge Graph
SentinelOps learns from every incident. Resolved incidents are stored as embeddings (Gemini Embedding-2 + ChromaDB) for semantic retrieval. An Incident Knowledge Graph tracks co-failure relationships and PageRank-based risk scores for each service — so the system gets smarter with every incident it handles.
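The risk-scoring idea can be illustrated with a small PageRank written in plain Python. The project lists NetworkX for this; the power iteration below is a dependency-free stand-in, and the edge meaning (an edge `(a, b)` records that a past failure in `a` was attributed to `b`) is an assumption about how the knowledge graph is modeled.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        outgoing[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not outgoing[node])
        updated = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in outgoing[src]:
                updated[dst] += damping * rank[src] / len(outgoing[src])
        rank = updated
    return rank


# Hypothetical incident history: three services have had failures traced to db.
edges = [
    ("checkout", "db"),
    ("payments", "db"),
    ("inventory", "db"),
    ("checkout", "payments"),
]
risk = pagerank(edges)
riskiest = max(risk, key=risk.get)
```

Under this edge convention, services that many past failures point at accumulate the highest score, which is what makes the score usable as a severity input.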
Phase 5 — AI Remediation Engine
The most dangerous part: automated fixing. We built a three-stage safety model:
- PlaybookGenerator — Gemini writes a structured JSON remediation plan (not raw shell commands)
- PlaybookValidator — A second Gemini call reviews for blast radius, irreversibility, and data-loss risk
- RollbackEngine — Every destructive action records its inverse; rollback triggers automatically if health checks fail post-action
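The rollback stage can be pictured as an undo stack that is unwound when a health check fails. This is a simplified sketch under assumptions: the `RollbackLog` class and `execute_with_rollback` function are hypothetical names, and actions are modeled as plain callables rather than the structured playbook steps the project uses.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


class RollbackLog:
    """Records the inverse of each action so steps can be undone in reverse."""

    def __init__(self) -> None:
        self._undo: List[Action] = []

    def run(self, action: Action, inverse: Action) -> None:
        action()
        self._undo.append(inverse)

    def rollback(self) -> None:
        # Undo the most recent action first.
        while self._undo:
            self._undo.pop()()


def execute_with_rollback(steps: List[Tuple[Action, Action]],
                          health_check: Callable[[], bool]) -> bool:
    """Run (action, inverse) pairs; undo everything if a health check fails."""
    log = RollbackLog()
    for action, inverse in steps:
        log.run(action, inverse)
        if not health_check():
            log.rollback()
            return False
    return True


# Hypothetical step: scale a deployment from 3 to 5 replicas.
state = {"replicas": 3}
steps = [(lambda: state.update(replicas=5), lambda: state.update(replicas=3))]
succeeded = execute_with_rollback(steps, health_check=lambda: False)
```

Because the health check fails in this example, the scale-up is reverted and `state["replicas"]` is back at 3.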
The engine runs in one of three operating modes: DRY_RUN (default for P3/P4), AUTO (for P1/P2; requires ≥0.8 confidence plus validator approval), and HUMAN_APPROVAL (posts to Slack and waits for a ✅/❌ reaction before executing).
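The mode gating might be expressed as a small decision function. The thresholds come from the description above; the fallback to HUMAN_APPROVAL when a P1/P2 incident does not qualify for AUTO is an assumption, as are the `Mode` enum and `select_mode` names.

```python
from enum import Enum


class Mode(Enum):
    DRY_RUN = "DRY_RUN"
    AUTO = "AUTO"
    HUMAN_APPROVAL = "HUMAN_APPROVAL"


def select_mode(priority: str, confidence: float, validator_approved: bool) -> Mode:
    """Choose how a playbook is executed for an incident."""
    if priority in ("P3", "P4"):
        return Mode.DRY_RUN  # low priority: only simulate by default
    if confidence >= 0.8 and validator_approved:
        return Mode.AUTO
    # Assumption: anything else on P1/P2 waits for a human in Slack.
    return Mode.HUMAN_APPROVAL
```

For example, `select_mode("P1", 0.92, True)` yields `Mode.AUTO`, while the same incident with a validator rejection falls back to `Mode.HUMAN_APPROVAL`.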
Phase 6 — SLO Forecasting & Proactive Alerts
Using time-series forecasting on Splunk metrics, SentinelOps predicts SLO breaches up to 30 minutes before they happen. An imminent breach (<15 minutes) automatically escalates any incident to P1 — regardless of log volume.
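As a rough illustration of the escalation rule, the sketch below fits a straight line to recent samples and estimates when it crosses the SLO threshold. The forecasting model actually used is not specified here, so the least-squares trend is a stand-in; the function names and the assumption that a higher metric value is worse (for example, an error rate) are also illustrative.

```python
from typing import List, Optional, Tuple


def minutes_to_breach(samples: List[Tuple[float, float]],
                      threshold: float) -> Optional[float]:
    """Estimate minutes from the last sample until the trend hits threshold.

    samples: (minute, value) pairs, oldest first. Returns None when the
    trend is flat or improving, or when there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


def escalate(priority: str, eta_minutes: Optional[float],
             imminent_minutes: float = 15.0) -> str:
    """Force P1 when a breach is forecast within the imminent window."""
    if eta_minutes is not None and eta_minutes < imminent_minutes:
        return "P1"
    return priority


# Hypothetical error-rate samples rising by 0.02 per minute toward a 0.30 limit.
samples = [(0, 0.10), (1, 0.12), (2, 0.14), (3, 0.16)]
eta = minutes_to_breach(samples, threshold=0.30)
new_priority = escalate("P3", eta)
```

With these samples the fitted line reaches the threshold about 7 minutes after the last reading, so the incident is escalated from P3 to P1.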
The Stack
| Layer | Technology |
|---|---|
| AI Brain | Gemini 2.5 Flash (generation), Gemini Embedding-2 (embeddings) |
| Log Source | Splunk (search, HEC ingest, webhook alerts) |
| Vector Memory | ChromaDB 1.5.9 |
| Knowledge Graph | In-process NetworkX / PageRank |
| Anomaly Detection | statsmodels STL + custom ADWIN/IQR |
| Backend | Python async (asyncio + FastAPI) |
| Notifications | Slack SDK + PagerDuty API |
| Infrastructure | Docker Compose (Splunk, Chroma, Postgres, Prometheus, UI) |
| Dashboard | AI-generated Grafana panels via DashboardBuilder |
What Makes It Different
- Multi-agent consensus — not one LLM guessing, three debating with a judge
- Graph-aware severity — PageRank risk scores, not just log volume thresholds
- Safe auto-remediation — structured playbooks, validator, rollback — never raw commands
- Learns from history — semantic memory + knowledge graph grow with every incident
- Proactive, not reactive — SLO breach forecasting catches problems before users feel them
What's Next
- Deeper OpenTelemetry integration (traces → spans → causal chain)
- Multi-cloud support (AWS CloudWatch, GCP Cloud Logging as additional ingest sources)
- Fine-tuned domain-specific model for SRE reasoning
- Human-in-the-loop reinforcement learning from remediation feedback