SentinelOps — Project Story

The Problem

Modern engineering teams run dozens of microservices. When something breaks at 2 AM, the on-call engineer faces a wall of noise: thousands of Splunk alerts, no clear culprit, no idea which service is the root cause, and no safe way to fix it without risking a cascade. Mean time to resolution (MTTR) stretches from minutes into hours, and every minute costs real money and user trust.

Traditional monitoring tells you that something is wrong. It doesn't tell you why, and it certainly doesn't fix it.

The Idea

What if your observability stack could think?

SentinelOps is an AI-powered incident intelligence platform built on top of Splunk. It doesn't just detect anomalies — it debates root causes, maps service dependencies, forecasts SLO breaches before they happen, and generates safe, validated remediation playbooks — all autonomously.

The core insight: a single AI agent reasoning about an incident is as unreliable as a single junior engineer guessing in the dark. So we built a multi-agent consensus system: three specialist AI agents with different cognitive styles (a conservative evidence analyst, a lateral systems thinker, and a security specialist) each analyze the incident independently, then a synthesis judge weighs their hypotheses and produces a single verdict with a confidence score.

How It Was Built

Phase 1 — Anomaly Detection Ensemble

We started with the signal problem. Most monitoring tools alert on raw thresholds, which produces massive alert fatigue. We built a three-detector ensemble (ADWIN for concept drift, an IQR fence for outliers, and STL residuals for deviations that remain after trend and seasonality are removed) that raises an anomaly only when a majority of detectors, at least two of the three, agree. A noise suppressor with exponential back-off silences repeat offenders. The result: fewer alerts, higher signal.
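
A minimal sketch of the voting and suppression idea, not the project's actual code. Only the IQR fence is implemented here; the ADWIN and STL detectors are assumed to share the same `(history, value) -> bool` shape and are represented by stand-ins in the usage lines. The names `iqr_fence`, `majority_vote`, and `BackoffSuppressor`, and the default window sizes, are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import quantiles
from typing import Callable, Sequence

# Assumed detector shape: (recent history, new value) -> is it anomalous?
Detector = Callable[[Sequence[float], float], bool]


def iqr_fence(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] of the recent history."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def majority_vote(detectors: Sequence[Detector],
                  history: Sequence[float], value: float) -> bool:
    """Raise an anomaly only when more than half of the detectors agree."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes * 2 > len(detectors)


@dataclass
class BackoffSuppressor:
    """Lets an alert key through, then mutes it for a window that doubles
    each time the key fires again (capped at max_seconds)."""
    base_seconds: float = 60.0
    factor: float = 2.0
    max_seconds: float = 3600.0
    # key -> (time the key may fire again, length of the next mute window)
    state: dict = field(default_factory=dict, init=False)

    def allow(self, key: str, now: float) -> bool:
        next_allowed, window = self.state.get(key, (0.0, self.base_seconds))
        if now < next_allowed:
            return False  # still muted
        next_window = min(window * self.factor, self.max_seconds)
        self.state[key] = (now + window, next_window)
        return True


# Usage with stand-in detectors in place of ADWIN and STL (assumptions).
history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
detectors = [iqr_fence, lambda h, v: v > 40, lambda h, v: False]
is_anomaly = majority_vote(detectors, history, 50)  # two of three vote yes
```

The suppressor above never resets a key's window; a real implementation would presumably shrink it again after a quiet period, which is left out to keep the sketch short.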

Phase 2 — Consensus RCA Engine

A single LLM call for root cause analysis was brittle. We designed three Gemini 2.5 Flash agents — Alpha (conservative), Beta (lateral/systems-thinking), Gamma (security specialist) — that each independently analyze the same incident context and produce structured JSON hypotheses. A Judge agent synthesizes their outputs, resolves conflicts, and emits a final root cause with a calibrated confidence score.
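
To make the data flow concrete, here is a sketch of how the agents' structured JSON hypotheses could be parsed and combined. In SentinelOps the Judge is itself an LLM agent; the confidence-weighted tally below is only a deterministic stand-in for illustration. The field names (`agent`, `root_cause_service`, `confidence`) and the example services are assumptions, not the project's schema.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    agent: str
    root_cause_service: str
    confidence: float  # assumed to lie in [0, 1]


def parse_hypothesis(raw: str) -> Hypothesis:
    """Parse one agent's JSON output and validate the confidence range."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Hypothesis(
        agent=str(data["agent"]),
        root_cause_service=str(data["root_cause_service"]),
        confidence=confidence,
    )


def weighted_verdict(hypotheses: list[Hypothesis]) -> tuple[str, float]:
    """Return the service with the largest summed confidence, together with
    its share of the total confidence mass across all hypotheses."""
    if not hypotheses:
        raise ValueError("no hypotheses to judge")
    totals: dict[str, float] = defaultdict(float)
    for hypothesis in hypotheses:
        totals[hypothesis.root_cause_service] += hypothesis.confidence
    total_mass = sum(totals.values())
    if total_mass == 0:
        raise ValueError("all hypotheses have zero confidence")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_mass


raw_outputs = [
    '{"agent": "alpha", "root_cause_service": "payments-db", "confidence": 0.8}',
    '{"agent": "beta", "root_cause_service": "payments-db", "confidence": 0.6}',
    '{"agent": "gamma", "root_cause_service": "auth", "confidence": 0.6}',
]
verdict, share = weighted_verdict([parse_hypothesis(r) for r in raw_outputs])
```

In this example two agents agree on `payments-db`, so it wins with roughly 70% of the confidence mass. A tally like this could also serve as a sanity check on the Judge's output, though that use is a suggestion rather than something the project describes.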

Phase 3 — Topology & Blast Radius

Root cause means nothing without knowing what else is affected. The Topology Agent mines Splunk for OpenTelemetry traces, HTTP access logs, and error propagation chains to build a live service dependency graph. It exposes RED metrics (Rate, Errors, Duration) per service and calculates blast radius — how many downstream services are at risk if a given service degrades.
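
Blast radius over a dependency graph can be sketched as a breadth-first traversal. The version below uses only the standard library and assumes the graph has already been built as a mapping from each service to the services directly affected when it degrades; the mapping shape, the function name, and the service names are illustrative assumptions.

```python
from collections import deque


def blast_radius(affects: dict[str, list[str]], service: str) -> set[str]:
    """Return every service reachable from `service` by following
    'is affected by a failure of' edges, excluding the service itself."""
    at_risk: set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for neighbour in affects.get(current, ()):
            # The `at_risk` set also protects against cycles in the graph.
            if neighbour != service and neighbour not in at_risk:
                at_risk.add(neighbour)
                queue.append(neighbour)
    return at_risk


# Hypothetical topology: a database failure reaches checkout via two paths.
affects = {
    "payments-db": ["orders", "users"],
    "orders": ["checkout"],
    "users": ["checkout"],
    "checkout": [],
}
radius = blast_radius(affects, "payments-db")
print(len(radius))  # 3
```

With NetworkX, which the stack lists for the knowledge graph, the same set is what `networkx.descendants` returns on a directed graph with these edges.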

Phase 4 — Vector Memory & Knowledge Graph

SentinelOps learns from every incident. Resolved incidents are stored as embeddings (Gemini Embedding-2 + ChromaDB) for semantic retrieval. An Incident Knowledge Graph tracks co-failure relationships and PageRank-based risk scores for each service — so the system gets smarter with every incident it handles.
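
The PageRank-based risk score can be illustrated with a short power iteration. This is a pure-Python sketch rather than the project's code (the stack uses NetworkX, whose `networkx.pagerank` computes the same quantity). It assumes each co-failure is stored as a directed edge from the affected service to the service implicated as the cause, so rank accumulates on services that are blamed often; that edge direction and the service names are assumptions.

```python
def pagerank(edges: list[tuple[str, str]],
             damping: float = 0.85, iterations: int = 100) -> dict[str, float]:
    """Power-iteration PageRank over a directed edge list.
    Rank from nodes with no outgoing edges is spread evenly over all nodes."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    n = len(nodes)
    out_links: dict[str, list[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling_mass = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical co-failure history: (affected service, implicated service).
co_failures = [
    ("checkout", "payments-db"),
    ("orders", "payments-db"),
    ("users", "auth"),
]
risk = pagerank(co_failures)
riskiest = max(risk, key=risk.get)
```

Here `payments-db` ends up with the highest score because two services point at it, which is the kind of graph-aware signal a raw log-volume threshold would miss.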

Phase 5 — AI Remediation Engine

The most dangerous part: automated fixing. We built a three-stage safety model:

PlaybookGenerator — Gemini writes a structured JSON remediation plan (not raw shell commands)
PlaybookValidator — A second Gemini call reviews for blast radius, irreversibility, and data-loss risk
RollbackEngine — Every destructive action records its inverse; rollback triggers automatically if health checks fail post-action
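
The rollback stage can be sketched as an undo stack: each action is paired with its inverse, and the inverses run in reverse order if the post-action health check fails. The class and function names below (`RollbackLog`, `execute_with_rollback`) and the in-memory `state` dictionary are hypothetical; a real engine would call infrastructure APIs instead of mutating a dictionary.

```python
from typing import Callable

Step = tuple[str, Callable[[], None], Callable[[], None]]  # name, action, inverse


class RollbackLog:
    """Records the inverse of every action that completed successfully."""

    def __init__(self) -> None:
        self._undo_stack: list[tuple[str, Callable[[], None]]] = []

    def run(self, name: str, action: Callable[[], None],
            inverse: Callable[[], None]) -> None:
        action()
        # Only record the inverse once the action itself has succeeded.
        self._undo_stack.append((name, inverse))

    def rollback(self) -> list[str]:
        """Undo recorded actions in reverse order; return their names."""
        undone = []
        while self._undo_stack:
            name, inverse = self._undo_stack.pop()
            inverse()
            undone.append(name)
        return undone


def execute_with_rollback(steps: list[Step],
                          health_check: Callable[[], bool]) -> tuple[bool, list[str]]:
    """Run all steps, then roll everything back if the health check fails."""
    log = RollbackLog()
    for name, action, inverse in steps:
        log.run(name, action, inverse)
    if health_check():
        return True, []
    return False, log.rollback()


# Hypothetical playbook acting on in-memory state.
state = {"replicas": 3, "cache": "warm"}
steps = [
    ("scale_up", lambda: state.update(replicas=5), lambda: state.update(replicas=3)),
    ("flush_cache", lambda: state.update(cache="cold"), lambda: state.update(cache="warm")),
]
ok, undone = execute_with_rollback(steps, health_check=lambda: False)
```

Because the health check fails in this example, both steps are undone, last action first, and `state` returns to its original values. Actions with no true inverse (a cache flush is only approximately reversible) are exactly what a validator stage would need to flag beforehand.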

The engine has three operating modes: DRY_RUN (the default for P3/P4 incidents), AUTO (for P1/P2 incidents; requires a confidence of at least 0.8 plus validator approval), and HUMAN_APPROVAL (posts the playbook to Slack and waits for a ✅/❌ reaction before executing).
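
The mode rules can be written down as a small gate function. One detail is an assumption on our part: the text does not say what happens to a P1/P2 incident that misses the AUTO requirements, so this sketch routes it to HUMAN_APPROVAL. The function name `select_mode` and its parameters are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    DRY_RUN = "DRY_RUN"
    AUTO = "AUTO"
    HUMAN_APPROVAL = "HUMAN_APPROVAL"


AUTO_MIN_CONFIDENCE = 0.8  # threshold stated in the text


def select_mode(severity: str, confidence: float, validator_approved: bool) -> Mode:
    """Pick an operating mode from severity, RCA confidence, and validator result."""
    if severity in ("P3", "P4"):
        return Mode.DRY_RUN
    if severity in ("P1", "P2"):
        if confidence >= AUTO_MIN_CONFIDENCE and validator_approved:
            return Mode.AUTO
        # Assumption: anything short of the AUTO gate goes to a human.
        return Mode.HUMAN_APPROVAL
    raise ValueError(f"unknown severity: {severity!r}")


print(select_mode("P1", 0.92, True).value)   # AUTO
print(select_mode("P2", 0.92, False).value)  # HUMAN_APPROVAL
print(select_mode("P4", 0.99, True).value)   # DRY_RUN
```

Keeping the gate as a pure function makes the safety policy easy to unit-test separately from the code that actually executes playbooks.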

Phase 6 — SLO Forecasting & Proactive Alerts

Using time-series forecasting on Splunk metrics, SentinelOps predicts SLO breaches up to 30 minutes before they happen. A breach forecast less than 15 minutes away automatically escalates the incident to P1, regardless of log volume.
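
The forecasting model is not specified above, so the sketch below substitutes the simplest option, a least-squares linear trend, to show how a time-to-breach estimate can drive escalation. The function names, the `(minute, value)` sample format, and the error-rate numbers are assumptions; the 15-minute escalation cutoff comes from the text.

```python
from typing import Optional

IMMINENT_MINUTES = 15.0  # breaches sooner than this escalate to P1


def minutes_to_breach(samples: list[tuple[float, float]],
                      threshold: float) -> Optional[float]:
    """Fit value = intercept + slope * minute by least squares and return the
    minutes from the last sample until the trend line crosses `threshold`.
    Returns None when no breach is predicted (flat or improving trend)."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    fitted_now = intercept + slope * samples[-1][0]
    if fitted_now >= threshold:
        return 0.0  # already breaching according to the fitted trend
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope


def escalate(severity: str, eta_minutes: Optional[float]) -> str:
    """Escalate to P1 when a breach is forecast inside the imminent window."""
    if eta_minutes is not None and eta_minutes < IMMINENT_MINUTES:
        return "P1"
    return severity


# Hypothetical error-rate samples (minute, percent) against a 5% SLO limit.
samples = [(0, 1.0), (1, 1.5), (2, 2.0), (3, 2.5)]
eta = minutes_to_breach(samples, threshold=5.0)
severity = escalate("P3", eta)
```

In this example the error rate climbs 0.5 points per minute from 2.5%, so the trend reaches 5% in about five minutes and the P3 incident is escalated to P1. A production forecaster would likely need to handle seasonality and noise, which a straight line does not.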

The Stack

Layer	Technology
AI Brain	Gemini 2.5 Flash (generation) + Gemini Embedding-2 (embeddings)
Log Source	Splunk (search, HEC ingest, webhook alerts)
Vector Memory	ChromaDB 1.5.9
Knowledge Graph	In-process NetworkX / PageRank
Anomaly Detection	statsmodels STL + custom ADWIN/IQR
Backend	Python async (asyncio + FastAPI)
Notifications	Slack SDK + PagerDuty API
Infrastructure	Docker Compose (Splunk, Chroma, Postgres, Prometheus, UI)
Dashboard	AI-generated Grafana panels via DashboardBuilder

What Makes It Different

Multi-agent consensus — not one LLM guessing, three debating with a judge
Graph-aware severity — PageRank risk scores, not just log volume thresholds
Safe auto-remediation — structured playbooks, validator, rollback — never raw commands
Learns from history — semantic memory + knowledge graph grow with every incident
Proactive, not reactive — SLO breach forecasting catches problems before users feel them

What's Next

Deeper OpenTelemetry integration (traces → spans → causal chain)
Multi-cloud support (AWS CloudWatch, GCP Cloud Logging as additional ingest sources)
Fine-tuned domain-specific model for SRE reasoning
Human-in-the-loop reinforcement learning from remediation feedback

Built With

2.5
ai
api
asyncio
chromadb
embedding-2
fastapi
flash
gemini
gen
google
networkx
pagerduty
postgresql
prometheus
pydantic
python
sdk
slack
splunk
statsmodels
structlog

Updates

Himanshu Singla started this project — Jun 15, 2026 11:50 AM EDT
