Inspiration

Site Reliability Engineers (SREs) lose sleep manually managing complex, cascading system outages. Traditional monitoring platforms act like loud fire alarms: they scream when something breaks at 2 AM but do nothing to fix it. We built SentinelOps to shift operations from passive monitoring to active, closed-loop healing.

What it does

SentinelOps is an autonomous, self-healing SRE agent pipeline that closes the operational loop. Instead of just alerting, it watches system health, detects anomalies, diagnoses root causes, proposes remediation plans, waits for human approval, executes repairs, and verifies recovery by re-querying the affected metric.

The Memory Moat: SentinelOps stores every resolution, which cut incident response from 38 seconds to 9 seconds on the next occurrence.

How we built it

We built SentinelOps as a multi-agent pipeline on LangGraph, which manages the state transitions between agents.

Splunk Indexes
      |
      v
Splunk MCP Server  (primary spine)
      |
      v
+------------------+
|  Watcher Agent   |  SPL threshold scan
+--------+---------+
         |
         v
+----------------------+
| Diagnostician Agent  |  correlate logs + memory lookup
+----------+-----------+
           |
           v
+-------------------+
|  Proposer Agent   |  ranked remediation options
+---------+---------+
          |
          v
+-------------------+
|   Human Gate      |  Approve / Reject
+---------+---------+
          |
          v
+-------------------+
|  Verifier Agent   |  re-query metric -> RESOLVED / ESCALATE
+---------+---------+
          |
          v
+-------------------+
|  ChromaDB Memory  |  compound across sessions
+-------------------+
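
The flow in the diagram can be sketched as a small state machine. This is a plain-Python illustration under stated assumptions, not the project's LangGraph code: the `IncidentState` fields, the stage functions, the placeholder proposals, and the `HEALTHY` and `REJECTED` statuses are all hypothetical. Repair execution and the ChromaDB write are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class IncidentState:
    """Hypothetical shared state handed from stage to stage."""
    metric: str
    value: float
    threshold: float
    diagnosis: Optional[str] = None
    proposals: List[str] = field(default_factory=list)
    approved: bool = False
    status: str = "OPEN"
    trail: List[str] = field(default_factory=list)  # stages visited, in order


def watcher(state: IncidentState) -> bool:
    """Return True only when the metric crosses its threshold."""
    state.trail.append("watcher")
    return state.value > state.threshold


def diagnostician(state: IncidentState) -> None:
    state.trail.append("diagnostician")
    state.diagnosis = f"{state.metric} above {state.threshold}"


def proposer(state: IncidentState) -> None:
    state.trail.append("proposer")
    # Placeholder options; the real agent ranks these with an LLM.
    state.proposals = ["restart service", "scale out"]


def human_gate(state: IncidentState, approve: Callable[[IncidentState], bool]) -> None:
    state.trail.append("human_gate")
    state.approved = approve(state)


def verifier(state: IncidentState, requery: Callable[[str], float]) -> None:
    state.trail.append("verifier")
    state.value = requery(state.metric)
    state.status = "RESOLVED" if state.value <= state.threshold else "ESCALATE"


def run_pipeline(
    state: IncidentState,
    approve: Callable[[IncidentState], bool],
    requery: Callable[[str], float],
) -> IncidentState:
    if not watcher(state):
        state.status = "HEALTHY"
        return state
    diagnostician(state)
    proposer(state)
    human_gate(state, approve)
    if not state.approved:
        # Nothing runs past the gate without approval.
        state.status = "REJECTED"
        return state
    # The approved repair would execute here (omitted in this sketch).
    verifier(state, requery)
    return state
```

For example, calling `run_pipeline(IncidentState("cpu_pct", 97.0, 90.0), approve=lambda s: True, requery=lambda m: 41.0)` walks all five stages and ends with status `RESOLVED`, while an `approve` callback that returns `False` stops at the gate.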

Watcher Agent: Executes deterministic SPL queries via the Splunk MCP Server to flag anomaly threshold crossings. The queries use earliest=0 so that every indexed event is evaluated, regardless of its age.

Diagnostician Agent: Correlates high-fidelity logs and metrics to calculate system blast radius.

Proposer Agent: Utilizes Claude 3.5 Sonnet via external API to rank remediation strategies.

Human Gate: An ironclad trust layer built into our Streamlit frontend that halts execution until an engineer clicks "Approve".

Verifier Agent: Re-queries Splunk after the fix to confirm that the metric has recovered, then marks the incident RESOLVED or ESCALATE.

Memory Layer: A local ChromaDB vector store that acts as a compounding knowledge base.
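
The memory lookup can be illustrated with a dependency-free sketch. ChromaDB performs embedding-based vector search; the code below swaps in a bag-of-words cosine similarity from the standard library so that it runs anywhere. The `ResolutionMemory` class, its method names, and the 0.6 similarity cutoff are hypothetical and not taken from the project.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def _vectorize(text: str) -> Counter:
    """Turn a symptom description into a bag-of-words count vector."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ResolutionMemory:
    """Stand-in for the vector store: remembers fixes keyed by symptom text."""

    def __init__(self, min_similarity: float = 0.6) -> None:
        self.min_similarity = min_similarity  # assumed cutoff, tune as needed
        self._entries: List[Tuple[Counter, str]] = []

    def remember(self, symptom: str, fix: str) -> None:
        self._entries.append((_vectorize(symptom), fix))

    def recall(self, symptom: str) -> Optional[str]:
        """Return the fix for the most similar past symptom, if close enough."""
        query = _vectorize(symptom)
        best_score, best_fix = 0.0, None
        for vec, fix in self._entries:
            score = _cosine(query, vec)
            if score > best_score:
                best_score, best_fix = score, fix
        return best_fix if best_score >= self.min_similarity else None
```

A hit from `recall` lets the pipeline reuse a known fix instead of diagnosing from scratch, which is one plausible reading of how stored resolutions shorten the response on a repeat incident.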

Challenges we ran into

Securely exposing real-time infrastructure metrics from a local network to a public cloud application was a significant challenge. We resolved it with a dual-mode architecture: a local MCP bridge on port 8765 for authenticated local demo runs, and a decoupled mock data engine (SPLUNK_USE_MOCK = "true") on Streamlit Cloud so hackathon judges can stress-test the UI in their browsers.
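
A minimal sketch of how such a dual-mode switch might look, assuming the flag is readable as an environment variable. Only `SPLUNK_USE_MOCK` and port 8765 come from the text above; `MCP_BRIDGE_HOST`, the `http://` scheme, the function names, and the sample row are hypothetical, and the live branch is deliberately left unimplemented.

```python
import os
from typing import Dict, List, Mapping, Optional

# Invented sample data standing in for the mock engine's output.
MOCK_ROWS: List[Dict[str, object]] = [{"host": "checkout-1", "mem_pct": 93.5}]


def use_mock(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when SPLUNK_USE_MOCK is set to "true" (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("SPLUNK_USE_MOCK", "false").strip().lower() == "true"


def bridge_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Address of the local MCP bridge; host variable and scheme are assumed."""
    env = os.environ if env is None else env
    host = env.get("MCP_BRIDGE_HOST", "localhost")
    return f"http://{host}:8765"


def run_search(spl: str, env: Optional[Mapping[str, str]] = None) -> List[Dict[str, object]]:
    """Return mock rows in mock mode; the live call is outside this sketch."""
    if use_mock(env):
        return [dict(row) for row in MOCK_ROWS]
    raise NotImplementedError(f"would send {spl!r} to {bridge_url(env)}")
```

Keeping the switch in one function means the agents call `run_search` the same way in both modes, so the UI behaves identically whether or not a Splunk instance is reachable.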

Accomplishments that we're proud of

We are proud of building a fully closed-loop architecture that balances machine speed with enterprise control.

| Feature                  | SentinelOps          | Traditional Alert System | Generic LLM Chatbot      |
|--------------------------|----------------------|--------------------------|--------------------------|
| Closed-loop verification | ✅ Yes               | ❌ No                    | ❌ No                    |
| Compound memory          | ✅ Yes (ChromaDB)    | ❌ No                    | ❌ No                    |
| Policy-as-code gate      | ✅ Yes (policy.yaml) | ⚠️ Partial               | ❌ No                    |
| Splunk-native spine      | ✅ Yes (MCP + REST)  | ⚠️ Yes (alerts only)     | ❌ Optional / decorative |
| Emergent Cascade Finding | ✅ Yes               | ❌ No                    | ⚠️ Unreliable            |

During live testing, the agent flagged an impending downstream memory cascade risk in our checkout service before any explicit rule was triggered, which we take as evidence of genuine system reasoning rather than simple pattern matching.

What we learned

We learned that Splunk's Model Context Protocol (MCP) server is a powerful spine for agentic workflows: it turns a traditionally static indexing environment into an active, programmable operational tool.

What's next for SentinelOps

We plan to scale SentinelOps beyond single-node clusters with distributed multi-region cluster analysis, add finer-grained rules to the policy-as-code YAML engine, and integrate directly with Kubernetes operators so that zero-downtime microservice rollbacks run automatically once a human approves them.

Built With

  • chromadb
  • fastapi
  • langgraph
  • splunk-hosted-models-(foundation-ai-security-model)
  • splunk-mcp-server
  • streamlit