Inspiration

Most enterprises run on dozens of interconnected microservices. When one fails at 3 AM, an on-call engineer is paged to piece together logs, dashboards, and tribal knowledge by hand while users churn and revenue bleeds. Gartner pegs the average cost of downtime at $5,600 per minute, yet most teams still operate reactively: MTTR routinely exceeds 45 minutes, and alert fatigue buries the signals that matter.

We asked a simple question: what if your infrastructure could heal itself? Splunk already holds the observability data needed to understand what's happening. We wanted to put an autonomous AI brain on top of that data — one that doesn't just alert humans, but reasons, decides, and acts like a team of expert SREs that never sleeps.

What it does

Splunk Nexus AI is an autonomous enterprise digital twin and crisis commander. It maintains a living replica of a 20+ service infrastructure and runs a swarm of 7 specialized AI agents that collaborate to keep it healthy:

  • 🌐 Digital Twin Engine — a live topology map of every service and dependency.
  • 📡 Predictive Failure Detection — spots anomalies ~2 hours before impact (89% accuracy).
  • 🔬 Autonomous Root Cause Analysis — correlates Splunk logs + graph topology to pinpoint the cause (94% confidence).
  • Crisis Commander — calculates blast radius (services, users, $ at risk) and coordinates the response.
  • 🔧 Autonomous Remediation — generates and executes runbooks, resolving 87.5% of incidents with no human in the loop.
  • 🛡️ Security Investigator — builds attack chains, identifies IOCs, and blocks threats in real time.
  • 📊 Executive Reporter — produces business-impact summaries (e.g., "$45K revenue protected") automatically.

The result in our synthetic demo scenarios: MTTD drops to 4.2 minutes, MTTR to 23 minutes (↓62%), and $147,500/week in downtime cost is avoided, all visible in a real-time Mission Control UI with full explainable-AI decision traces.

How we built it

  • Frontend — Next.js 15 (App Router) + TypeScript, with a 7-page Mission Control UI: dashboard, crisis war room, interactive SVG digital twin, agent swarm view, security investigations, executive reports, and a 3-minute demo mode.
  • Backend — FastAPI (Python 3.12, Pydantic v2) exposing REST, WebSocket, and SSE endpoints across 9 route groups, with an internal async event bus for real-time streaming.
  • AI Orchestration — LangGraph StateGraph implementing a supervisor-worker pattern. A Supervisor agent routes events to 7 specialist agents, each carrying shared AgentState and emitting a full decision trace for explainability.
  • Agent Tooling — Splunk MCP client (log/search access), Neo4j (service dependency graph + blast-radius queries), Qdrant (vector search for similar past incidents), and simulation tools.
  • Data Layer — Splunk (via MCP), Neo4j 5, Qdrant, and PostgreSQL 16 for LangGraph state persistence. Synthetic Splunk events, Cypher topology, and generators ship in data/ so the whole thing runs offline.
  • Infrastructure — Dockerized services with docker-compose, Kubernetes manifests, Prometheus + Grafana observability, GitHub Actions CI, and a Makefile for one-command setup.
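
To make the supervisor-worker pattern concrete, here is a minimal plain-Python sketch of the routing loop. It is a stand-in for illustration only and does not use LangGraph; the worker names, the AgentState fields, the placeholder finding, and the step cap are all assumptions rather than the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AgentState:
    """Shared state handed to every agent (field names are illustrative)."""
    event: Dict[str, str]
    findings: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # decision trace for explainability


def detector(state: AgentState) -> None:
    state.findings["anomaly"] = state.event["service"]
    state.trace.append(f"detector: anomaly on {state.event['service']}")


def root_cause(state: AgentState) -> None:
    state.findings["cause"] = "connection pool exhausted"  # hypothetical finding
    state.trace.append("root_cause: correlated logs with topology")


def remediator(state: AgentState) -> None:
    state.findings["action"] = "restart " + state.findings["anomaly"]
    state.trace.append("remediator: executed runbook")


# Registry of specialist workers the supervisor can route to.
WORKERS: Dict[str, Callable[[AgentState], None]] = {
    "detector": detector,
    "root_cause": root_cause,
    "remediator": remediator,
}


def supervisor(state: AgentState) -> Optional[str]:
    """Choose the next worker from what the state still lacks; None means done."""
    if "anomaly" not in state.findings:
        return "detector"
    if "cause" not in state.findings:
        return "root_cause"
    if "action" not in state.findings:
        return "remediator"
    return None


def run(event: Dict[str, str], max_steps: int = 10) -> AgentState:
    state = AgentState(event=event)
    for _ in range(max_steps):  # hard step cap guards against routing loops
        next_worker = supervisor(state)
        if next_worker is None:
            break
        state.trace.append(f"supervisor: route to {next_worker}")
        WORKERS[next_worker](state)
    return state


if __name__ == "__main__":
    final = run({"service": "checkout"})
    print("\n".join(final.trace))
```

Every routing decision and worker action is appended to the same trace, so the sequence of decisions can be replayed or streamed to a UI. The step cap is one simple way to keep a misrouted supervisor from looping forever.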

Challenges we ran into

  • Multi-agent coordination — getting 7 agents to collaborate without looping or stepping on each other required a disciplined supervisor routing model and a carefully designed shared state schema.
  • Explainability — autonomous remediation is only trustworthy if every decision is auditable, so we built a decision-trace pipeline streamed live to the UI for full transparency.
  • Real-time streaming at scale — coordinating SSE, WebSocket, and an async event bus so the frontend reflects agent activity instantly without polling storms.
  • Demo determinism — making a complex autonomous system tell a clean, repeatable 3-minute story meant orchestrating phased, deterministic scenarios on top of an otherwise non-deterministic agent swarm.
  • Blast-radius accuracy — modeling service dependencies in Neo4j so impact calculations (users affected, revenue at risk) were realistic.
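
As a rough illustration of the blast-radius idea, the sketch below walks a dependency graph upstream from a failed service and sums the revenue exposed. It uses an in-memory dict instead of Neo4j, and the topology, the revenue figures, and the function names are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: service -> services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
    "postgres": [],
}

# Hypothetical revenue attributed to each service, in dollars per minute.
REVENUE_PER_MIN: Dict[str, int] = {
    "web": 100,
    "checkout": 400,
    "search": 50,
    "payments": 300,
    "inventory": 0,
    "postgres": 0,
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failed service plus everything that transitively depends on it."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents: Dict[str, List[str]] = {svc: [] for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    # Breadth-first search over the inverted graph.
    seen: Set[str] = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen


def revenue_at_risk(failed: str) -> int:
    """Naive estimate: sum the per-minute revenue of every impacted service."""
    impacted = blast_radius(failed, DEPENDS_ON)
    return sum(REVENUE_PER_MIN.get(svc, 0) for svc in impacted)


if __name__ == "__main__":
    print(sorted(blast_radius("payments", DEPENDS_ON)))
    print(revenue_at_risk("payments"))
```

Summing per-service figures is the simplest possible impact model. A realistic one would need to account for redundancy, partial degradation, and users shared across services, which is where most of the accuracy work lies.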

Accomplishments that we're proud of

  • A genuinely autonomous loop: detect → diagnose → decide → remediate → report, with humans optional rather than required.
  • 7 cooperating agents with explainable reasoning — not a single chatbot.
  • A polished, real-time Mission Control UI that makes complex AI legible to both engineers and executives.
  • Production-grade engineering: Docker, Kubernetes, CI, tests, and complete documentation — not just a demo script.
  • A self-contained, offline-runnable experience powered by synthetic data and generators.

What we learned

  • Supervisor-worker > monolith — decomposing the problem into specialist agents made the system more accurate, debuggable, and extensible.
  • Trust is a feature — autonomy without explainability is unusable in the enterprise; the decision trace turned out to be as important as the actions.
  • Graphs + vectors + logs are complementary — Neo4j for structure, Qdrant for similarity, and Splunk for ground truth together beat any single approach.
  • Streaming architecture matters — an event bus decoupling producers from consumers was essential to keeping the UI responsive under agent load.
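
The decoupling described in the last point can be sketched with a small asyncio publish/subscribe bus. This is an assumption-laden illustration, not the project's event bus: the EventBus class, the topic name, and the drop-on-full policy are all hypothetical choices.

```python
import asyncio
from typing import Dict, List


class EventBus:
    """In-process pub/sub: each subscriber gets its own bounded queue."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[asyncio.Queue]] = {}

    def subscribe(self, topic: str, maxsize: int = 100) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._subscribers.setdefault(topic, []).append(queue)
        return queue

    def publish(self, topic: str, event: dict) -> int:
        """Fan out without blocking; return how many subscribers received the event."""
        delivered = 0
        for queue in self._subscribers.get(topic, []):
            try:
                queue.put_nowait(event)
                delivered += 1
            except asyncio.QueueFull:
                # Drop the event for a slow consumer so producers never stall.
                pass
        return delivered


async def demo() -> List[dict]:
    bus = EventBus()
    ui_feed = bus.subscribe("agent.activity")  # e.g. backing an SSE or WebSocket stream

    # Agents publish without knowing who, if anyone, is listening.
    bus.publish("agent.activity", {"agent": "detector", "msg": "anomaly found"})
    bus.publish("agent.activity", {"agent": "remediator", "msg": "runbook executed"})

    received: List[dict] = []
    while not ui_feed.empty():
        received.append(await ui_feed.get())
    return received


if __name__ == "__main__":
    for event in asyncio.run(demo()):
        print(event)
```

Giving each consumer a bounded queue means a stalled browser connection can only lose its own events; it cannot slow down the agents or other subscribers. Whether to drop, block, or coalesce when a queue fills is a design decision that depends on how much the UI can tolerate missing.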

What's next for Splunk Nexus AI

  • Phase 2 — Live Splunk MCP integration against real customer data (beyond mock mode).
  • Phase 3 — Federated multi-tenant deployment for MSPs.
  • Phase 4 — Autonomous infrastructure optimization for cost + performance, not just incident response.
  • Phase 5 — Compliance automation (SOC 2, ISO 27001) built on the existing audit trail.
