Ops Flight Recorder — AI Incident Command for Splunk

Gallery (image captions)
Architecture
Live incident reconstruction from Splunk, end to end
Evidence timeline beside AI‑ranked root causes
AI root‑cause hypotheses, each citing Splunk evidence
Auto‑drafted postmortem, ready to edit
Blast radius and the Splunk evidence explorer
Incident header: service, severity, window, confidence
Evidence served live via the Splunk MCP Server

Inspiration

Every SRE knows the worst part of an incident isn't the fix; it's the first 30 minutes of figuring out what's even happening. When checkout starts failing, on-call engineers scramble across Splunk running one-off searches: Was there a deploy? Which service? Is the database the cause or a symptom? What's the customer impact? The answers are already in Splunk, but stitching them into a coherent story is manual, high-pressure, and inconsistent, and every minute spent on it costs revenue and erodes trust. We built Ops Flight Recorder on a simple conviction: the evidence is already there, so the reconstruction shouldn't be a human bottleneck.

What it does

Ops Flight Recorder is an agentic incident-command workspace for Splunk. Point it at an incident and it reconstructs the entire story automatically:

Gathers the evidence from Splunk — deploy markers, latency/retry/error metrics, logs, and business KPIs — through the Splunk MCP Server.
Builds an evidence-backed timeline of exactly what happened, and when.
Runs an AI agent that ranks root-cause hypotheses (each citing the precise Splunk evidence), quantifies blast radius and customer impact, recommends prioritized actions, and drafts the postmortem.

All in seconds, with Splunk as the source of truth. The model may only cite evidence IDs returned from Splunk, so it can't cite evidence that doesn't exist. Ops Flight Recorder turns the most expensive minutes of an outage into a single screen and lets any on-call engineer operate like a senior SRE.
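The evidence-ID rule can be illustrated with a small check. This is a minimal sketch and not the project's actual code: the Hypothesis type, the split_grounded helper, and the ev-00x IDs are hypothetical names chosen for illustration. It assumes a hypothesis counts as grounded only when it cites at least one ID and every cited ID was returned by Splunk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One model-proposed root cause plus the evidence IDs it cites (hypothetical type)."""
    title: str
    evidence_ids: tuple[str, ...]


def split_grounded(hypotheses, known_ids):
    """Separate hypotheses into (grounded, rejected).

    Grounded means: at least one ID is cited, and every cited ID is in
    the set of IDs that Splunk actually returned for this incident.
    """
    grounded, rejected = [], []
    for h in hypotheses:
        ok = bool(h.evidence_ids) and set(h.evidence_ids) <= known_ids
        (grounded if ok else rejected).append(h)
    return grounded, rejected


# Illustrative data only: IDs and titles are made up for the sketch.
known = {"ev-001", "ev-002", "ev-003"}
proposed = [
    Hypothesis("Bad deploy of checkout", ("ev-001", "ev-002")),
    Hypothesis("Database saturation", ("ev-999",)),  # cites an unknown ID
    Hypothesis("Network partition", ()),             # cites nothing
]
grounded, rejected = split_grounded(proposed, known)
```

In this sketch only the first hypothesis survives; the other two are set aside because one cites an ID Splunk never returned and the other cites nothing at all.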

How we built it

Backend: FastAPI with a clean SplunkAdapter boundary, so Splunk access is fully pluggable across three modes — demo (deterministic), REST search, and MCP.
Retrieval: the official Splunk MCP Server (Splunkbase app 7931) over streamable HTTP, calling its splunk_run_query tool; results normalize into one shared IncidentEvent / Evidence contract.
Reasoning: an AI layer that turns Splunk evidence into ranked hypotheses, actions, and a postmortem — provider-pluggable across Splunk hosted models and Claude, and grounded strictly in evidence IDs.
Resilience: a deterministic analysis engine as a fallback, so the workspace never breaks during a live incident — or a demo.
Frontend: an incident-command UI — investigation plan, evidence timeline, ranked hypotheses with confidence and scoring signals, blast radius, recommended actions, evidence explorer, and a postmortem draft.
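The adapter boundary described above might look roughly like the following. This is a sketch under assumptions, not the project's code: only the names SplunkAdapter and Evidence come from the write-up, while the fetch_evidence method, the field names, DemoAdapter, build_adapter, and the fixture data are hypothetical. Only the deterministic demo mode is implemented here; the REST and MCP modes are left out on purpose.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Evidence:
    """Shared evidence record; the field names are assumptions for this sketch."""
    evidence_id: str
    timestamp: float
    source: str
    summary: str


class SplunkAdapter(Protocol):
    """Boundary that the analysis and UI depend on, whatever the retrieval mode."""
    def fetch_evidence(self, service: str, earliest: float, latest: float) -> list[Evidence]: ...


class DemoAdapter:
    """Deterministic mode: serves fixed fixtures and never touches the network."""
    _FIXTURES = (
        Evidence("ev-001", 1000.0, "deploy", "checkout deployed"),
        Evidence("ev-002", 1060.0, "metrics", "latency increased"),
        Evidence("ev-003", 5000.0, "logs", "event outside the window"),
    )

    def fetch_evidence(self, service, earliest, latest):
        return [e for e in self._FIXTURES if earliest <= e.timestamp <= latest]


def build_adapter(mode: str) -> SplunkAdapter:
    """Pick an adapter by mode; 'rest' and 'mcp' are not covered by this sketch."""
    if mode == "demo":
        return DemoAdapter()
    raise NotImplementedError(f"mode {mode!r} is not part of this sketch")


def build_timeline(adapter: SplunkAdapter, service, earliest, latest):
    """Mode-agnostic consumer: sorts whatever the adapter returns by time."""
    return sorted(adapter.fetch_evidence(service, earliest, latest),
                  key=lambda e: e.timestamp)


timeline = build_timeline(build_adapter("demo"), "checkout", 900.0, 2000.0)
```

Because build_timeline only sees the SplunkAdapter interface, swapping the retrieval mode would not require changing it, which is the property the architecture relies on.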

Challenges we ran into

The Splunk MCP Server's security model and transport: minting Splunk-issued JWTs via /services/mcp_token and speaking its POST-only streamable-HTTP transport (we disable the client's session-terminate DELETE).
Data normalization: ingested events land as <epoch> {json}, so our SPL strips the epoch with rex before spath to cleanly extract evidence fields.
Keeping the AI honest: forcing every hypothesis and postmortem reference to cite a real Splunk evidence ID, which rules out fabricated evidence.
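The project does the epoch stripping in SPL with rex and spath. As an illustration of the same transformation outside Splunk, here is a hedged Python sketch that splits an `<epoch> {json}` line into its two parts. The function name, the regular expression, and the sample event are assumptions made for this example; the actual SPL is not shown in this write-up.

```python
import json
import re

# Leading epoch (integer or fractional seconds), whitespace, then a JSON object.
_EPOCH_PREFIX = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\{.*\})\s*$", re.DOTALL)


def parse_raw_event(raw: str):
    """Split '<epoch> {json}' into (epoch_seconds, fields).

    Raises ValueError when the line does not have that shape, so malformed
    events are surfaced rather than silently turned into empty evidence.
    """
    match = _EPOCH_PREFIX.match(raw)
    if match is None:
        raise ValueError("expected '<epoch> {json}'")
    epoch = float(match.group(1))
    fields = json.loads(match.group(2))
    return epoch, fields


# Sample event invented for the sketch.
epoch, fields = parse_raw_event('1718400000 {"service": "checkout", "status": 503}')
```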

Accomplishments that we're proud of

A real, end-to-end agentic loop over the official Splunk MCP Server — retrieval and reasoning — not a mockup.
Evidence-grounded AI: zero fabricated evidence; every claim is traceable back to a Splunk search.
A swappable architecture (demo → REST → MCP) that keeps the analysis and UI identical, backed by 31 passing tests and a deterministic fallback for bulletproof demos.
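The deterministic fallback can be pictured as a thin wrapper around the AI call. This is only a sketch of the general pattern: analyze_with_fallback, the two analyzer callables, and the result keys are hypothetical, and the write-up does not say which failures trigger the real fallback. Here any exception from the AI layer does.

```python
def analyze_with_fallback(evidence, ai_analyze, deterministic_analyze):
    """Try the AI analyzer first; on any error, use the deterministic engine.

    Both analyzers are assumed to take the evidence list and return a dict.
    The returned dict records which engine produced it.
    """
    try:
        result = ai_analyze(evidence)
        return {**result, "engine": "ai"}
    except Exception as exc:  # broad on purpose: the workspace must still render
        result = deterministic_analyze(evidence)
        return {**result, "engine": "deterministic", "fallback_reason": str(exc)}


def failing_ai(evidence):
    """Stand-in for a model call that fails (illustrative only)."""
    raise TimeoutError("model unavailable")


def rule_based(evidence):
    """Stand-in for the deterministic engine (illustrative only)."""
    return {"hypotheses": ["deploy preceded errors"]}


report = analyze_with_fallback(["ev-001"], failing_ai, rule_based)
```

Tagging the result with the engine that produced it lets the UI tell responders when they are looking at rule-based output instead of model output.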

What we learned

The adapter + MCP pattern is genuinely powerful: retrieval can move from REST to the MCP Server without touching the analysis or UI.
Grounding an LLM in Splunk evidence IDs is the difference between a credible incident assistant and a plausible-sounding guess.
A great deal about the Splunk MCP Server internals — its tools, token model, and transport.

What's next for Ops Flight Recorder — AI Incident Command for Splunk

Real-time detection: trigger reconstruction automatically from Splunk alerts the moment an incident starts.
Multi-incident triage: rank and correlate concurrent incidents to focus responders.
ChatOps: push the timeline, root cause, and recommended actions straight into Slack.
Natural-language investigation via the Splunk AI Assistant (saia_*) tools, plus saved-search and knowledge-object awareness.

Built With

fastapi
html
javascript
model-context-protocol
pydantic
python
splunk-enterprise
splunk-hosted-models
splunk-mcp-server
uv
uvicorn

Updates

Ankit Hemant Lade started this project — Jun 14, 2026 08:29 PM EDT
