Inspiration

Every SRE knows the worst part of an incident isn't the fix — it's the first 30 minutes of figuring out what's even happening. When checkout starts failing, on-call engineers scramble across Splunk running one-off searches: Was there a deploy? Which service? Is the database the cause or a symptom? What's the customer impact? The answers are already in Splunk — but stitching them into a coherent story is manual, high-pressure, and inconsistent, and every minute of that is lost revenue and eroded trust. We built Ops Flight Recorder on a simple conviction: the evidence is already there, so the reconstruction shouldn't be a human bottleneck.

What it does

Ops Flight Recorder is an agentic incident-command workspace for Splunk. Point it at an incident and it reconstructs the entire story automatically:

  • Gathers the evidence from Splunk — deploy markers, latency/retry/error metrics, logs, and business KPIs — through the Splunk MCP Server.
  • Builds an evidence-backed timeline of exactly what happened, and when.
  • Runs an AI agent that ranks root-cause hypotheses (each citing the precise Splunk evidence), quantifies blast radius and customer impact, recommends prioritized actions, and drafts the postmortem.

All in seconds, with Splunk as the source of truth. The model may only cite evidence IDs returned from Splunk, so every claim it makes can be traced to a real search result. It turns the most expensive minutes of an outage into a single screen, and lets any on-call engineer operate like a senior SRE.
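
The citation rule described above can be sketched as a small post-processing check. This is a minimal illustration under stated assumptions, not the project's actual code: the Hypothesis shape, the split_by_grounding helper, and the ev-* IDs are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """A model-proposed root cause plus the evidence IDs it cites (hypothetical shape)."""
    summary: str
    evidence_ids: Tuple[str, ...]


def split_by_grounding(
    hypotheses: Iterable[Hypothesis], known_ids: Set[str]
) -> Tuple[List[Hypothesis], List[Hypothesis]]:
    """Separate hypotheses whose citations all exist in the Splunk results from the rest."""
    grounded: List[Hypothesis] = []
    rejected: List[Hypothesis] = []
    for h in hypotheses:
        # A hypothesis must cite at least one ID, and every cited ID must be known.
        if h.evidence_ids and set(h.evidence_ids) <= known_ids:
            grounded.append(h)
        else:
            rejected.append(h)
    return grounded, rejected


# IDs that the retrieval step actually returned (made-up values).
known = {"ev-001", "ev-002"}
candidates = [
    Hypothesis("Deploy introduced retry storm", ("ev-001", "ev-002")),
    Hypothesis("Database disk full", ("ev-999",)),  # cites an ID Splunk never returned
    Hypothesis("Network partition", ()),            # cites nothing at all
]
grounded, rejected = split_by_grounding(candidates, known)
```

Only the first candidate ends up in grounded. A check like this verifies that citations exist; it does not verify that the cited evidence actually supports the hypothesis.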

How we built it

  • Backend: FastAPI with a clean SplunkAdapter boundary, so Splunk access is fully pluggable across three modes — demo (deterministic), REST search, and MCP.
  • Retrieval: the official Splunk MCP Server (Splunkbase app 7931) over streamable HTTP, calling its splunk_run_query tool; results normalize into one shared IncidentEvent / Evidence contract.
  • Reasoning: an AI layer that turns Splunk evidence into ranked hypotheses, actions, and a postmortem — provider-pluggable across Splunk hosted models and Claude, and grounded strictly in evidence IDs.
  • Resilience: a deterministic analysis engine as a fallback, so the workspace keeps producing results when the AI layer is unavailable, whether during a live incident or a demo.
  • Frontend: an incident-command UI — investigation plan, evidence timeline, ranked hypotheses with confidence and scoring signals, blast radius, recommended actions, evidence explorer, and a postmortem draft.
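
The adapter boundary in the Backend bullet might look roughly like the sketch below. It assumes a single fetch_events method and plain dict events; the real SplunkAdapter interface and the IncidentEvent / Evidence contract are not shown in this write-up, so every method name, field name, and value here is illustrative.

```python
from typing import Any, Dict, List, Protocol


class SplunkAdapter(Protocol):
    """Anything that can return normalized events for an incident (assumed interface)."""

    def fetch_events(self, incident_id: str) -> List[Dict[str, Any]]: ...


class DemoAdapter:
    """Deterministic fixture data, standing in for the demo mode."""

    def fetch_events(self, incident_id: str) -> List[Dict[str, Any]]:
        # Deliberately out of order, so the sorting below is visible.
        return [
            {"id": "ev-002", "ts": 1700000060, "kind": "metric",
             "detail": "checkout p99 latency up"},
            {"id": "ev-001", "ts": 1700000000, "kind": "deploy",
             "detail": "checkout-service deployed"},
        ]


def build_timeline(adapter: SplunkAdapter, incident_id: str) -> List[Dict[str, Any]]:
    """Analysis code depends only on the adapter interface, never on a concrete mode."""
    events = adapter.fetch_events(incident_id)
    return sorted(events, key=lambda e: e["ts"])


timeline = build_timeline(DemoAdapter(), "INC-1")
```

A REST-backed or MCP-backed adapter would implement the same method, so build_timeline and the UI would not need to change when the mode is swapped.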

Challenges we ran into

  • The Splunk MCP Server's security model: minting Splunk-issued JWT tokens via /services/mcp_token and speaking its POST-only streamable-HTTP transport (we disable the client's session-terminate DELETE).
  • Data normalization: ingested events land as <epoch> {json}, so our SPL strips the epoch with rex before spath to cleanly extract evidence fields.
  • Keeping the AI honest: forcing every hypothesis and postmortem reference to cite a real Splunk evidence ID, so the model cannot introduce evidence that Splunk did not return.
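
To make the normalization challenge concrete, here is the same strip-the-epoch-then-parse-JSON idea expressed in Python rather than SPL. It is only an analogue of the rex + spath step, assuming the raw event is a numeric epoch, whitespace, and a single JSON object; parse_raw_event and the sample field names are hypothetical.

```python
import json
import re
from typing import Any, Dict, Tuple

# Leading epoch (integer or fractional seconds), whitespace, then one JSON object.
_EPOCH_PREFIX = re.compile(
    r"^\s*(?P<epoch>\d+(?:\.\d+)?)\s+(?P<body>\{.*\})\s*$", re.DOTALL
)


def parse_raw_event(raw: str) -> Tuple[float, Dict[str, Any]]:
    """Split '<epoch> {json}' into a timestamp and the decoded JSON fields."""
    match = _EPOCH_PREFIX.match(raw)
    if match is None:
        raise ValueError("event does not look like '<epoch> {json}'")
    fields = json.loads(match.group("body"))
    return float(match.group("epoch")), fields


ts, fields = parse_raw_event('1700000000 {"service": "checkout", "status": 500}')
```

Here ts is the epoch as a float and fields is the decoded dictionary. Events that do not match the expected shape raise ValueError instead of being silently dropped.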

Accomplishments that we're proud of

  • A real, end-to-end agentic loop over the official Splunk MCP Server — retrieval and reasoning — not a mockup.
  • Evidence-grounded AI: zero fabricated evidence; every claim is traceable back to a Splunk search.
  • A swappable architecture (demo → REST → MCP) that keeps the analysis and UI identical, backed by 31 passing tests and a deterministic fallback for bulletproof demos.

What we learned

  • The adapter + MCP pattern is genuinely powerful: retrieval can move from REST to the MCP Server without touching the analysis or UI.
  • Grounding an LLM in Splunk evidence IDs is the difference between a credible incident assistant and a plausible-sounding guess.
  • A great deal about the Splunk MCP Server internals — its tools, token model, and transport.

What's next for Ops Flight Recorder — AI Incident Command for Splunk

  • Real-time detection: trigger reconstruction automatically from Splunk alerts the moment an incident starts.
  • Multi-incident triage: rank and correlate concurrent incidents to focus responders.
  • ChatOps: push the timeline, root cause, and recommended actions straight into Slack.
  • Natural-language investigation via the Splunk AI Assistant (saia_*) tools, plus saved-search and knowledge-object awareness.

Built With

  • fastapi
  • html
  • javascript
  • model-context-protocol
  • pydantic
  • python
  • splunk-enterprise
  • splunk-hosted-models
  • splunk-mcp-server
  • uv
  • uvicorn