Incident Copilot — Splunk AI Investigator

Inspiration

When a critical service goes down, the story is never in one place. Payments sees API errors, Platform sees database strain, Identity sees auth spikes—and each team works from its own tools, dashboards, and mental model of the system. Meanwhile, on-call engineers stitch together clues in chat threads and war rooms while users wait.

That fragmentation is familiar across the industry: most IT teams still lack full visibility across hybrid environments, and poor cross-team coordination remains one of the biggest blockers to fast incident resolution. Dashboards show symptoms. They rarely deliver a clear, shared narrative of what failed first and why.

We built Incident Copilot to explore a different path: Can an agentic AI investigator pull signals from across teams, reason over real operational data, and produce one explainable timeline—while keeping humans in control of every decision?

What it does

Incident Copilot ingests a JSON alert and uses Azure OpenAI to plan an investigation. It then executes SPL against live Splunk data in the app_logs, metrics, and security indexes, correlates timestamps to find where the failure started, generates a root-cause analysis, and proposes a fix. It waits for engineer approval before taking any action, and every step is audited back to Splunk via HEC.
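The approval gate described above can be pictured as a small state machine. The following is only a sketch under assumptions: the Status values, the Investigation class, and its method names are hypothetical and are not taken from the project's code, and the in-memory audit_log stands in for the events the real system sends to Splunk via HEC.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Investigation:
    """Hypothetical container for one investigation and its audit trail."""

    alert: dict
    status: Status = Status.INVESTIGATING
    proposed_fix: Optional[str] = None
    audit_log: list = field(default_factory=list)

    def _audit(self, action: str, **details) -> None:
        # Stand-in for an audit event; the real system writes these via HEC.
        self.audit_log.append({"action": action, **details})

    def propose_fix(self, fix: str) -> None:
        # The agent may only propose; it never applies the fix itself.
        self.proposed_fix = fix
        self.status = Status.AWAITING_APPROVAL
        self._audit("fix_proposed", fix=fix)

    def decide(self, engineer: str, approve: bool) -> None:
        # A decision is only valid while a proposal is pending.
        if self.status is not Status.AWAITING_APPROVAL:
            raise RuntimeError("no proposed fix is waiting for a decision")
        self.status = Status.APPROVED if approve else Status.REJECTED
        self._audit("fix_decided", engineer=engineer, approved=approve)
```

In this sketch nothing moves past AWAITING_APPROVAL without an explicit call to decide, and both the proposal and the decision leave an audit record.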

Our demo scenario models a real pattern: DB connection pool exhaustion (Platform) → payment-service 503s (Payments) → auth retry noise (Identity, a correlated side effect).

How we built it

Backend — Python + FastAPI

copilot_session.py orchestrates the agentic workflow
investigation_planner.py uses Azure OpenAI to order investigation steps
splunk_rest_client.py executes SPL via Splunk REST API (port 8089)
correlation_engine.py compares the first metric anomaly with the first app error to find where the failure started
llm_client.py handles Azure OpenAI for planning, summaries, and RCA
hec_client.py writes audit events to incident_copilot_audit
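As an illustration of the comparison correlation_engine.py performs, here is a minimal sketch. The function names, the row shape (dicts with an ISO-8601 _time string that carries a UTC offset), and the returned fields are assumptions made for this example, not the project's actual interface.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional


def first_timestamp(rows: list[dict]) -> Optional[datetime]:
    """Return the earliest _time among the rows, or None if there are none."""
    times = [datetime.fromisoformat(r["_time"]) for r in rows if r.get("_time")]
    return min(times) if times else None


def correlate(metric_rows: list[dict], app_rows: list[dict]) -> dict:
    """Report which signal appeared first and by how many seconds."""
    first_metric = first_timestamp(metric_rows)
    first_app = first_timestamp(app_rows)
    if first_metric is None or first_app is None:
        # Without both signals there is nothing to compare.
        return {"origin": "unknown", "lead_seconds": None}
    lead = (first_app - first_metric).total_seconds()
    if lead > 0:
        origin = "metrics"      # metric anomaly preceded the app errors
    elif lead < 0:
        origin = "app_logs"     # app errors preceded the metric anomaly
    else:
        origin = "simultaneous"
    return {"origin": origin, "lead_seconds": abs(lead)}


metric_rows = [{"_time": "2024-05-01T10:00:00+00:00"}]
app_rows = [
    {"_time": "2024-05-01T10:03:00+00:00"},
    {"_time": "2024-05-01T10:02:00+00:00"},
]
result = correlate(metric_rows, app_rows)
# result == {"origin": "metrics", "lead_seconds": 120.0}
```

In the demo scenario this ordering is what points at the Platform-side database problem rather than the payment service itself. The sample timestamps are invented for the example.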

Frontend — Next.js 15 + Tailwind

Investigation dashboard with step timeline, activity log, and resolution card
Real-time polling of investigation state from the FastAPI API

Splunk integration

Splunk Enterprise indexes for multi-team telemetry
HTTP Event Collector for sample data ingest and audit logging
Splunk MCP Server supported as an optional agentic path
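To make the audit path concrete, the sketch below builds (but does not send) an HTTP Event Collector request using only the standard library. The /services/collector/event endpoint and the "Authorization: Splunk <token>" header are standard HEC conventions. The function name, the event fields, the _json sourcetype, and treating incident_copilot_audit as an index name are assumptions for illustration.

```python
import json
import time
import urllib.request


def build_audit_request(hec_url: str, token: str, step: str, detail: dict,
                        index: str = "incident_copilot_audit") -> urllib.request.Request:
    """Build a HEC POST request for one audit event (hypothetical helper)."""
    payload = {
        "time": time.time(),            # event time as epoch seconds
        "index": index,                 # assumed to be an index the token may write to
        "sourcetype": "_json",
        "event": {"step": step, "detail": detail},
    }
    return urllib.request.Request(
        url=f"{hec_url.rstrip('/')}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; sending would be urllib.request.urlopen(req).
req = build_audit_request(
    "https://splunk.example.com:8088", "YOUR-HEC-TOKEN",
    step="plan_created", detail={"steps": 3},
)
```

Keeping request construction separate from sending makes the payload easy to inspect in tests without a live Splunk instance.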

Data

sample_data/generate_checkout_platform.py produces a correlated checkout outage across three indexes

Challenges we faced

Splunk connectivity — MCP required KV Store setup we couldn't complete in time, so we implemented a REST API fallback (SPLUNK_USE_MCP=false) that still runs governed SPL searches.
SPL correctness — Early searches returned n/a timestamps due to invalid SPL syntax (stats ... by _time span=1m) and UTC time-window mismatches. We fixed queries with bin _time and aligned alert windows to ingested event times.
Grounding the LLM — Generic AI narratives are not useful during incidents. We constrained Azure OpenAI to narrate live Splunk row data and rule-based correlation timestamps, not hardcoded templates.
Governed AI — We deliberately avoided auto-remediation. The human approval gate and Splunk audit trail ensure the agent assists investigation without taking unsupervised action.
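The SPL fix above can be sketched as a small query builder. It buckets events with bin _time before stats instead of putting span inside the stats by clause, and it normalizes the alert time to UTC before computing the search window. The function name, the status=503 filter, the 15-minute default window, and the choice of ISO-8601 time strings are assumptions for this example. The search, earliest_time, latest_time, and output_mode keys follow the Splunk REST search parameters.

```python
from datetime import datetime, timedelta, timezone


def build_error_search(alert_time: datetime, window_minutes: int = 15) -> dict:
    """Build search parameters for per-minute 503 counts around an alert."""
    if alert_time.tzinfo is None:
        # Naive datetimes are ambiguous and were one source of window mismatches.
        raise ValueError("alert_time must be timezone-aware")
    alert_utc = alert_time.astimezone(timezone.utc)
    earliest = alert_utc - timedelta(minutes=window_minutes)
    latest = alert_utc + timedelta(minutes=window_minutes)
    spl = (
        "search index=app_logs status=503 "
        "| bin _time span=1m "              # bucket first ...
        "| stats count AS errors BY _time "  # ... then aggregate per bucket
        "| sort 0 _time"
    )
    return {
        "search": spl,
        "earliest_time": earliest.isoformat(timespec="milliseconds"),
        "latest_time": latest.isoformat(timespec="milliseconds"),
        "output_mode": "json",
    }


# An alert stamped in a +02:00 zone resolves to the same UTC window.
local = datetime(2024, 5, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
params = build_error_search(local)
# params["earliest_time"] == "2024-05-01T09:45:00.000+00:00"
```

The sample alert time is invented for the example.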

What we learned

Agentic observability works best when Splunk remains the source of truth and AI is the narrator/planner—not the data source.
Cross-index correlation (metrics before app errors) is more valuable to SREs than another dashboard panel.
A governed fallback path (REST when MCP is unavailable) makes the project demoable and production-realistic.
Human-in-the-loop is not a limitation—it is the feature that makes AI trustworthy in incident response.