Inspiration

When a critical service goes down, the story is never in one place. Payments sees API errors, Platform sees database strain, Identity sees auth spikes—and each team works from its own tools, dashboards, and mental model of the system. Meanwhile, on-call engineers stitch together clues in chat threads and war rooms while users wait.

That fragmentation is familiar across the industry: many IT teams lack full visibility across hybrid environments, and poor cross-team coordination is a common blocker to fast incident resolution. Dashboards show symptoms. They rarely deliver a clear, shared narrative of what failed first and why.

We built Incident Copilot to explore a different path: Can an agentic AI investigator pull signals from across teams, reason over real operational data, and produce one explainable timeline—while keeping humans in control of every decision?

What it does

Incident Copilot ingests a JSON alert, uses Azure OpenAI to plan an investigation, and executes SPL against live Splunk data in three indexes: app_logs, metrics, and security. It then correlates timestamps to find where the failure started, generates a root-cause analysis, and proposes a fix. No action is taken until an engineer approves it, and every step is written back to Splunk via HEC as an audit record.
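
The approval step can be pictured with a minimal Python sketch. The names here (ProposedFix, record_decision, can_execute) are hypothetical and are not taken from the project code; the sketch only illustrates the idea that a proposed fix stays inert until an engineer explicitly approves it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedFix:
    # Hypothetical shape; the real project may track more fields.
    summary: str
    decision: Decision = Decision.PENDING
    decided_by: Optional[str] = None


def record_decision(fix: ProposedFix, engineer: str, approve: bool) -> ProposedFix:
    """Record an engineer's decision; a fix can only be decided once."""
    if fix.decision is not Decision.PENDING:
        raise ValueError("fix has already been decided")
    fix.decision = Decision.APPROVED if approve else Decision.REJECTED
    fix.decided_by = engineer
    return fix


def can_execute(fix: ProposedFix) -> bool:
    """Only an explicitly approved fix may proceed to any action."""
    return fix.decision is Decision.APPROVED
```

In this sketch the default state is PENDING, so forgetting to ask for approval fails closed: can_execute returns False until record_decision has been called with approve=True.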

Our demo scenario models a real pattern: DB connection pool exhaustion (Platform) → payment-service 503s (Payments) → auth retry noise (Identity, a correlated side effect).

How we built it

Backend — Python + FastAPI

  • copilot_session.py orchestrates the agentic workflow
  • investigation_planner.py uses Azure OpenAI to order investigation steps
  • splunk_rest_client.py executes SPL via Splunk REST API (port 8089)
  • correlation_engine.py compares the time of the first metric anomaly with the time of the first app error
  • llm_client.py handles Azure OpenAI for planning, summaries, and RCA
  • hec_client.py writes audit events to incident_copilot_audit
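
The comparison done by the correlation engine can be sketched roughly as below. This is an illustrative sketch, not the contents of correlation_engine.py: the function names and the returned dictionary are assumptions, and rows are assumed to be dictionaries with an ISO-8601 _time field that includes a numeric UTC offset.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def first_timestamp(rows: Iterable[dict], field: str = "_time") -> Optional[datetime]:
    """Return the earliest timestamp found in rows, or None if there is none."""
    stamps = []
    for row in rows:
        raw = row.get(field)
        if not raw:
            continue  # skip rows without a timestamp
        ts = datetime.fromisoformat(raw)
        if ts.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            ts = ts.replace(tzinfo=timezone.utc)
        stamps.append(ts)
    return min(stamps) if stamps else None


def correlate(metric_rows: Iterable[dict], app_rows: Iterable[dict]) -> dict:
    """Report which signal appeared first and by how many seconds."""
    first_metric = first_timestamp(metric_rows)
    first_app = first_timestamp(app_rows)
    if first_metric is None or first_app is None:
        return {"origin": "unknown", "lead_seconds": None}
    lead = (first_app - first_metric).total_seconds()
    if lead > 0:
        origin = "metrics"
    elif lead < 0:
        origin = "app_logs"
    else:
        origin = "simultaneous"
    return {"origin": origin, "lead_seconds": abs(lead)}
```

A result with origin "metrics" matches the demo scenario, where database strain shows up in metrics before payment-service errors appear in app logs.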

Frontend — Next.js 15 + Tailwind

  • Investigation dashboard with step timeline, activity log, and resolution card
  • Real-time polling of investigation state from the FastAPI API

Splunk integration

  • Splunk Enterprise indexes for multi-team telemetry
  • HTTP Event Collector for sample data ingest and audit logging
  • Splunk MCP Server supported as an optional agentic path
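
An audit write through the HTTP Event Collector might look like the sketch below. It only builds the request and sends nothing. The helper names, the event fields, and the _json sourcetype are assumptions, as is the reading of incident_copilot_audit as an index name; the /services/collector/event path, the "Splunk <token>" authorization header, and port 8088 are standard HEC defaults.

```python
import json
import time
from typing import Optional


def build_audit_event(
    incident_id: str,
    step: str,
    detail: str,
    index: str = "incident_copilot_audit",  # assumed to be an index name
    now: Optional[float] = None,
) -> dict:
    """Build an HEC event envelope for one investigation step."""
    return {
        "time": now if now is not None else time.time(),  # epoch seconds
        "index": index,
        "sourcetype": "_json",
        # Hypothetical event fields, chosen for illustration only.
        "event": {"incident_id": incident_id, "step": step, "detail": detail},
    }


def build_hec_request(event: dict, host: str, token: str) -> dict:
    """Return the URL, headers, and body for an HEC POST (8088 is the default port)."""
    return {
        "url": f"https://{host}:8088/services/collector/event",
        "headers": {"Authorization": f"Splunk {token}"},
        "data": json.dumps(event),
    }
```

The returned dictionary can be passed to any HTTP client as a POST request. Keeping the builder free of network calls makes the audit format easy to unit test.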

Data

  • sample_data/generate_checkout_platform.py produces a correlated checkout outage across three indexes

Challenges we faced

  1. Splunk connectivity — MCP required KV Store setup we couldn't complete in time, so we implemented a REST API fallback (SPLUNK_USE_MCP=false) that still runs governed SPL searches.

  2. SPL correctness — Early searches returned n/a timestamps due to invalid SPL syntax (stats ... by _time span=1m) and UTC time-window mismatches. We fixed queries with bin _time and aligned alert windows to ingested event times.

  3. Grounding the LLM — Generic AI narratives are not useful during incidents. We constrained Azure OpenAI to narrate live Splunk row data and rule-based correlation timestamps, not hardcoded templates.

  4. Governed AI — We deliberately avoided auto-remediation. The human approval gate and Splunk audit trail ensure the agent assists investigation without taking unsupervised action.
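
The SPL fix from challenge 2 can be sketched in Python as follows. The helper names and window sizes are hypothetical. The query shape reflects the fix described above: span is an option of bin, not of a stats by-clause, so events are bucketed with bin _time before stats runs. The window helper assumes the search is submitted with earliest_time and latest_time given as UTC epoch seconds.

```python
from datetime import datetime, timedelta, timezone


def build_timeline_spl(index: str, span: str = "1m") -> str:
    """Build an SPL search that counts events per time bucket."""
    # Invalid form: "... | stats count by _time span=1m"
    # Valid form: bucket _time with bin first, then aggregate by _time.
    return (
        f"search index={index} "
        f"| bin _time span={span} "
        "| stats count by _time "
        "| sort 0 _time"
    )


def utc_window(alert_time: datetime, minutes_before: int = 15, minutes_after: int = 15) -> dict:
    """Return search time bounds around an alert as UTC epoch-second strings."""
    if alert_time.tzinfo is None:
        # Refuse naive datetimes: guessing the zone is how windows drift.
        raise ValueError("alert_time must be timezone-aware")
    start = alert_time.astimezone(timezone.utc) - timedelta(minutes=minutes_before)
    end = alert_time.astimezone(timezone.utc) + timedelta(minutes=minutes_after)
    return {
        "earliest_time": str(int(start.timestamp())),
        "latest_time": str(int(end.timestamp())),
    }
```

Rejecting naive datetimes is a deliberate choice in this sketch: a local-time alert silently interpreted as UTC would produce a window that misses the ingested events.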

What we learned

  • Agentic observability works best when Splunk remains the source of truth and AI is the narrator/planner—not the data source.
  • Cross-index correlation (metrics before app errors) is more valuable to SREs than another dashboard panel.
  • A governed fallback path (REST when MCP is unavailable) makes the project demoable and production-realistic.
  • Human-in-the-loop is not a limitation—it is the feature that makes AI trustworthy in incident response.

What's next

  • Full Splunk MCP Server integration when KV Store is available
  • Splunk AI Assistant for SPL to broaden queries on zero-row results
  • Saved-search and webhook alert ingestion from Splunk directly
  • Runbook actions executed post-approval via approved integrations only

Built With

  • azure-openai
  • fastapi
  • next.js
  • python
  • react
  • spl
  • splunk-enterprise
  • splunk-http-event-collector-(hec)
  • splunk-mcp-server
  • splunk-rest-api
  • tailwind-css
  • typescript