Splunk Agentic Ops Incident Copilot

Inspiration

Modern operations teams are overwhelmed by telemetry. A single infrastructure issue can produce many logs, alerts, and symptoms across services, which makes it hard to separate noise from the signals that matter.

This project explores a more evidence-driven approach to incident response. Instead of surfacing more raw alerts, it uses Splunk, MCP, FastAPI, Python, and Codex-assisted investigation to help teams move from detection to diagnosis and remediation faster while keeping humans in control of critical decisions.

What It Does

Splunk Agentic Ops Incident Copilot is an AgenticOps platform that turns operational telemetry into correlated, evidence-backed incidents with root cause analysis and human-approved remediation.

The workflow is:

  1. Python generates realistic telemetry and controlled incident bursts.
  2. Splunk ingests the JSONL logs and computes statistical baselines.
  3. Noise reduction filters routine events and highlights signals.
  4. Correlation groups related signals into incident candidates.
  5. Splunk alerts or dashboard actions can trigger a webhook into FastAPI.
  6. Python queries Splunk MCP for evidence and metadata.
  7. Codex can run a second-pass RCA when available.
  8. The app writes investigation, forecast, timeline, and remediation results back into logs.
  9. Splunk re-ingests those logs so the dashboard reflects the full incident lifecycle.

Key capabilities:

  • Statistical noise reduction before incident generation
  • Signal correlation and candidate creation
  • MCP-powered evidence collection from Splunk
  • AI-assisted RCA with deterministic fallback
  • Human-in-the-loop remediation approval
  • Incident lifecycle management
  • Splunk write-back of investigation and remediation outcomes
  • Operational dashboards for visibility and governance

How I Built It

The repo is built around a few clear layers:

  • app/telemetry.py generates normal traffic plus anomalies such as latency regressions, database timeouts, auth failures, CPU saturation, memory pressure, and deployment regressions.
  • app/main.py exposes the FastAPI endpoints for incident creation, webhook triage, investigation, approval, execution, closeout, and dashboard rendering.
  • app/splunk_mcp_client.py connects to the Splunk MCP server and calls tools such as splunk_get_info, splunk_get_indexes, splunk_get_metadata, and splunk_run_query.
  • app/llm_agent.py builds a structured RCA prompt for Codex and normalizes the returned JSON output.
  • app/decision_engine.py provides deterministic RCA and safe remediation guidance when AI is unavailable or should not be trusted alone.
  • app/storage.py writes incident, correlation, remediation, timeline, triage, forecast, and MCP metrics records to JSONL logs.
  • splunk/dashboard_simple.xml and splunk/dashboard_panels.spl define the dashboard experience and the reusable SPL panels.

Architecture

+---------------------------------------------------------------------------------+
|                         DATA GENERATION & INGESTION                             |
|  [telemetry.py] ---> (data/app.log) ---> [Splunk File Monitor] ---> [Splunk]    |
+---------------------------------------------------------------------------------+
                                                                       |
                                                                       v
+---------------------------------------------------------------------------------+
|                         SPLUNK ANALYTICS & VISUALIZATION                       |
|   [Statistical Baselines] -> [Noise Reduction] -> [Correlation & Candidate]     |
+---------------------------------------------------------------------------------+
                                                                       |
         +------------------ (Webhook / Dashboard Action) -------------+
         |
         v
+---------------------------------------------------------------------------------+
|                          FASTAPI ORCHESTRATION & AGENTIC PLANE                  |
|                      +---> [splunk_mcp_client.py] ---> (Splunk MCP Server)      |
|  [main.py] (FastAPI) |                                                          |
|                      +---> [decision_engine.py] (Deterministic RCA Fallback)    |
|                      +---> [llm_agent.py] ------> (Codex LLM Engine)            |
+---------------------------------------------------------------------------------+
         |
         v
+---------------------------------------------------------------------------------+
|                            LOG STORAGE & WRITEBACK                              |
|  (data/incidents.json) & [Splunk HEC Writeback] ---> [Splunk Dashboard Refresh] |
+---------------------------------------------------------------------------------+

The Agentic Investigation & RCA Architecture

                 [ FastAPI Webhook Triggered ]
                               |
                               v
               [ Step 1: Splunk MCP Discovery ]
         Query metadata, indexes, and surrounding context
                               |
                               v
               +---------------+---------------+
               |                               |
               v                               v
    [ Track A: Codex Agent ]      [ Track B: Deterministic Engine ]
    Build structured JSON prompt   Evaluate hardcoded rules & bounds
               |                               |
               +---------------+---------------+
                               |
                               v
               [ Step 3: Synthesis & Merge ]
         Fallback applied if Codex fails or deviates

State Machine & Human-In-The-Loop (HITL) Workflow

[ Open ] ---> [ Triage / Investigating ] ---> [ Pending Approval ]
                                                       |
                        +------------------------------+
                        |                              |
                        v                              v
               [ Executing Action ]             [ Rejected / Closed ]
                        |
                        v
                 [ Post-Closeout ]
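
The lifecycle in the diagram can be expressed as a small transition table. The following is a minimal sketch, not the code in app/main.py; the State enum, the TRANSITIONS table, and the advance helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class State(str, Enum):
    # State names mirror the boxes in the diagram; the identifiers are assumptions.
    OPEN = "open"
    INVESTIGATING = "investigating"
    PENDING_APPROVAL = "pending_approval"
    EXECUTING = "executing"
    REJECTED = "rejected"
    CLOSED_OUT = "closed_out"


# Allowed transitions, read off the diagram. Terminal states map to an empty set.
TRANSITIONS = {
    State.OPEN: {State.INVESTIGATING},
    State.INVESTIGATING: {State.PENDING_APPROVAL},
    State.PENDING_APPROVAL: {State.EXECUTING, State.REJECTED},
    State.EXECUTING: {State.CLOSED_OUT},
    State.REJECTED: set(),
    State.CLOSED_OUT: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state, or raise if the diagram does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table explicit means an incident cannot reach the executing state without first passing through pending approval, which is the human-in-the-loop gate.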

Data Flow Between Components

Python to Splunk

Python writes telemetry and workflow records to disk. Splunk file inputs monitor those JSONL files and parse them into searchable events.
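
As a rough illustration of this hand-off, the sketch below appends one JSON object per line to a log file that a Splunk file monitor could watch. The append_jsonl helper and the record fields are assumptions made for the example, not the actual functions in app/storage.py.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line, adding a UTC timestamp."""
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps({"timestamp": datetime.now(timezone.utc).isoformat(), **record})
    # Append mode keeps earlier events intact so the file monitor only sees new lines.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical telemetry event; field names are illustrative.
    append_jsonl(Path("data/app.log"), {"service": "checkout", "latency_ms": 120})
```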

Splunk to Python

Splunk can hand off to FastAPI in two ways:

  • Dashboard action links call the API endpoints directly.
  • A saved correlation alert can POST a webhook payload to /webhook/splunk-alert.

The webhook handler accepts both top-level fields and nested result fields, including:

  • search_name, alert_name, or name
  • host
  • service
  • incident_id or sid
  • severity
  • trigger_time, triggered_time, or time
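
One way to handle those aliases is to map each canonical field to its accepted names and look in the top-level payload before the nested result object. This is a sketch under assumptions: the ALIASES table, the helper names, and the top-level-first precedence are illustrative and may differ from the real handler.

```python
# Canonical field -> accepted aliases, taken from the list above.
ALIASES = {
    "alert_name": ("search_name", "alert_name", "name"),
    "host": ("host",),
    "service": ("service",),
    "incident_id": ("incident_id", "sid"),
    "severity": ("severity",),
    "trigger_time": ("trigger_time", "triggered_time", "time"),
}


def first_present(sources, keys):
    """Return the first non-empty value found for any key, in source order."""
    for source in sources:
        for key in keys:
            value = source.get(key)
            if value not in (None, ""):
                return value
    return None


def normalize_alert(payload: dict) -> dict:
    """Flatten a webhook payload into canonical fields (missing ones become None)."""
    result = payload.get("result")
    nested = result if isinstance(result, dict) else {}
    # Assumed precedence: top-level fields win over nested result fields.
    return {field: first_present((payload, nested), keys) for field, keys in ALIASES.items()}
```

For example, a payload with search_name and sid at the top level and host inside result yields a single flat record that the rest of the triage flow can consume.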

MCP and Codex

The app uses Splunk MCP as the evidence layer. The MCP client is responsible for:

  • startup verification with splunk_get_info
  • index discovery with splunk_get_indexes
  • metadata discovery with splunk_get_metadata
  • evidence collection with splunk_run_query
  • MCP tool metrics writeback into data/mcp_metrics.log
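
The metrics writeback can be pictured as a thin wrapper around each tool call. In the sketch below the tool argument is a plain Python callable standing in for an MCP tool invocation; it is not the Splunk MCP client API, and call_with_metrics and the record fields are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def call_with_metrics(tool_name: str, tool: Callable[..., Any], metrics_path: Path, **arguments: Any) -> Any:
    """Run a tool callable and append its name, status, and duration as one JSON line."""
    started = time.perf_counter()
    status = "ok"
    try:
        return tool(**arguments)
    except Exception:
        status = "error"
        raise
    finally:
        # The metrics line is written whether the call succeeded or raised.
        record = {
            "tool": tool_name,
            "status": status,
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
            "ts": time.time(),
        }
        metrics_path.parent.mkdir(parents=True, exist_ok=True)
        with metrics_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
```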

The Codex RCA agent:

  • builds a structured JSON prompt from incident context, evidence, and raw events
  • asks Codex to return only JSON
  • normalizes severity, confidence, root cause, and action recommendations
  • falls back to deterministic RCA if the CLI is unavailable or the output is invalid
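
The normalize-or-fall-back step might look like the following sketch. The field names (root_cause, severity, confidence, actions), the severity vocabulary, and the normalize_rca function are assumptions for illustration; the real schema in app/llm_agent.py may differ.

```python
import json

# Assumed severity vocabulary; not confirmed by the project.
SEVERITIES = {"low", "medium", "high", "critical"}


def normalize_rca(raw_output: str, fallback: dict) -> dict:
    """Parse model output as JSON; return the deterministic fallback if it is unusable."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return {**fallback, "source": "deterministic"}
    if not isinstance(data, dict) or not data.get("root_cause"):
        return {**fallback, "source": "deterministic"}

    severity = str(data.get("severity", "")).lower()
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    actions = data.get("actions")
    return {
        "root_cause": str(data["root_cause"]),
        # Unknown severities defer to the deterministic result.
        "severity": severity if severity in SEVERITIES else fallback.get("severity", "medium"),
        # Clamp confidence into [0, 1].
        "confidence": min(max(confidence, 0.0), 1.0),
        "actions": [str(item) for item in actions] if isinstance(actions, list) else [],
        "source": "codex",
    }
```

Tagging each result with its source keeps it visible downstream whether an RCA came from Codex or from the deterministic engine.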

Statistical Analysis and Correlation

Splunk computes latency baselines and z-scores to separate noise from meaningful signals.
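
In the project this calculation runs in Splunk. The Python sketch below only illustrates the arithmetic behind a z-score check; the threshold of 3.0 and the function names are assumptions, not values taken from the project's SPL.

```python
from statistics import mean, pstdev


def z_score(value: float, baseline: list[float]) -> float:
    """Distance of a value from the baseline mean, in standard deviations."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A flat baseline has no spread, so no deviation can be measured.
        return 0.0
    return (value - mu) / sigma


def is_signal(value: float, baseline: list[float], threshold: float = 3.0) -> bool:
    """Treat a value as a signal when it is at least `threshold` deviations away."""
    return abs(z_score(value, baseline)) >= threshold
```

With a latency baseline of 100, 110, 90, 105, and 95 ms, a 104 ms request stays well under the threshold while a 150 ms request is flagged.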

Correlation groups events into 5-minute windows and scores them using:

  • signal count
  • unique signal types
  • host diversity
  • endpoint diversity
  • latency spikes and z-scores
  • CPU and memory pressure
  • repeated failures on the same incident id

This produces a candidate incident severity that appears in the dashboard and can trigger the webhook flow.
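
The windowing and scoring idea can be sketched in Python as follows. The weights, the severity cut-offs, and the signal fields (time, type, host, z_score) are made up for the example and cover only some of the factors listed above; the real scoring lives in the project's SPL.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute windows, as described above


def window_start(epoch_seconds: float) -> int:
    """Floor a timestamp to the start of its 5-minute window."""
    return int(epoch_seconds // WINDOW_SECONDS) * WINDOW_SECONDS


def score_windows(signals: list[dict]) -> dict[int, dict]:
    """Group signals by window and score each group with illustrative weights."""
    buckets = defaultdict(list)
    for signal in signals:
        buckets[window_start(signal["time"])].append(signal)

    scored = {}
    for start, group in buckets.items():
        score = (
            len(group)                                  # signal count
            + 2 * len({s["type"] for s in group})       # unique signal types
            + 2 * len({s["host"] for s in group})       # host diversity
            + sum(1 for s in group if s.get("z_score", 0.0) >= 3.0)  # latency spikes
        )
        # Hypothetical cut-offs for the candidate severity.
        severity = "high" if score >= 12 else "medium" if score >= 6 else "low"
        scored[start] = {"score": score, "severity": severity, "signals": len(group)}
    return scored
```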

Alert Triggering Through Webhook

When a Splunk alert fires, the webhook path hydrates or creates the incident record, gathers surrounding evidence through Splunk MCP, runs deterministic RCA, optionally enhances the result with Codex, and writes the result back to the local logs and, when configured, to Splunk via HEC.

HEC Writeback

The Splunk HTTP Event Collector (HEC) is used to write the AI triage summary back into Splunk as a dedicated event stream.

If SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN are configured, the app posts a triage event containing:

  • incident metadata
  • severity
  • confidence score
  • AI summary
  • root cause summary
  • MCP evidence summary
  • alert payload
  • writeback status

If HEC is unavailable or fails, triage still completes and the dashboard still reflects the incident.
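
A minimal sketch of this optional writeback, using only the standard library, is shown below. It assumes SPLUNK_HEC_URL holds the full HEC event endpoint URL and uses a hypothetical sourcetype; the function names and status strings are illustrative rather than the project's own.

```python
import json
import os
import urllib.error
import urllib.request


def build_hec_request(url: str, token: str, triage: dict) -> urllib.request.Request:
    """Build a HEC POST with the triage record as the event body."""
    # "incident:triage" is a hypothetical sourcetype chosen for this example.
    body = json.dumps({"event": triage, "sourcetype": "incident:triage"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Splunk {token}", "Content-Type": "application/json"},
        method="POST",
    )


def write_back(triage: dict, timeout: float = 5.0) -> str:
    """Return a writeback status without ever raising, so triage can still complete."""
    url = os.environ.get("SPLUNK_HEC_URL")
    token = os.environ.get("SPLUNK_HEC_TOKEN")
    if not url or not token:
        return "skipped"
    try:
        with urllib.request.urlopen(build_hec_request(url, token, triage), timeout=timeout) as response:
            return "ok" if response.status == 200 else "failed"
    except (urllib.error.URLError, OSError, ValueError):
        # Network errors, HTTP errors, and malformed URLs all degrade to a status string.
        return "failed"
```

Returning a status string instead of raising lets the caller record the writeback outcome in the local logs and carry on.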

Dashboard Features in Splunk

The dashboard is backed by splunk/dashboard_simple.xml and splunk/dashboard_panels.spl.

It shows:

  • total events and noise reduction
  • latency baseline reporting
  • errors in the last hour
  • correlated incidents
  • incident candidate scoring
  • incident table with drill-down
  • selected incident AI summary
  • MCP investigation results
  • remediation status and lifecycle
  • incident timeline
  • MCP tool usage
  • investigation source
  • AI assistant usage
  • forecast summary and forecast risk tables

The dashboard also exposes the remediation workflow:

  • Investigate
  • Approve
  • Execute
  • Reject
  • Close

Challenges I Ran Into

One of the hardest parts was keeping the workflow evidence-driven instead of turning it into a black-box AI demo.

Key challenges:

  • Integrating the Splunk MCP Server with Codex, because the Splunk MCP configuration is JSON-based while Codex expects a TOML config
  • Activating Splunk AI Assistant, which was blocked by a token mismatch
  • Building statistical noise reduction without hiding important signals
  • Splunk search lag, which delayed the alert that triggers the incident webhook
  • Converting raw telemetry into incident candidates
  • Integrating Splunk MCP investigation workflows cleanly
  • Designing explainable RCA generation
  • Creating a safe human-in-the-loop remediation process
  • Keeping traceability between evidence, RCA, approvals, and remediation actions
  • Building the Splunk dashboard in XML so that it reflects multiple stages of the incident lifecycle, which requires further exploration of the features available in Splunk dashboards

Another challenge was balancing automation with operational safety. The system can accelerate investigation, but remediation decisions remain under human control.

Accomplishments

I built an end-to-end AgenticOps workflow that combines observability, AI-assisted investigation, and operational governance.

Highlights:

  • Statistical noise reduction before incident creation
  • Correlation-based incident candidate generation
  • MCP-powered evidence collection
  • Codex-assisted RCA with deterministic fallback
  • Human-governed remediation lifecycle
  • Incident state management from open to closed
  • Splunk write-back architecture
  • Operational intelligence dashboards
  • Forecast-ready architecture for future predictive capabilities

Most importantly, the project shows how AI can augment operations teams without removing human oversight from critical operational decisions.
