Splunk Agentic Ops Incident Copilot

Inspiration

Modern operations teams are overwhelmed by telemetry. A single infrastructure issue can produce many logs, alerts, and symptoms across services, which makes it hard to separate noise from the signals that matter.

This project explores a more evidence-driven approach to incident response. Instead of surfacing more raw alerts, it uses Splunk, MCP, FastAPI, Python, and Codex-assisted investigation to help teams move from detection to diagnosis and remediation faster while keeping humans in control of critical decisions.

What It Does

Splunk Agentic Ops Incident Copilot is an AgenticOps platform that turns operational telemetry into incident intelligence.

The workflow is:

Python generates realistic telemetry and controlled incident bursts.
Splunk ingests the JSONL logs and computes statistical baselines.
Noise reduction filters routine events and highlights signals.
Correlation groups related signals into incident candidates.
Splunk alerts or dashboard actions can trigger a webhook into FastAPI.
Python queries Splunk MCP for evidence and metadata.
Codex can run a second-pass RCA when available.
The app writes investigation, forecast, timeline, and remediation results back into logs.
Splunk re-ingests those logs so the dashboard reflects the full incident lifecycle.

Key capabilities:

Statistical noise reduction before incident generation
Signal correlation and candidate creation
MCP-powered evidence collection from Splunk
AI-assisted RCA with deterministic fallback
Human-in-the-loop remediation approval
Incident lifecycle management
Splunk write-back of investigation and remediation outcomes
Operational dashboards for visibility and governance

How I Built It

The repo is built around a few clear layers:

app/telemetry.py generates normal traffic plus anomalies such as latency regressions, database timeouts, auth failures, CPU saturation, memory pressure, and deployment regressions.
app/main.py exposes the FastAPI endpoints for incident creation, webhook triage, investigation, approval, execution, closeout, and dashboard rendering.
app/splunk_mcp_client.py connects to the Splunk MCP server and calls tools such as splunk_get_info, splunk_get_indexes, splunk_get_metadata, and splunk_run_query.
app/llm_agent.py builds a structured RCA prompt for Codex and normalizes the returned JSON output.
app/decision_engine.py provides deterministic RCA and safe remediation guidance when AI is unavailable or should not be trusted alone.
app/storage.py writes incident, correlation, remediation, timeline, triage, forecast, and MCP metrics records to JSONL logs.
splunk/dashboard_simple.xml and splunk/dashboard_panels.spl define the dashboard experience and the reusable SPL panels.

Architecture

+---------------------------------------------------------------------------------+
|                         DATA GENERATION & INGESTION                             |
|  [telemetry.py] ---> (data/app.log) ---> [Splunk File Monitor] ---> [Splunk]    |
+---------------------------------------------------------------------------------+
                                                                       |
                                                                       v
+---------------------------------------------------------------------------------+
|                         SPLUNK ANALYTICS & VISUALIZATION                       |
|   [Statistical Baselines] -> [Noise Reduction] -> [Correlation & Candidate]     |
+---------------------------------------------------------------------------------+
                                                                       |
         +------------------ (Webhook / Dashboard Action) -------------+
         |
         v
+---------------------------------------------------------------------------------+
|                          FASTAPI ORCHESTRATION & AGENTIC PLANE                  |
|                      +---> [splunk_mcp_client.py] ---> (Splunk MCP Server)      |
|  [main.py] (FastAPI) |                                                          |
|                      +---> [decision_engine.py] (Deterministic RCA Fallback)    |
|                      +---> [llm_agent.py] ------> (Codex LLM Engine)            |
+---------------------------------------------------------------------------------+
         |
         v
+---------------------------------------------------------------------------------+
|                            LOG STORAGE & WRITEBACK                              |
|  (data/incidents.json) & [Splunk HEC Writeback] ---> [Splunk Dashboard Refresh] |
+---------------------------------------------------------------------------------+

The Agentic Investigation & RCA Architecture

                 [ FastAPI Webhook Triggered ]
                               |
                               v
               [ Step 1: Splunk MCP Discovery ]
         Query metadata, indexes, and surrounding context
                               |
                               v
               +---------------+---------------+
               |                               |
               v                               v
    [ Track A: Codex Agent ]      [ Track B: Deterministic Engine ]
    Build structured JSON prompt   Evaluate hardcoded rules & bounds
               |                               |
               +---------------+---------------+
                               |
                               v
               [ Step 3: Synthesis & Merge ]
         Fallback applied if Codex fails or deviates

State Machine & Human-In-The-Loop (HITL) Workflow

[ Open ] ---> [ Triage / Investigating ] ---> [ Pending Approval ]
                                                       |
                        +------------------------------+
                        |                              |
                        v                              v
               [ Executing Action ]             [ Rejected / Closed ]
                        |
                        v
                 [ Post-Closeout ]
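
The lifecycle in the diagram can be read as a small transition table. The following is a minimal sketch, not the repo's actual code: the State enum, the ALLOWED mapping, and the advance helper are hypothetical names, and it assumes that Rejected and Closed are terminal and that execution always ends in closeout.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    INVESTIGATING = "investigating"
    PENDING_APPROVAL = "pending_approval"
    EXECUTING = "executing"
    REJECTED = "rejected"
    CLOSED = "closed"


# Hypothetical transition table mirroring the diagram.
# Terminal states map to an empty set (an assumption, not stated in the source).
ALLOWED = {
    State.OPEN: {State.INVESTIGATING},
    State.INVESTIGATING: {State.PENDING_APPROVAL},
    State.PENDING_APPROVAL: {State.EXECUTING, State.REJECTED},
    State.EXECUTING: {State.CLOSED},
    State.REJECTED: set(),
    State.CLOSED: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the edges in one table means an incident cannot reach Executing without first passing through Pending Approval, which is the human-in-the-loop guarantee the diagram describes.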

Data Flow Between Components

Python to Splunk

Python writes telemetry and workflow records to disk. Splunk file inputs monitor those JSONL files and parse them into searchable events.
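
As a rough illustration of that write path, the sketch below appends one JSON object per line so a file monitor can tail the file. The helper names (append_jsonl, make_event) and the event field names are assumptions for illustration, not the contents of app/telemetry.py or app/storage.py.

```python
import json
import time
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line so a file monitor can tail it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, separators=(",", ":")) + "\n")


def make_event(service: str, latency_ms: float, status: int) -> dict:
    """Build a telemetry event; the field names are illustrative assumptions."""
    return {
        "timestamp": time.time(),
        "service": service,
        "latency_ms": latency_ms,
        "status": status,
    }
```

For example, append_jsonl(Path("data/app.log"), make_event("checkout", 120.5, 200)) adds one line to the monitored file.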

Splunk to Python

Splunk can hand off to FastAPI in two ways:

Dashboard action links call the API endpoints directly.
A saved correlation alert can POST a webhook payload to /webhook/splunk-alert.

The webhook handler accepts both top-level fields and nested result fields, including:

search_name, alert_name, or name
host
service
incident_id or sid
severity
trigger_time, triggered_time, or time
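
One way to handle those aliases is to look up each field under every accepted name, first at the top level and then inside the nested result object. This is a sketch under assumptions: first_present and normalize_alert are hypothetical helpers, and the precedence (top level before nested, aliases in the order listed) is a guess rather than the handler's documented behavior.

```python
from typing import Any, Optional


def first_present(payload: dict, keys: tuple) -> Optional[Any]:
    """Return the first non-empty value for any key, checking top level then `result`."""
    nested = payload.get("result")
    sources = [payload, nested if isinstance(nested, dict) else {}]
    for source in sources:
        for key in keys:
            value = source.get(key)
            if value not in (None, ""):
                return value
    return None


def normalize_alert(payload: dict) -> dict:
    """Collapse the accepted aliases into one canonical set of fields."""
    return {
        "alert_name": first_present(payload, ("search_name", "alert_name", "name")),
        "host": first_present(payload, ("host",)),
        "service": first_present(payload, ("service",)),
        "incident_id": first_present(payload, ("incident_id", "sid")),
        "severity": first_present(payload, ("severity",)),
        "trigger_time": first_present(payload, ("trigger_time", "triggered_time", "time")),
    }
```

Fields that are missing everywhere come back as None, which leaves it to the caller to decide whether to reject the alert or create a partial incident.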

MCP and Codex

The app uses Splunk MCP as the evidence layer. The MCP client is responsible for:

startup verification with splunk_get_info
index discovery with splunk_get_indexes
metadata discovery with splunk_get_metadata
evidence collection with splunk_run_query
MCP tool metrics writeback into data/mcp_metrics.log
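
The metrics writeback could be implemented as a thin wrapper around each tool call. The sketch below is an assumption about how that might look: call_with_metrics is a hypothetical helper, the record fields are illustrative, and the actual MCP call is passed in as a plain callable so the example does not depend on any particular MCP client library.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def call_with_metrics(
    tool_name: str,
    call: Callable[[], Any],
    metrics_path: Path = Path("data/mcp_metrics.log"),
) -> Any:
    """Run one MCP tool call and append a timing record, even if the call raises."""
    started = time.perf_counter()
    status = "ok"
    try:
        return call()
    except Exception:
        status = "error"
        raise
    finally:
        # Record fields are illustrative assumptions, not the repo's schema.
        record = {
            "tool": tool_name,
            "status": status,
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
            "timestamp": time.time(),
        }
        metrics_path.parent.mkdir(parents=True, exist_ok=True)
        with metrics_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
```

Writing the record in a finally block means failed tool calls are counted too, which matters if the dashboard is meant to show MCP reliability as well as usage.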

The Codex RCA agent:

builds a structured JSON prompt from incident context, evidence, and raw events
asks Codex to return only JSON
normalizes severity, confidence, root cause, and action recommendations
falls back to deterministic RCA if the CLI is unavailable or the output is invalid
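
The normalize-or-fall-back step might look roughly like the sketch below. It is not the code in app/llm_agent.py: the function name normalize_rca, the JSON keys (root_cause, severity, confidence, actions), the severity scale, and the 0 to 1 confidence range are all assumptions made for illustration.

```python
import json

SEVERITIES = {"low", "medium", "high", "critical"}  # assumed severity scale


def normalize_rca(raw_output: str, fallback: dict) -> dict:
    """Parse model output as JSON; return the deterministic result if it is unusable."""
    try:
        data = json.loads(raw_output)
    except (TypeError, ValueError):
        return {**fallback, "source": "deterministic"}
    if not isinstance(data, dict) or not data.get("root_cause"):
        return {**fallback, "source": "deterministic"}

    severity = str(data.get("severity", "")).strip().lower()
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    actions = data.get("actions")
    if not isinstance(actions, list):
        actions = []

    return {
        "root_cause": str(data["root_cause"]).strip(),
        # Unknown severity labels fall back to the deterministic engine's value.
        "severity": severity if severity in SEVERITIES else fallback["severity"],
        "confidence": min(max(confidence, 0.0), 1.0),  # clamp to the assumed range
        "actions": [str(a) for a in actions if a],
        "source": "codex",
    }
```

Tagging each result with a source field is one way to feed the "investigation source" panel, so reviewers can see whether an RCA came from the model or from the rules.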

Statistical Analysis and Correlation

Splunk computes latency baselines and z-scores to separate noise from meaningful signals.
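
In the project this computation runs in Splunk, but the arithmetic is easy to show in Python. The sketch below is illustrative only: the function names and the threshold of 3 standard deviations are assumptions, since the actual cut-off used in the SPL is not stated here.

```python
from statistics import mean, pstdev


def latency_z_scores(baseline_ms: list, observed_ms: list) -> list:
    """Score each observation against the baseline mean and standard deviation."""
    mu = mean(baseline_ms)
    sigma = pstdev(baseline_ms)
    if sigma == 0:
        return [0.0 for _ in observed_ms]  # flat baseline: no meaningful z-score
    return [(value - mu) / sigma for value in observed_ms]


def is_signal(z: float, threshold: float = 3.0) -> bool:
    """Treat anything beyond the (assumed) threshold as a signal, the rest as noise."""
    return abs(z) >= threshold
```

With a baseline of 100, 110, 90, and 100 ms, an observation of 150 ms sits about seven standard deviations above the mean and would be kept as a signal, while 100 ms scores zero and would be filtered as noise.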

Correlation groups events into 5-minute windows and scores them using:

signal count
unique signal types
host diversity
endpoint diversity
latency spikes and z-scores
CPU and memory pressure
repeated failures on the same incident id

This produces a candidate incident severity that appears in the dashboard and can trigger the webhook flow.
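
The windowing and scoring can be sketched as follows. This is a simplified illustration that uses only a few of the listed factors. The weights, the severity cut-offs, and the event field names (timestamp, signal_type, host, z_score) are hypothetical, because the source lists the scoring inputs but not how they are weighted.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute correlation window, per the text


def score_windows(events: list) -> dict:
    """Bucket events by 5-minute window and compute an illustrative score per bucket."""
    buckets = defaultdict(list)
    for event in events:
        window_start = int(event["timestamp"] // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[window_start].append(event)

    scores = {}
    for window_start, group in buckets.items():
        signal_types = {e["signal_type"] for e in group}
        hosts = {e["host"] for e in group}
        max_z = max((e.get("z_score", 0.0) for e in group), default=0.0)
        # Hypothetical weights: count + 2 per distinct type + 2 per host + peak z-score.
        scores[window_start] = (
            len(group) + 2 * len(signal_types) + 2 * len(hosts) + max(max_z, 0.0)
        )
    return scores


def severity_for(score: float) -> str:
    """Map a score to a severity label using assumed cut-offs."""
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"
```

Weighting diversity (distinct signal types and hosts) above raw count reflects the idea that several different symptoms across hosts are stronger evidence of an incident than one noisy source repeating itself.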

Alert Triggering Through Webhook

When a Splunk alert fires, the webhook path hydrates or creates the incident record, gathers surrounding evidence through Splunk MCP, runs deterministic RCA, optionally enhances the result with Codex, and writes the result back to the local logs and, when configured, to Splunk via HEC.

HEC Writeback

HEC is used to write the AI triage summary back into Splunk as a dedicated event stream.

If SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN are configured, the app posts a triage event containing:

incident metadata
severity
confidence score
AI summary
root cause summary
MCP evidence summary
alert payload
writeback status

If HEC is unavailable or fails, triage still completes and the dashboard still reflects the incident.
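
A best-effort writeback along those lines could be sketched with the standard library alone. The helper names and the sourcetype value are hypothetical, and the sketch assumes SPLUNK_HEC_URL holds the full HEC event endpoint URL. The "Authorization: Splunk <token>" header and the top-level "event" key follow the HEC event format.

```python
import json
import os
import urllib.request
from typing import Optional


def build_hec_event(triage: dict, sourcetype: str = "agentic_ops:triage") -> dict:
    """Wrap a triage record in the envelope HEC's event endpoint expects."""
    return {"event": triage, "sourcetype": sourcetype}  # sourcetype name is assumed


def post_triage(triage: dict, timeout: float = 5.0) -> Optional[int]:
    """Post to HEC if configured; return the HTTP status, or None if skipped or failed."""
    url = os.environ.get("SPLUNK_HEC_URL")
    token = os.environ.get("SPLUNK_HEC_TOKEN")
    if not url or not token:
        return None  # HEC not configured: triage continues without writeback
    request = urllib.request.Request(
        url,
        data=json.dumps(build_hec_event(triage)).encode("utf-8"),
        headers={"Authorization": f"Splunk {token}", "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except (OSError, ValueError):
        return None  # network, HTTP, or URL error: do not block triage
```

Returning None instead of raising keeps the writeback optional: the caller can record a writeback status of skipped or failed and still finish triage.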

Dashboard Features in Splunk

The dashboard is backed by splunk/dashboard_simple.xml and splunk/dashboard_panels.spl.

It shows:

total events and noise reduction
latency baseline reporting
errors in the last hour
correlated incidents
incident candidate scoring
incident table with drill-down
selected incident AI summary
MCP investigation results
remediation status and lifecycle
incident timeline
MCP tool usage
investigation source
AI assistant usage
forecast summary and forecast risk tables

The dashboard also exposes the remediation workflow:

Investigate
Approve
Execute
Reject
Close

Challenges I Ran Into

One of the hardest parts was keeping the workflow evidence-driven instead of turning it into a black-box AI demo.

Key challenges:

Integrating the Splunk MCP Server with Codex, because Splunk's JSON-based config input does not match Codex's TOML config setup
Activating Splunk AI Assistant, which was blocked by a token mismatch
Building statistical noise reduction without hiding important signals
Splunk search lag delaying the alert that triggers the incident webhook
Converting raw telemetry into incident candidates
Integrating Splunk MCP investigation workflows cleanly
Designing explainable RCA generation
Creating a safe human-in-the-loop remediation process
Keeping traceability between evidence, RCA, approvals, and remediation actions
Building the Splunk dashboard in XML so that it reflects multiple stages of the incident lifecycle, which required further exploration of the features available in Splunk dashboards

Another challenge was balancing automation with operational safety. The system can accelerate investigation, but remediation decisions remain under human control.

Accomplishments

I built an end-to-end AgenticOps workflow that combines observability, AI-assisted investigation, and operational governance.

Highlights:

Statistical noise reduction before incident creation
Correlation-based incident candidate generation
MCP-powered evidence collection
Codex-assisted RCA with deterministic fallback
Human-governed remediation lifecycle
Incident state management from open to closed
Splunk write-back architecture
Operational intelligence dashboards
Forecast-ready architecture for future predictive capabilities

Most importantly, the project shows how AI can augment operations teams without removing human oversight from critical operational decisions.

Built With

3.2
apis
code
codex
enterprise
fastapi
git
github
json
llama
mcp
ollama
python
rcaagent
rest
server
splunk
uvicorn
vs
windows

Updates

Ravindra Kumar Venkata Raju Adapa started this project — Jun 14, 2026 10:43 AM EDT
