Splunk Agentic Ops 🔍

Hackathon: Splunk Agentic Ops Hackathon — Observability Track

Live Demo: https://splunk-agentic-ops.vercel.app

Repo: github.com/kyisaiah47/splunk-agentic-ops


Inspiration

On-call engineers spend the first 15–30 minutes of any incident doing the same thing: running SPL queries, checking baselines, hunting for a recent deploy, correlating hosts. It's repetitive, stressful, and happens at 3am. I wanted to automate exactly that loop — not with a static runbook, but with an AI agent that reasons like a senior SRE.


What It Does

When a Splunk alert fires, Splunk Agentic Ops takes over the investigation autonomously:

  • Triage — confirms the alert, identifies which hosts and services are affected, and scopes the blast radius
  • Baseline Comparison — compares current error rates or latency against a 7-day rolling baseline at the same hour to distinguish noise from a real incident
  • Root Cause Investigation — checks deploy logs for recent releases, hunts for correlated infra events (OOM kills, CPU spikes, disk pressure), and traces upstream dependency failures
  • Structured Report — produces a root cause, confidence score (0–100%), first-seen timestamp, affected hosts, and a specific remediation recommendation
  • Slack Notification — posts a full Block Kit incident report automatically, no human needed
  • Live Dashboard — a real-time command center showing all investigations, their status, and results as they complete

The entire loop — alert fires to Slack report — completes in under 30 seconds.


How I Built It

Architecture

Splunk Enterprise
  ├── web_logs / app_logs / deploy_logs (HEC ingest)
  └── Saved Search Alert
            │  Webhook
            ▼
FastAPI Backend (Python)
  POST /webhook/splunk
            │
            ▼
  Background Task: investigate_alert()
            │
            ▼
Claude claude-sonnet-4-6 (Anthropic API)
  4-phase SRE Investigation Playbook
  Agentic tool-use loop (max 8 SPL calls)
        │
        ├── run_spl_search()
        ├── get_alert_context()
        └── get_index_summary()
        │
        ▼
  Splunk MCP Server  ──▶  Splunk Enterprise
  (or REST API fallback)
        │
        ▼
  Structured JSON Report
        │
        ├──▶  Slack (Block Kit)
        └──▶  Next.js Dashboard (real-time polling)

Claude doesn't execute a fixed script. It reads the incoming alert, decides which questions to answer, writes the SPL queries, and adjusts its investigation based on what the data actually shows — exactly like a human SRE would.

Tech Stack

Layer Technology
AI Model Claude claude-sonnet-4-6 (Anthropic)
Agent Framework Anthropic Tool Use API (agentic loop)
Splunk Integration Splunk MCP Server (primary) / Splunk REST API (fallback)
Backend Python 3.14, FastAPI, uvicorn
Frontend Next.js 16, Tailwind CSS v4, shadcn/ui
Notifications Slack Block Kit (incoming webhook)
Log Ingest Splunk HEC (HTTP Event Collector)

Splunk + MCP Integration

Splunk is the data source and the alerting engine — every investigation starts from a real Splunk saved-search webhook and every SPL query runs against live indexed data.

The MCP Server integration turns Splunk into a first-class AI tool surface:

  • The agent calls run_spl_search with a specific SPL query and time range — the MCP Server executes it against Splunk and returns structured results
  • get_alert_context fetches the triggering saved search definition so the agent understands what threshold was crossed
  • get_index_summary gives the agent metadata about the index before it starts querying

A REST API fallback using splunklib provides identical functionality when the MCP Server isn't available, making the system production-ready regardless of deployment environment.

The 8-query budget forces the agent to investigate like a skilled SRE — targeted and decisive, not exhaustive.


Challenges

  • Agentic budget management: Claude would sometimes exhaust all 8 tool calls before synthesizing a report. Solved with a hard limit that injects a "synthesize now" instruction when the budget is hit, guaranteeing a report even on complex incidents.
  • Splunk timestamp parsing: Splunk returns UTC timestamps without timezone offsets, causing JavaScript to misinterpret them as local time. Fixed by normalizing all ISO strings before parsing.
  • Port conflicts: Splunk Web occupies port 8000 by default, which is also the Next.js dev convention. The backend runs on 9000 to avoid the conflict — worth documenting for anyone running locally.
  • Tool result fidelity: When Splunk indexes have sparse data, Claude would sometimes hallucinate affected hosts. Prompt engineering — specifically instructing it to only report hosts that appear in actual query results — eliminated this.

What's Next

  • Persistent investigation store (Postgres) for long-term incident history and trend analysis
  • Auto-create Splunk incidents and annotate dashboards when a root cause is confirmed
  • Multi-alert correlation — link related investigations that fire within the same time window
  • Confidence calibration using historical investigation outcomes as feedback

Demo Scenarios

Trigger "High 5xx Error Rate" — the agent finds a v2.4.1 deploy to web-03 and web-07 two minutes before the spike and recommends an immediate rollback.

Trigger "P99 Latency SLO Breach" — the agent correlates slow response times across all 8 hosts, rules out deploys, and pages the infra oncall for a connection pool investigation.

Trigger "Database Connection Pool Exhausted" — the agent traces cascading failures from orders_db back to a traffic spike on /api/checkout and recommends a circuit breaker.


Team

Built solo by @kyisaiah47 for the Splunk Agentic Ops Hackathon, Observability Track.

Built With

Share this project:

Updates