Splunk SRE Agent

Inspiration

Modern SRE teams have more telemetry than ever, but during incidents the hard part is still turning logs, metrics, traces, alerts, and service events into a clear operational story. We wanted to build an AI assistant that lives directly inside Splunk, uses Splunk as the evidence source of truth, and helps teams move from “something looks wrong” to “here is what changed, what is impacted, and what we can safely do next.” Splunk SRE Agent was inspired by the Observability track: helping engineering, ITOps, and NetOps teams understand system behavior, detect anomalies earlier, and automate the first layer of operational response with AI while keeping humans in control of risky actions.

What it does

Splunk SRE Agent is a native Splunk app for AI-assisted observability investigation. It provides a Splunk Web chat UI where users can ask SRE questions such as latency spikes, error bursts, saturation signals, deployment impact, or service health summaries. The agent connects to a custom OpenAI-compatible LLM provider and uses Splunk MCP to run read-only Splunk searches for evidence. It can: Analyze service health from Splunk data. Detect anomaly candidates across logs, metrics, traces, and events. Use splunk_run_query through MCP for evidence-backed answers. Save chat sessions and generated SRE reports. Produce structured reports with timeline, root cause hypothesis, customer impact, confidence, and recommended next actions. Separate safe diagnostic suggestions from actions that require human approval.

How we built it

We built Splunk SRE Agent as a native Splunk app, following the Splunk Commander architecture pattern. The frontend is a static Splunk Web UI built with HTML, CSS, and JavaScript. The backend is a persistent Python REST handler exposed through Splunk restmap.conf. App state is stored in Splunk KV Store, while model and MCP credentials are stored securely through Splunk storage/passwords. The AI layer uses a custom OpenAI-compatible chat completions provider, so the app can work with different model backends. For Splunk access, the agent uses Splunk MCP and a guarded splunk_run_query tool. We added read-only SPL validation so the model can gather evidence without performing destructive or mutating operations. The app is packaged as a deployable .spl file and includes documentation, examples, setup instructions, an architecture diagram, and an MIT license.

Challenges we ran into

One major challenge was making the app behave correctly inside Splunk Web, not just in a local static preview. Splunk injects i18n translation wrappers into app static JavaScript files, which caused scripts to fail before button handlers could bind. We fixed this by adding the right global i18n_register shim and cache-busting the Splunk static assets. Another challenge was balancing agent autonomy with operational safety. We wanted the assistant to be useful during incidents, but not dangerous. That meant enforcing read-only Splunk queries, blocking destructive SPL patterns, avoiding hardcoded secrets, and clearly marking actions that require human approval. We also had to design the workflow so answers were evidence-backed rather than generic LLM summaries. The agent needed to call MCP tools, inspect Splunk rows, cite the evidence, and turn that into a useful SRE narrative.

Accomplishments that we're proud of

We are proud that Splunk SRE Agent feels like a real Splunk-native workflow rather than an external chatbot bolted onto the side. The app has: A native Splunk Web experience. Custom LLM provider support. Splunk MCP integration. Secure credential storage. Read-only guardrails for operational safety. Persistent chat and report history. Structured SRE report generation. Clear documentation, setup instructions, examples, license, and architecture diagram. We are also proud of solving the Splunk-specific static asset issue and verifying that the UI buttons work in Splunk-served pages, not only in local preview.

What we learned

We learned that agentic observability is most useful when the AI is grounded in live operational evidence. A good SRE agent should not simply “sound smart”; it should show what data it used, what query was run, what rows came back, and how confident it is. We also learned that Splunk is a strong platform for agentic operations because it already contains the operational history, telemetry, permissions, and auditability that incident workflows need. MCP gives the LLM a clean tool interface, while Splunk provides the system of record. Finally, we learned that building inside Splunk Web has practical integration details, especially around static assets, caching, authentication, and app packaging. Those details matter if the goal is a deployable native app rather than a demo-only prototype.

What's next for Splunk SRE Agent

Next, we want to expand Splunk SRE Agent from investigation assistant into a fuller agentic operations cockpit. Planned improvements include: Deeper anomaly detection across metrics, logs, traces, deploys, and synthetic checks. Service topology awareness and dependency mapping. Prebuilt runbook templates for common SRE workflows. Approval-based remediation workflows for restart, scale, rollback, or failover actions. Better report export and sharing. More MCP tools for observability, incident management, and change intelligence. Evaluation tests that measure evidence quality, SPL safety, and response usefulness. The long-term goal is to help teams reduce mean time to understand, not just mean time to acknowledge.

Built With

Updates

Peter Huang started this project — Jun 15, 2026 03:45 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.