Inspiration

Modern operations teams are overwhelmed by alerts, dashboards, and fragmented troubleshooting workflows. Engineers often spend valuable time manually correlating logs, metrics, deployments, and historical incidents before identifying a root cause. We wanted to build a system that could act like an experienced SRE on demand—automatically investigating incidents, connecting the dots across observability data, and recommending actions within minutes.

OpsPilot AI was inspired by the vision of autonomous operations, where AI agents collaborate to reduce Mean Time To Resolution (MTTR) and allow teams to focus on solving problems rather than gathering context.

What it does

OpsPilot AI is an adaptive, AI-powered incident investigation and response platform built on the Splunk MCP Server and LangGraph.

When an alert is triggered, OpsPilot AI:

Collects context from Splunk through the Splunk MCP Server
Detects anomalies and unusual system behavior
Dynamically builds an investigation plan
Orchestrates specialized AI agents to analyze logs, metrics, deployments, runbooks, and historical incidents
Generates a root cause analysis (RCA)
Produces a detailed incident timeline
Recommends remediation actions
Supports human approval workflows before execution
Generates executive summaries and incident reports
Streams investigation progress in real time through a live dashboard
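
To illustrate how progress from the steps above could reach a live dashboard, here is a minimal sketch that uses only Python's standard asyncio library. The step names, the event fields, and the queue-based hand-off are assumptions made for illustration; the real platform streams over WebSockets and its internal event format is not described here.

```python
import asyncio

# Hypothetical step names loosely mirroring the list above.
STEPS = ["collect_context", "detect_anomalies", "build_plan", "run_agents", "generate_rca"]


async def run_investigation(alert_id: str, events: asyncio.Queue) -> None:
    """Run each step in order and publish a progress event after it finishes."""
    for index, step in enumerate(STEPS, start=1):
        await asyncio.sleep(0)  # placeholder for real async work, e.g. an MCP call
        await events.put(
            {"alert_id": alert_id, "step": step, "progress": f"{index}/{len(STEPS)}"}
        )
    await events.put(None)  # sentinel: the investigation is finished


async def collect_events(alert_id: str) -> list:
    """Stand-in for the dashboard: consume events as they arrive."""
    events: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(run_investigation(alert_id, events))
    received = []
    while True:
        event = await events.get()
        if event is None:
            break
        received.append(event)
    await producer
    return received


if __name__ == "__main__":
    for event in asyncio.run(collect_events("alert-123")):
        print(event)
```

In a WebSocket setting, the consumer loop would forward each event to the connected client instead of appending it to a list.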

The platform is domain-agnostic and can adapt to different environments, including cloud infrastructure, Kubernetes workloads, SaaS platforms, enterprise applications, and other Splunk-observable systems.

How we built it

We built OpsPilot AI using:

Splunk Platform

Splunk Enterprise
Splunk MCP Server
SPL searches
Splunk observability and operational data

Agentic AI Stack

LangGraph for orchestration
Gemini for reasoning and structured analysis
Dynamic investigation planning
Multi-agent workflows

Backend

FastAPI
Python
Async MCP integrations
WebSocket streaming

AI Agents

Classification Agent
Planner Agent
Log Agent
Metrics Agent
Anomaly Agent
Deployment Agent
Runbook Agent
Timeline Agent
Memory Agent
RCA Agent
Remediation Agent
Executive Summary Agent

Platform Features

Historical incident memory
Dynamic runbook retrieval
Incident timelines
Human-in-the-loop approvals
Real-time investigation dashboards
Automated report generation

Challenges we ran into

One of the biggest challenges was integrating the Splunk MCP Server with a local Splunk Enterprise environment. During setup, we hit KV Store failures caused by a mismatch between the certificate and its private key, which blocked token generation and MCP authentication. Diagnosing and repairing the certificate chain required a deep investigation into Splunk's internal security and KV Store infrastructure.

Another challenge was moving beyond a fixed workflow. Early versions of OpsPilot AI ran the same static sequence of agents for every alert. We redesigned the system to support dynamic investigation planning, so the platform now chooses which agents to run, and in what order, based on the incident type rather than hardcoded logic.
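
The idea of planning per incident rather than running a fixed sequence can be sketched as follows. This is a simplified, rule-based stand-in: the incident types, the `PLAYBOOK` mapping, and the `build_plan` function are hypothetical names invented for illustration, and the actual planner may rely on model reasoning rather than a lookup table.

```python
from dataclasses import dataclass

# Hypothetical mapping from incident type to the agents worth running first.
# Agent names follow the agent list in this write-up; the incident types are invented.
PLAYBOOK = {
    "deployment_regression": ["deployment", "log", "metrics"],
    "resource_exhaustion": ["metrics", "anomaly", "log"],
    "unknown": ["log", "metrics", "anomaly", "deployment"],
}

# Agents that close out every plan in this sketch.
CLOSING_AGENTS = ["timeline", "rca", "remediation"]


@dataclass
class Incident:
    title: str
    incident_type: str = "unknown"
    recent_deploy: bool = False


def build_plan(incident: Incident) -> list:
    """Return an ordered list of agent names for this incident."""
    # Unrecognized types fall back to the broad "unknown" plan.
    steps = list(PLAYBOOK.get(incident.incident_type, PLAYBOOK["unknown"]))
    # Adapt the plan: a recent deploy moves the deployment agent to the front.
    if incident.recent_deploy:
        if "deployment" in steps:
            steps.remove("deployment")
        steps.insert(0, "deployment")
    return steps + CLOSING_AGENTS


if __name__ == "__main__":
    print(build_plan(Incident("High CPU on checkout", "resource_exhaustion")))
    print(build_plan(Incident("High CPU on checkout", "resource_exhaustion", recent_deploy=True)))
```

The point of the sketch is the shape of the decision: the plan is data produced at run time, so adding a new incident type means adding an entry rather than rewriting the workflow.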

We also had to balance automation with operational safety by introducing approval workflows before remediation actions could be executed.
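
A minimal sketch of such an approval gate is shown below, assuming a simple status field on each proposed action. The `RemediationAction`, `approve`, and `execute` names are hypothetical and not taken from the OpsPilot AI codebase; the sketch only shows the rule that nothing runs until a human has approved it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    EXECUTED = "executed"


class ApprovalRequired(Exception):
    """Raised when execution is attempted on an action that is not approved."""


@dataclass
class RemediationAction:
    description: str
    status: Status = Status.PENDING
    approver: Optional[str] = None


def approve(action: RemediationAction, approver: str) -> RemediationAction:
    """Record who approved the action and unlock execution."""
    action.status = Status.APPROVED
    action.approver = approver
    return action


def execute(action: RemediationAction, runner: Callable[[str], str]) -> str:
    """Run the action only if a human has approved it."""
    if action.status is not Status.APPROVED:
        raise ApprovalRequired(f"'{action.description}' has not been approved")
    result = runner(action.description)
    action.status = Status.EXECUTED
    return result


if __name__ == "__main__":
    action = RemediationAction("restart checkout-service")
    approve(action, approver="on-call engineer")
    print(execute(action, runner=lambda description: f"ran: {description}"))
```

Because the status moves to `EXECUTED` after a run, the same approval cannot be reused to execute the action a second time.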

Accomplishments that we're proud of

Successfully integrated Splunk MCP Server into a working agentic operations platform
Built a fully functional multi-agent investigation workflow
Implemented dynamic investigation planning and adaptive agent orchestration
Created a real-time incident investigation dashboard
Added historical incident memory and similarity search
Generated automated root cause analyses and executive summaries
Built human-in-the-loop remediation approval workflows
Developed a domain-agnostic architecture capable of supporting multiple environments
Ran end-to-end incident investigations that produced automatically generated reports
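
As an illustration of the historical incident memory and similarity search mentioned in the list above, the sketch below ranks past incidents by bag-of-words cosine similarity. This write-up does not describe how similarity is actually computed, so the scoring method, the `most_similar` function, and the sample incident summaries are all assumptions; an embedding-based search would follow the same ranking pattern.

```python
import math
import re
from collections import Counter

# Invented sample data standing in for stored incident summaries.
HISTORY = {
    "INC-1": "checkout service latency spike after deployment",
    "INC-2": "disk full on logging node",
    "INC-3": "certificate expired on api gateway",
}


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query: str, history: dict, top_k: int = 1) -> list:
    """Return the top_k (incident_id, score) pairs, best match first."""
    query_vector = tokenize(query)
    scored = [
        (cosine(query_vector, tokenize(summary)), incident_id)
        for incident_id, summary in history.items()
    ]
    scored.sort(reverse=True)
    return [(incident_id, score) for score, incident_id in scored[:top_k]]


if __name__ == "__main__":
    print(most_similar("latency spike in checkout service", HISTORY))
```

With the sample data, a query about a checkout latency spike ranks INC-1 first because it shares the most terms with that summary.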

What we learned

Through this project we learned that incident response is not just a data problem—it is a context problem. The most valuable capability was not simply retrieving logs, but enabling AI agents to reason across deployments, historical incidents, metrics, runbooks, and operational knowledge simultaneously.

We also learned the importance of adaptive workflows. Different incidents require different investigation paths, and selecting agents dynamically handled that variety better than a static automation pipeline did.

Most importantly, we gained hands-on experience building production-style agentic systems using MCP, LangGraph, FastAPI, and Splunk.

What's next for OpsPilot AI

Our vision is to evolve OpsPilot AI into a fully autonomous operations platform.

Future plans include:

Expanding domain-specific investigation agents for security, networking, Kubernetes, and cloud operations
Integrating Splunk dashboards and application packaging for native platform deployment
Supporting auto-generated runbooks from resolved incidents
Introducing predictive incident prevention and forecasting
Enhancing remediation automation with policy-based guardrails
Adding enterprise-scale persistence using Splunk KV Store and distributed storage backends
Building deeper integrations with operational workflows and collaboration tools

Ultimately, we aim to transform OpsPilot AI from an incident investigation assistant into an adaptive AI operations engineer capable of continuously improving system reliability at scale.

Built With

fastapi
gemini
langgraph
python
react
splunk
splunkmcp

Updates

kshitij kumrawat started this project — Jun 15, 2026 11:36 AM EDT
