Inspiration
Modern operations teams are overwhelmed by alerts, dashboards, and fragmented troubleshooting workflows. Engineers often spend valuable time manually correlating logs, metrics, deployments, and historical incidents before they can identify a root cause. We wanted to build a system that acts like an experienced SRE on demand: one that automatically investigates incidents, connects the dots across observability data, and recommends actions within minutes.
OpsPilot AI was inspired by the vision of autonomous operations, where AI agents collaborate to reduce Mean Time To Resolution (MTTR) and allow teams to focus on solving problems rather than gathering context.
What it does
OpsPilot AI is an adaptive, AI-powered incident investigation and response platform built on Splunk MCP and LangGraph.
When an alert is triggered, OpsPilot AI:
- Collects context from Splunk through the Splunk MCP Server
- Detects anomalies and unusual system behavior
- Dynamically builds an investigation plan
- Orchestrates specialized AI agents to analyze logs, metrics, deployments, runbooks, and historical incidents
- Generates a root cause analysis (RCA)
- Produces a detailed incident timeline
- Recommends remediation actions
- Supports human approval workflows before execution
- Generates executive summaries and incident reports
- Streams investigation progress in real time through a live dashboard
The platform is domain-agnostic and can adapt to different environments, including cloud infrastructure, Kubernetes workloads, SaaS platforms, enterprise applications, and other Splunk-observable systems.
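As a rough illustration of how investigation progress could be streamed stage by stage, here is a minimal sketch using only the Python standard library. The stage names, the event fields, and the `investigate` and `collect` functions are assumptions made for this example, not OpsPilot AI's actual event schema.

```python
import asyncio
import json
from typing import AsyncIterator

# Hypothetical stage names that loosely mirror the steps listed above.
STAGES = [
    "collect_context",
    "detect_anomalies",
    "build_plan",
    "run_agents",
    "root_cause_analysis",
    "timeline",
    "recommend_remediation",
]


async def investigate(alert_id: str) -> AsyncIterator[str]:
    """Yield one JSON progress event per investigation stage."""
    for index, stage in enumerate(STAGES, start=1):
        # Placeholder for real work (Splunk searches, agent calls, and so on).
        await asyncio.sleep(0)
        yield json.dumps(
            {
                "alert_id": alert_id,
                "stage": stage,
                "progress": f"{index}/{len(STAGES)}",
            }
        )


async def collect(alert_id: str) -> list:
    """Drain the event stream into a list, as a stand-in for a dashboard client."""
    return [json.loads(event) async for event in investigate(alert_id)]


events = asyncio.run(collect("ALERT-123"))
```

In a FastAPI WebSocket endpoint, each yielded string could be forwarded to the dashboard as it is produced instead of being collected into a list.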
How we built it
We built OpsPilot AI using:
Splunk Platform
- Splunk Enterprise
- Splunk MCP Server
- SPL searches
- Splunk observability and operational data
Agentic AI Stack
- LangGraph for orchestration
- Gemini for reasoning and structured analysis
- Dynamic investigation planning
- Multi-agent workflows
Backend
- FastAPI
- Python
- Async MCP integrations
- WebSocket streaming
AI Agents
- Classification Agent
- Planner Agent
- Log Agent
- Metrics Agent
- Anomaly Agent
- Deployment Agent
- Runbook Agent
- Timeline Agent
- Memory Agent
- RCA Agent
- Remediation Agent
- Executive Summary Agent
Platform Features
- Historical incident memory
- Dynamic runbook retrieval
- Incident timelines
- Human-in-the-loop approvals
- Real-time investigation dashboards
- Automated report generation
Challenges we ran into
One of the biggest challenges was integrating the Splunk MCP Server with a local Splunk Enterprise environment. During setup we encountered KV Store failures caused by certificate and private key mismatches, preventing token generation and MCP authentication. Diagnosing and repairing the certificate chain required deep investigation into Splunk's internal security and KV Store infrastructure.
Another challenge was moving beyond a fixed workflow. Early versions of OpsPilot followed a static sequence of agents. We redesigned the system to support dynamic investigation planning, allowing the platform to adapt its workflow based on the incident type rather than relying on hardcoded logic.
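The difference between a static sequence and a plan chosen per incident can be sketched as follows. The agent names come from the list above, but the incident types, the routing table, and the `build_plan` function are hypothetical; the real planner presumably relies on LLM reasoning rather than a lookup table.

```python
from dataclasses import dataclass, field

# Agents assumed to run for every incident (an assumption for this sketch).
BASELINE_AGENTS = ["Classification Agent", "Log Agent"]
CLOSING_AGENTS = ["Timeline Agent", "RCA Agent", "Remediation Agent"]

# Hypothetical mapping from incident type to the extra agents worth running.
EXTRA_AGENTS_BY_TYPE = {
    "deployment_regression": ["Deployment Agent", "Metrics Agent"],
    "resource_saturation": ["Metrics Agent", "Anomaly Agent"],
    "recurring_failure": ["Memory Agent", "Runbook Agent"],
}

# Fallback used when the incident type is not recognized.
DEFAULT_EXTRA_AGENTS = ["Metrics Agent", "Memory Agent"]


@dataclass
class InvestigationPlan:
    incident_type: str
    steps: list = field(default_factory=list)


def build_plan(incident_type: str) -> InvestigationPlan:
    """Assemble an ordered agent list based on the classified incident type."""
    extras = EXTRA_AGENTS_BY_TYPE.get(incident_type, DEFAULT_EXTRA_AGENTS)
    return InvestigationPlan(
        incident_type=incident_type,
        steps=BASELINE_AGENTS + extras + CLOSING_AGENTS,
    )


plan = build_plan("deployment_regression")
```

A static workflow would return the same `steps` for every alert; here only the incidents that look deployment-related pay the cost of running the Deployment Agent.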
We also had to balance automation with operational safety by introducing approval workflows before remediation actions could be executed.
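One way to enforce that rule is to make execution fail unless an action has been explicitly approved. The sketch below assumes hypothetical `RemediationAction`, `review`, and `execute` names and is not taken from the OpsPilot AI codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class ApprovalRequired(Exception):
    """Raised when an action is executed without human approval."""


@dataclass
class RemediationAction:
    description: str
    status: ApprovalStatus = ApprovalStatus.PENDING
    reviewer: Optional[str] = None


def review(action: RemediationAction, reviewer: str, approve: bool) -> RemediationAction:
    """Record a human decision on a proposed action."""
    action.status = ApprovalStatus.APPROVED if approve else ApprovalStatus.REJECTED
    action.reviewer = reviewer
    return action


def execute(action: RemediationAction, runner: Callable[[str], str]) -> str:
    """Run the action only if a reviewer has approved it."""
    if action.status is not ApprovalStatus.APPROVED:
        raise ApprovalRequired(f"'{action.description}' is {action.status.value}")
    return runner(action.description)


action = RemediationAction("restart checkout-service")
review(action, reviewer="on-call engineer", approve=True)
result = execute(action, runner=lambda description: f"ran: {description}")
```

Keeping the check inside `execute` means that no caller, including an agent, can skip the approval step by accident.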
Accomplishments that we're proud of
- Successfully integrated Splunk MCP Server into a working agentic operations platform
- Built a fully functional multi-agent investigation workflow
- Implemented dynamic investigation planning and adaptive agent orchestration
- Created a real-time incident investigation dashboard
- Added historical incident memory and similarity search
- Generated automated root cause analyses and executive summaries
- Built human-in-the-loop remediation approval workflows
- Developed a domain-agnostic architecture capable of supporting multiple environments
- Completed end-to-end incident investigations with automatically generated reports
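The description above does not say how the similarity search over historical incidents is implemented. As a stand-in, the sketch below ranks past incidents by Jaccard similarity over word tokens; a production system would more likely use embeddings. The incident IDs, summaries, and function names are invented for illustration.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query: str, history: dict, top_k: int = 3) -> list:
    """Return up to top_k (incident_id, score) pairs with a non-zero score."""
    query_tokens = tokenize(query)
    scored = [
        (incident_id, jaccard(query_tokens, tokenize(summary)))
        for incident_id, summary in history.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [pair for pair in scored[:top_k] if pair[1] > 0]


# Hypothetical incident memory: ID -> short summary.
history = {
    "INC-1": "checkout latency spike after deployment",
    "INC-2": "disk full on logging node",
    "INC-3": "certificate expired for internal gateway",
}

matches = most_similar("latency spike in checkout after deploy", history)
```

With this sample data only `INC-1` shares tokens with the query, so it is the single match returned.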
What we learned
Through this project we learned that incident response is not just a data problem; it is a context problem. The most valuable capability was not simply retrieving logs, but enabling AI agents to reason across deployments, historical incidents, metrics, runbooks, and operational knowledge together.
We also learned the importance of adaptive workflows. Different incidents require different investigation paths, and selecting agents dynamically handles that variety better than a static automation pipeline can.
Most importantly, we gained hands-on experience building production-style agentic systems using MCP, LangGraph, FastAPI, and Splunk.
What's next for OpsPilot AI
Our vision is to evolve OpsPilot AI into a fully autonomous operations platform.
Future plans include:
- Expanding domain-specific investigation agents for security, networking, Kubernetes, and cloud operations
- Integrating Splunk dashboards and application packaging for native platform deployment
- Supporting auto-generated runbooks from resolved incidents
- Introducing predictive incident prevention and forecasting
- Enhancing remediation automation with policy-based guardrails
- Adding enterprise-scale persistence using Splunk KV Store and distributed storage backends
- Building deeper integrations with operational workflows and collaboration tools
Ultimately, we aim to transform OpsPilot AI from an incident investigation assistant into an adaptive AI operations engineer capable of continuously improving system reliability at scale.