Inspiration
Every SRE and security engineer knows the 3 AM pager pain. An alert fires, you scramble to correlate logs across Splunk indexes, chase down the root cause, and manually apply fixes—often repeating the same process for recurring issues. We asked: What if AI agents could handle the entire incident lifecycle autonomously?
The Splunk MCP (Model Context Protocol) Server opened a new paradigm by giving AI agents direct, structured access to Splunk's observability and security data. We realized this could become the "nervous system" for a new kind of autonomous operations platform in which specialized agents collaborate in real time to detect, diagnose, and remediate incidents, with a human needed only to approve the fix.
What it does
AegisOps is an autonomous multi-agent platform that unifies Security, Observability, and Platform Engineering into a single intelligent system:
Autonomous Detection: Continuously monitors Splunk for anomalies (latency spikes, error surges, security threats, auth attacks) and automatically triggers incident workflows
Parallel Agent Analysis:
- Healer Agent analyzes APM traces, latency patterns, and error logs
- Sentinel Agent cross-references firewall logs, threat indicators, and auth patterns
- Agents run in parallel and query Splunk MCP for real-time data
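The parallel fan-out described above can be sketched in plain TypeScript. This is a minimal illustration, not the project's code: the `Incident` and `Finding` shapes and the stand-in agents are hypothetical, and the real agents would query Splunk through MCP rather than return canned summaries.

```typescript
// Hypothetical shapes; the actual AegisOps types are not shown in this write-up.
interface Incident { id: string; service: string; }
interface Finding { agent: string; summary: string; }
type Agent = (incident: Incident) => Promise<Finding>;

// Stand-in agents. Real ones would call Splunk MCP tools here.
const healer: Agent = async (incident) => ({
  agent: "healer",
  summary: `checked traces and error logs for ${incident.service}`,
});
const sentinel: Agent = async (incident) => ({
  agent: "sentinel",
  summary: `checked firewall and auth logs for ${incident.service}`,
});

// Start every agent at once and wait for all of them.
// A failing agent is mapped to null so it cannot discard the other findings.
async function analyzeInParallel(incident: Incident, agents: Agent[]): Promise<Finding[]> {
  const settled = await Promise.all(
    agents.map((run) => run(incident).catch(() => null)),
  );
  return settled.filter((f): f is Finding => f !== null);
}
```

Catching per agent is one possible design choice: a security analysis still reaches the correlator even if the infrastructure analysis times out.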
Intelligent Correlation: A Correlator synthesizes the agents' findings to classify the incident as infrastructure, security, or mixed, with confidence scoring based on institutional memory
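One way such a classification with memory-weighted confidence could work is sketched below. The scoring formula, the 0.5 threshold, and the `Signals` and `Memory` shapes are all assumptions made for illustration; the write-up does not describe the actual scoring method.

```typescript
type Category = "infrastructure" | "security" | "mixed";

// Hypothetical inputs: per-agent evidence scores in [0, 1] and counts of
// past human decisions on similar incidents (the "institutional memory").
interface Signals { infra: number; security: number; }
interface Memory { approved: number; rejected: number; }

function correlate(
  signals: Signals,
  memory: Memory,
  threshold = 0.5,
): { category: Category; confidence: number } {
  const bothStrong = signals.infra >= threshold && signals.security >= threshold;
  // Ties between the two scores fall back to "infrastructure".
  const category: Category = bothStrong
    ? "mixed"
    : signals.security > signals.infra ? "security" : "infrastructure";

  // Evidence strength: the weaker signal for "mixed", the stronger one otherwise.
  const base = bothStrong
    ? Math.min(signals.infra, signals.security)
    : Math.max(signals.infra, signals.security);

  // Blend evidence with the historical approval rate. The weight on history
  // grows with the number of past decisions (k = 5 is an arbitrary constant).
  const n = memory.approved + memory.rejected;
  const rate = n === 0 ? 0 : memory.approved / n;
  const weight = n / (n + 5);
  return { category, confidence: (1 - weight) * base + weight * rate };
}
```

With no history the confidence equals the raw evidence score; as approvals and rejections accumulate, the historical approval rate increasingly dominates.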
Auto-Remediation with Human Oversight:
- Generates WAF rules, network isolation, Edge Processor rules
- Analyzes actual code from GitHub and creates PRs with fixes
- Human approval gate before execution
Learning Loop: Every approved/rejected action feeds back into Splunk as "institutional memory," making future diagnoses faster and more accurate
How we built it
- Backend: Node.js + TypeScript + Express with WebSocket for real-time streaming
- Agent Orchestration: LangGraph JS for parallel agent execution with state management
- LLM: Claude (Anthropic) for reasoning, diagnosis, and code fix generation
- Splunk Integration: MCP Server (JSON-RPC 2.0) for splunk_run_query, splunk_get_indexes, and data ingestion
- Frontend: React 18 + Vite + TailwindCSS with SSE for live agent activity streaming
- GitHub Integration: Full PR workflow—explores repos, analyzes code, generates fixes, creates branches and PRs
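For readers unfamiliar with MCP, a tool invocation travels as a JSON-RPC 2.0 request whose method is `tools/call`. The sketch below builds such a request for the `splunk_run_query` tool named above and unwraps the reply. The `query` argument name and the sample search string are assumptions; the Splunk MCP Server's actual tool schema should be read from its `tools/list` response.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

// Build an MCP tool call envelope. The caller supplies the request id.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Match the response to its request and surface JSON-RPC errors as exceptions.
function unwrapResult(req: ToolCallRequest, res: JsonRpcResponse): unknown {
  if (res.id !== req.id) {
    throw new Error(`response id ${res.id} does not match request id ${req.id}`);
  }
  if (res.error) {
    throw new Error(`JSON-RPC error ${res.error.code}: ${res.error.message}`);
  }
  return res.result;
}

// Example request; "query" is an assumed argument name.
const exampleRequest = buildToolCall(1, "splunk_run_query", {
  query: "search index=main error | head 5",
});
```

Serializing `exampleRequest` with `JSON.stringify` yields the body that would be sent over whichever transport the server exposes.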
The architecture follows a true agentic pattern: each agent has specific tools and capabilities, they communicate through a shared state graph, and the system maintains context across the entire incident lifecycle.
Challenges we ran into
Real-time Correlation: Merging findings from parallel agents while maintaining causality was tricky. We solved it with LangGraph's StateGraph and careful event sequencing.
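The idea behind the event sequencing can be illustrated without any framework. In this sketch, which is a plain-TypeScript stand-in and not LangGraph's API, every event gets a monotonically increasing sequence number when it is emitted, and a reducer-style merge restores that order no matter which agent reports first. The `AgentEvent` shape and both function names are hypothetical.

```typescript
interface AgentEvent { seq: number; agent: string; message: string; }

// One shared sequencer per incident stamps events in emission order.
function makeSequencer(): (agent: string, message: string) => AgentEvent {
  let next = 0;
  return (agent, message) => ({ seq: next++, agent, message });
}

// Reducer-style merge: combine existing events with an update, drop
// duplicates by sequence number, and return them in emission order.
function mergeEvents(current: AgentEvent[], update: AgentEvent[]): AgentEvent[] {
  const bySeq = new Map<number, AgentEvent>();
  for (const e of current.concat(update)) {
    bySeq.set(e.seq, e);
  }
  return Array.from(bySeq.values()).sort((a, b) => a.seq - b.seq);
}
```

Because the merge is idempotent and order-independent, parallel branches can deliver overlapping or out-of-order batches and the shared state still ends up with a single consistent timeline.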
Code Analysis at Scale: Finding the right file to fix in a repository required building intelligent file discovery—we analyze repo structure and match files to incident types (e.g., database files for connection pool issues).
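A simple version of this matching is keyword scoring over file paths. The sketch below assumes a hand-written hint table and hypothetical incident type names; the project's real discovery logic is not described in detail and is likely more involved.

```typescript
// Hypothetical mapping from incident type to path keywords.
const PATH_HINTS: Record<string, string[]> = {
  connection_pool: ["db", "database", "pool", "connection"],
  auth_failure: ["auth", "login", "session"],
};

// Rank repository paths by how many hint keywords they contain.
// Paths with no match are dropped; ties are broken alphabetically.
function rankFiles(paths: string[], incidentType: string, limit = 3): string[] {
  const hints = PATH_HINTS[incidentType] || [];
  return paths
    .map((path) => {
      const lower = path.toLowerCase();
      const score = hints.filter((h) => lower.indexOf(h) !== -1).length;
      return { path, score };
    })
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score || a.path.localeCompare(b.path))
    .slice(0, limit)
    .map((c) => c.path);
}
```

For a connection pool incident, a path such as `src/db/pool.ts` matches two keywords and ranks above `src/database.ts`, which matches one.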
Avoiding Alert Fatigue: The autonomous detection needed cooldown logic and deduplication to prevent flooding users with similar incidents.
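Cooldown and deduplication of this kind can be as small as a map from an incident fingerprint to the time it last fired. The following is a sketch under assumptions: the fingerprint fields and the class name are hypothetical, and the cooldown here is a fixed window measured from the last triggered incident.

```typescript
interface Anomaly { type: string; service: string; }

// Two anomalies with the same type and service count as the same incident.
function fingerprintOf(a: Anomaly): string {
  return `${a.type}:${a.service}`;
}

class IncidentDeduplicator {
  private cooldownMs: number;
  private lastTriggered = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if a new incident should be opened. Suppressed anomalies
  // do not extend the cooldown window.
  shouldTrigger(anomaly: Anomaly, nowMs: number): boolean {
    const key = fingerprintOf(anomaly);
    const last = this.lastTriggered.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastTriggered.set(key, nowMs);
    return true;
  }
}
```

Passing the current time in as a parameter, instead of reading the clock inside the method, keeps the logic easy to test.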
Accomplishments that we're proud of
- True autonomy: From Splunk data anomaly → AI analysis → GitHub PR creation with zero human intervention (until approval)
- Institutional memory: The system learns from past incidents stored in Splunk
- Production-grade architecture: Proper auth, encryption, multi-tenant support, SSE streaming
- Real Splunk MCP integration: Not mocked—actually queries live Splunk Cloud data
What's next for AegisOps
- Splunk SOAR integration for automated playbook execution
- Multi-cloud remediation (AWS, GCP, Azure) via MCP tools
- Predictive incident prevention using historical patterns
- Team collaboration features for incident war rooms
Built With
- api
- claude
- cloud
- express.js
- github
- langgraph
- mcp
- node.js
- react
- server
- server-sent
- splunk
- sqlite
- sse-events
- tailwindcss
- typescript
- vite
- websocket
- zod
