AegisOps

Inspiration

Every SRE and security engineer knows the 3 AM pager pain. An alert fires, you scramble to correlate logs across Splunk indexes, chase down the root cause, and manually apply fixes—often repeating the same process for recurring issues. We asked: What if AI agents could handle the entire incident lifecycle autonomously?

The Splunk MCP (Model Context Protocol) Server opened a new paradigm by giving AI agents direct, structured access to Splunk's observability and security data. We realized this could become the "nervous system" for a new kind of autonomous operations platform in which specialized agents collaborate in real time to detect, diagnose, and remediate incidents, with a human needed only to approve the final fix.

What it does

AegisOps is an autonomous multi-agent platform that unifies Security, Observability, and Platform Engineering into a single intelligent system:

Autonomous Detection: Continuously monitors Splunk for anomalies (latency spikes, error surges, security threats, auth attacks) and automatically triggers incident workflows
Parallel Agent Analysis:
- Healer Agent analyzes APM traces, latency patterns, and error logs
- Sentinel Agent cross-references firewall logs, threat indicators, and auth patterns
- Agents run in parallel and query Splunk MCP for real-time data
Intelligent Correlation: A Correlator synthesizes the agents' findings to determine whether the incident is an infrastructure issue, a security issue, or a mix of both, and assigns a confidence score informed by institutional memory
Auto-Remediation with Human Oversight:
- Generates WAF rules, network isolation, Edge Processor rules
- Analyzes actual code from GitHub and creates PRs with fixes
- Human approval gate before execution
Learning Loop: Every approved/rejected action feeds back into Splunk as "institutional memory," making future diagnoses faster and more accurate
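
The correlation step above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `AgentFinding` and `PastDecision` shapes, the 0.5 threshold, and the equal-weight blend of agent score and past approval rate are hypothetical and not taken from the AegisOps code.

```typescript
// Hypothetical data shapes; the write-up does not specify the real ones.
type Category = "infrastructure" | "security" | "mixed" | "unknown";

interface AgentFinding {
  agent: "healer" | "sentinel";
  summary: string;
  score: number; // assumed: agent's own 0..1 estimate that its domain is involved
}

interface PastDecision {
  category: Category;
  approved: boolean; // whether a human approved the remediation for that diagnosis
}

interface Correlation {
  category: Category;
  confidence: number;
}

const THRESHOLD = 0.5; // assumed cutoff for "this domain is involved"

function correlate(findings: AgentFinding[], memory: PastDecision[]): Correlation {
  // Strongest signal reported by a given agent (0 if it reported nothing).
  const best = (agent: AgentFinding["agent"]): number =>
    Math.max(0, ...findings.filter((f) => f.agent === agent).map((f) => f.score));

  const infra = best("healer");
  const sec = best("sentinel");

  let category: Category = "unknown";
  let base = 0;
  if (infra >= THRESHOLD && sec >= THRESHOLD) {
    category = "mixed";
    base = Math.min(infra, sec); // a mixed diagnosis is only as strong as its weaker half
  } else if (infra >= THRESHOLD) {
    category = "infrastructure";
    base = infra;
  } else if (sec >= THRESHOLD) {
    category = "security";
    base = sec;
  }
  if (category === "unknown") return { category, confidence: 0 };

  // Institutional memory: smoothed approval rate for this category,
  // so that an empty history yields a neutral 0.5.
  const relevant = memory.filter((m) => m.category === category);
  const approved = relevant.filter((m) => m.approved).length;
  const approvalRate = (approved + 1) / (relevant.length + 2);

  // Assumed blend: equal weight on the agents' signal and on past approvals.
  return { category, confidence: 0.5 * base + 0.5 * approvalRate };
}
```

With this scheme, a diagnosis category that humans have repeatedly rejected gradually loses confidence, which is one simple way the approve/reject feedback could influence later incidents.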

How we built it

Backend: Node.js + TypeScript + Express with WebSocket for real-time streaming
Agent Orchestration: LangGraph JS for parallel agent execution with state management
LLM: Claude (Anthropic) for reasoning, diagnosis, and code fix generation
Splunk Integration: MCP Server (JSON-RPC 2.0) for splunk_run_query, splunk_get_indexes, and data ingestion
Frontend: React 18 + Vite + TailwindCSS with SSE for live agent activity streaming
GitHub Integration: Full PR workflow—explores repos, analyzes code, generates fixes, creates branches and PRs
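
For readers unfamiliar with MCP over JSON-RPC 2.0, here is a rough sketch of what a tool call to `splunk_run_query` might look like on the wire. MCP invokes tools through the `tools/call` method with a tool `name` and an `arguments` object; the `query` argument key and the helper names `buildToolCall` and `extractText` are assumptions for illustration, not the Splunk MCP Server's documented schema.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface ToolCallResponse {
  jsonrpc: "2.0";
  id: number;
  result?: { content: { type: string; text?: string }[]; isError?: boolean };
  error?: { code: number; message: string };
}

let nextId = 1; // JSON-RPC ids let the caller match responses to requests

// Hypothetical helper: wraps a tool name and its arguments in a request envelope.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Hypothetical helper: returns the text payload or throws on either failure mode.
function extractText(res: ToolCallResponse): string {
  if (res.error) {
    // Protocol-level failure (unknown method, invalid params, ...).
    throw new Error(`JSON-RPC error ${res.error.code}: ${res.error.message}`);
  }
  if (!res.result || res.result.isError) {
    // The tool ran but reported a failure of its own.
    throw new Error("Tool call failed");
  }
  return res.result.content
    .filter((c) => c.type === "text")
    .map((c) => c.text ?? "")
    .join("\n");
}

// Example request; the argument key "query" is an assumption.
const request = buildToolCall("splunk_run_query", {
  query: "search index=main error | stats count by host",
});
```

The request object would then be serialized with `JSON.stringify` and sent over whatever transport the MCP server exposes.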

The architecture follows an agentic pattern: each agent has its own tools and capabilities, the agents communicate through a shared state graph, and the system maintains context across the entire incident lifecycle.
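
The shared-state pattern can be sketched without any framework. The example below deliberately uses plain `Promise.all` rather than the LangGraph JS API, and the `IncidentState` shape, the placeholder agents, and the append-style merge are all assumptions for illustration.

```typescript
// Hypothetical shared state passed between agents.
interface IncidentState {
  incidentId: string;
  findings: string[];
}

// An agent reads the current state and returns only the part it wants to update.
type Agent = (state: Readonly<IncidentState>) => Promise<Partial<IncidentState>>;

// Placeholder agents; real ones would query Splunk and call the LLM.
const healerAgent: Agent = async () => ({ findings: ["healer: placeholder finding"] });
const sentinelAgent: Agent = async () => ({ findings: ["sentinel: placeholder finding"] });

// Merge rule: findings are appended, every other key is overwritten.
function mergeState(state: IncidentState, update: Partial<IncidentState>): IncidentState {
  return {
    ...state,
    ...update,
    findings: [...state.findings, ...(update.findings ?? [])],
  };
}

// Run all agents concurrently against the same snapshot, then fold their updates in.
// Promise.all preserves input order, so the merged result does not depend on
// which agent happens to finish first.
async function runParallel(state: IncidentState, agents: Agent[]): Promise<IncidentState> {
  const updates = await Promise.all(agents.map((agent) => agent(state)));
  return updates.reduce(mergeState, state);
}
```

A caller would start from something like `{ incidentId: "inc-1", findings: [] }` and pass `[healerAgent, sentinelAgent]`; a downstream correlator step would then read the merged `findings`.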

Challenges we ran into

Real-time Correlation: Merging findings from parallel agents while maintaining causality was tricky. We solved it with LangGraph's StateGraph and careful event sequencing.
Code Analysis at Scale: Finding the right file to fix in a repository required building intelligent file discovery—we analyze repo structure and match files to incident types (e.g., database files for connection pool issues).
Avoiding Alert Fatigue: The autonomous detection needed cooldown logic and deduplication to prevent flooding users with similar incidents.
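
As an illustration of the cooldown and deduplication idea, here is a minimal sketch. The `Anomaly` shape, the `service:kind` deduplication key, and the `IncidentGate` class are hypothetical; the write-up does not describe how AegisOps actually keys or times its cooldowns.

```typescript
// Hypothetical anomaly shape emitted by the detection loop.
interface Anomaly {
  service: string;
  kind: string; // e.g. "latency_spike", "auth_attack"
  timestampMs: number;
}

class IncidentGate {
  private readonly cooldownMs: number;
  // Last time an incident was actually triggered, per deduplication key.
  private readonly lastFired = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if this anomaly should open a new incident workflow.
  shouldTrigger(anomaly: Anomaly): boolean {
    // Anomalies with the same service and kind are treated as duplicates.
    const key = `${anomaly.service}:${anomaly.kind}`;
    const last = this.lastFired.get(key);
    if (last !== undefined && anomaly.timestampMs - last < this.cooldownMs) {
      return false; // still cooling down: suppress
    }
    this.lastFired.set(key, anomaly.timestampMs);
    return true;
  }
}
```

Note that suppressed anomalies do not reset the timer here, so a continuous stream of duplicates still produces one incident per cooldown window instead of being silenced indefinitely.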

Accomplishments that we're proud of

True autonomy: From Splunk data anomaly → AI analysis → GitHub PR creation with no human intervention until the approval step
Institutional memory: The system learns from past incidents and approval decisions stored in Splunk
Production-grade architecture: Proper auth, encryption, multi-tenant support, SSE streaming
Real Splunk MCP integration: Not mocked—actually queries live Splunk Cloud data