Autonomous Incident Commander

Inspiration

Modern systems generate an overwhelming amount of operational data. When a critical incident occurs, engineers often spend valuable time switching between dashboards, searching logs, correlating events, building timelines, and manually identifying root causes before they can even begin fixing the problem.

The irony is that organizations have more observability data than ever before, yet incident investigations remain largely manual.

We asked ourselves a simple question:

What if the machine could become the incident commander?

Instead of building another chatbot that answers questions about infrastructure, we wanted to build a system that actively investigates incidents, reasons through evidence, identifies root causes, and generates actionable reports without requiring a human to drive the process.

That idea became Autonomous Incident Commander.

What It Does

Autonomous Incident Commander is an AI-powered incident response platform built on Splunk that transforms raw operational data into complete incident investigations.

When an alert is triggered, the platform automatically:

* Collects logs, events, and telemetry from Splunk
* Correlates data across multiple services
* Constructs a chronological incident timeline
* Identifies the most probable root cause
* Calculates a confidence score
* Generates remediation recommendations
* Produces an executive-ready incident report

Instead of asking an engineer to manually investigate an outage, the system performs the investigation autonomously and presents the findings in minutes.

The intended result is a lower Mean Time To Resolution (MTTR), because the investigation phase no longer waits on an engineer to gather and correlate evidence by hand.

How We Built It

The platform uses a multi-agent architecture in which each AI agent has a single, dedicated responsibility.

Investigation Agent

Retrieves incident-related logs, events, and telemetry from Splunk using Splunk MCP integrations.
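
As a rough illustration of the retrieval step, the sketch below builds a time-bounded SPL query and the form parameters for Splunk's REST export endpoint (`/services/search/jobs/export`). The project itself goes through Splunk MCP, whose tool interface is not shown in this write-up, so this only illustrates the shape of the query. The index name, the `service` and `log_level` field names, and the helper names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchWindow:
    """Time range for an investigation, in Splunk relative-time syntax."""
    earliest: str = "-15m"
    latest: str = "now"


def build_spl(index: str, services: list[str], min_level: str = "ERROR") -> str:
    """Build an SPL query for events at one log level from the given services.

    The field names `service` and `log_level` are assumptions about how the
    data is indexed; adjust them to the actual sourcetype.
    """
    if not services:
        raise ValueError("at least one service is required")
    service_filter = " OR ".join(f'service="{s}"' for s in services)
    return (
        f'search index="{index}" ({service_filter}) log_level="{min_level}" '
        "| sort 0 _time "
        "| table _time service log_level message"
    )


def build_export_params(spl: str, window: SearchWindow) -> dict[str, str]:
    """Form parameters for POST /services/search/jobs/export."""
    return {
        "search": spl,
        "earliest_time": window.earliest,
        "latest_time": window.latest,
        "output_mode": "json",
    }
```

The parameters would be sent as the form body of an authenticated POST request. Keeping query construction separate from the network call makes the query easy to inspect and test.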

Timeline Agent

Analyzes the collected events and reconstructs the sequence of failures that led to the incident.
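
The deterministic core of timeline reconstruction can be sketched as sorting events by timestamp and labeling each with its offset from the first event. The `Event` fields and the `build_timeline` name are hypothetical, and the real agent presumably adds LLM-written narrative on top of an ordering like this.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    service: str
    message: str


def build_timeline(events: list[Event]) -> list[str]:
    """Return one line per event, ordered by time, with offsets from the first event."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    lines = []
    for event in ordered:
        offset = (event.timestamp - start).total_seconds()
        lines.append(f"T+{offset:.0f}s [{event.service}] {event.message}")
    return lines
```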

Root Cause Agent

Reasons over the evidence and identifies the most likely source of failure while generating supporting explanations.
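
The write-up does not describe how the agent scores candidates, so the following is not the project's method. It is a simple baseline heuristic that could be used to sanity-check an LLM's answer: the service whose first error appears earliest ranks highest, and scores are normalized so they can be read as a rough confidence. The function name and the scoring formula are assumptions.

```python
def rank_root_causes(errors: list[tuple[float, str]]) -> list[tuple[str, float]]:
    """Rank services as root-cause candidates.

    `errors` holds (seconds_since_incident_start, service) pairs. A service
    scores higher the earlier its first error appears. Scores are normalized
    to sum to 1, so each can be read as a rough confidence.
    """
    first_seen: dict[str, float] = {}
    for offset, service in errors:
        if offset < 0:
            raise ValueError("offsets must be non-negative")
        if service not in first_seen or offset < first_seen[service]:
            first_seen[service] = offset
    if not first_seen:
        return []
    # Earlier first error -> larger raw score (hypothetical weighting).
    raw = {service: 1.0 / (1.0 + offset) for service, offset in first_seen.items()}
    total = sum(raw.values())
    ranked = [(service, score / total) for service, score in raw.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

A heuristic like this ignores causality and only reflects ordering, which is one reason a reasoning step over the actual log content is still needed.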

Severity Agent

Determines incident criticality based on impact, affected services, and system behavior.
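
A rule-based version of this classification might look like the sketch below. The severity levels, the inputs, and every threshold are illustrative assumptions; the write-up does not state the criteria the agent actually applies.

```python
from enum import Enum


class Severity(str, Enum):
    SEV1 = "SEV1"  # widespread customer-facing outage
    SEV2 = "SEV2"  # significant degradation
    SEV3 = "SEV3"  # limited impact


def classify_severity(affected_services: int, error_rate: float, customer_facing: bool) -> Severity:
    """Map impact signals to a severity level using hypothetical thresholds."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if customer_facing and (error_rate >= 0.5 or affected_services >= 3):
        return Severity.SEV1
    if customer_facing or error_rate >= 0.2 or affected_services >= 2:
        return Severity.SEV2
    return Severity.SEV3
```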

Remediation Agent

Generates immediate, short-term, and long-term recommendations to resolve and prevent the issue.

Report Agent

Synthesizes all findings into a structured incident report that can be shared with engineering and leadership teams.
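
One way to picture how the agents above hand work to each other is a sequential pipeline over a shared context, sketched below. The stub functions stand in for the LLM-backed agents and return canned data. The names, the context keys, and the strictly sequential ordering are assumptions rather than the project's actual orchestration.

```python
from typing import Any, Callable

Context = dict[str, Any]
Agent = Callable[[Context], Context]


def run_pipeline(alert: Context, agents: list[tuple[str, Agent]]) -> Context:
    """Run agents in order; each sees the accumulated context and returns new keys."""
    context: Context = {"alert": alert}
    completed: list[str] = []
    for name, agent in agents:
        context.update(agent(context))
        completed.append(name)
    context["completed"] = completed
    return context


# Placeholder agents with canned outputs, for illustration only.
def investigation_agent(ctx: Context) -> Context:
    return {"events": ["db timeout", "api 500"]}


def timeline_agent(ctx: Context) -> Context:
    return {"timeline": list(ctx["events"])}


def report_agent(ctx: Context) -> Context:
    return {"report": f"{len(ctx['timeline'])} events analyzed"}
```

Calling `run_pipeline({"id": "INC-1"}, [("investigation", investigation_agent), ("timeline", timeline_agent), ("report", report_agent)])` returns a context that contains the outputs of all three stages plus the list of completed agent names.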

Technology Stack

* Splunk Enterprise
* Splunk MCP Server
* Anthropic Claude
* FastAPI
* Python
* Next.js
* TypeScript
* TailwindCSS
* PostgreSQL
* Server-Sent Events (SSE)

The frontend streams investigation progress in real time, allowing users to watch each agent contribute to the investigation as it unfolds.
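
The streaming format itself is simple. The sketch below encodes agent progress as Server-Sent Events frames using only the standard library. The `agent_update` and `done` event names and the payload shape are hypothetical.

```python
import json
from typing import Iterable, Iterator


def format_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, a single-line JSON payload, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_progress(steps: Iterable[tuple[str, str]]) -> Iterator[str]:
    """Yield one frame per (agent, status) update, then a final `done` frame."""
    for agent, status in steps:
        yield format_sse("agent_update", {"agent": agent, "status": status})
    yield format_sse("done", {})
```

In a FastAPI backend, a generator like this could be wrapped in `StreamingResponse(..., media_type="text/event-stream")` and consumed in the browser with `EventSource`. How the project actually wires this up is not described here.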

Challenges We Ran Into

Turning AI Into an Investigator Instead of a Chatbot

Many AI systems are designed to answer questions when prompted. Our challenge was building agents that proactively investigate incidents and collaborate with one another without human intervention.

Correlating Events Across Services

Real incidents rarely originate from a single log line. We had to design workflows that connect failures across databases, APIs, payment systems, and application services to reconstruct the complete story.
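
A minimal form of cross-service correlation is grouping events that occur close together in time, regardless of which service emitted them. The sketch below does only that. Real correlation would likely also use shared identifiers such as trace or request IDs, and the 5-second gap is an arbitrary assumption.

```python
def correlate_by_time(
    events: list[tuple[float, str]], max_gap: float = 5.0
) -> list[list[tuple[float, str]]]:
    """Group (timestamp_seconds, service) events into bursts.

    A new burst starts whenever the gap since the previous event exceeds max_gap.
    """
    clusters: list[list[tuple[float, str]]] = []
    for event in sorted(events):
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters
```

A burst that spans several services, for example a database error followed seconds later by API and payment failures, is a candidate for a single incident story.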

Structured Agent Collaboration

Each agent produces structured outputs that are passed to downstream agents. Designing reliable interfaces between agents was critical to maintaining consistency and reducing hallucinations.
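
An interface between agents can be enforced by parsing each agent's output into a typed record and rejecting anything malformed before it reaches the next agent. The sketch below uses standard-library dataclasses to stay dependency-free; the project lists pydantic, which provides this kind of validation declaratively. The `RootCauseFinding` schema and its field names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RootCauseFinding:
    service: str
    summary: str
    confidence: float
    evidence_ids: tuple[str, ...]

    def __post_init__(self) -> None:
        if not self.service:
            raise ValueError("service must be non-empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.evidence_ids:
            raise ValueError("at least one evidence id is required")


def parse_finding(raw_json: str) -> RootCauseFinding:
    """Parse and validate an agent's JSON output; raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw_json)  # invalid JSON raises a ValueError subclass
        return RootCauseFinding(
            service=str(payload["service"]),
            summary=str(payload["summary"]),
            confidence=float(payload["confidence"]),
            evidence_ids=tuple(payload["evidence_ids"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed agent output: {exc}") from exc
```

Failing fast at the boundary means a downstream agent never reasons over a half-formed finding, which is one practical way to limit compounding errors.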

Creating Explainable Results

Engineers need evidence, not just conclusions. Every root cause determination had to be backed by actual logs and observable events so users could trust the system's findings.
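
One simple guard for this requirement is checking that every event a conclusion cites was actually retrieved from the log store. The sketch below shows that check. The function names and the idea of string event IDs are assumptions, not details taken from the project.

```python
def unsupported_evidence(cited_ids: list[str], retrieved_ids: set[str]) -> list[str]:
    """Return cited event IDs that were never retrieved, in citation order."""
    return [event_id for event_id in cited_ids if event_id not in retrieved_ids]


def is_grounded(cited_ids: list[str], retrieved_ids: set[str]) -> bool:
    """A finding is grounded if it cites at least one event and all citations exist."""
    return bool(cited_ids) and not unsupported_evidence(cited_ids, retrieved_ids)
```

A finding that fails this check could be sent back to the agent for revision or shown to the user with a warning instead of being presented as a conclusion.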

Real-Time User Experience

We wanted the investigation to feel alive. Streaming logs, timelines, reasoning, and reports in real time required careful orchestration between the backend agents and frontend interface.

What We Learned

Building Autonomous Incident Commander taught us that the future of observability is not more dashboards but systems that can reason about operational data autonomously.

We learned how to:

* Build multi-agent AI workflows
* Integrate AI systems with operational telemetry
* Design explainable AI outputs for engineers
* Orchestrate real-time streaming experiences
* Use Splunk MCP as a bridge between AI agents and observability data

Most importantly, we learned that AI becomes significantly more powerful when it is given a clear operational role rather than functioning as a general-purpose assistant.

Why It Matters

Every minute spent investigating an outage can translate into lost revenue, degraded customer experience, and increased operational stress.

Traditional incident response follows a manual process:

Alert → Investigation → Root Cause Analysis → Reporting

Autonomous Incident Commander transforms that workflow into:

Alert → Autonomous Investigation → Actionable Resolution

By allowing AI agents to perform the investigative work, organizations can focus on solving problems rather than searching for them.

What's Next

Our vision extends beyond incident investigation.

Future versions could:

* Automatically create Jira tickets
* Trigger Slack and PagerDuty workflows
* Execute approved remediation actions
* Detect anomalies before alerts fire
* Learn from historical incidents
* Coordinate multiple AI agents across security and observability domains

Ultimately, we envision a future where operational systems are capable of understanding, explaining, and responding to incidents with minimal human intervention.

Autonomous Incident Commander is a step toward that future.

Built With

ai-agents
anthropic-claude
fastapi
incident-management
incident-response-automation
log-analysis
multi-agent-ai
next.js
observability
postgresql
pydantic
python
react
real-time-streaming
root-cause-analysis
server-sent-events
splunk-enterprise
splunk-mcp-server
splunk-rest-api
sqlalchemy
tailwindcss
telemetry-correlation
typescript

Submitted to

Splunk Agentic Ops Hackathon

Created by

I designed and developed the entire Autonomous Incident Commander platform from concept to implementation.

My contributions included:

* Defining the product vision and overall system architecture
* Designing the multi-agent workflow for incident investigation
* Building the backend APIs using FastAPI and Python
* Implementing Splunk integrations and data ingestion workflows
* Developing AI agents for investigation, timeline generation, root cause analysis, severity classification, remediation planning, and report generation
* Building the frontend dashboard using Next.js, React, TypeScript, and TailwindCSS
* Creating the real-time investigation experience with streaming updates
* Designing the incident simulator and realistic log generation system
* Implementing PDF report generation and executive summaries
* Creating the demo environment, documentation, architecture diagrams, and project presentation materials

The project was developed end-to-end by me, including ideation, architecture, engineering, testing, and demo preparation.

Abhishek Jha

Updates

Abhishek Jha started this project — Jun 15, 2026 06:17 AM EDT