WebPulse: Autonomous AI Incident War Room

Inspiration

Modern digital infrastructure is increasingly dependent on distributed systems, microservices, and always-on platforms. Yet when failures occur, engineering teams are often overwhelmed by fragmented alerts, scattered logs, delayed root cause identification, and costly downtime.

WebPulse was inspired by the need to turn traditional, reactive incident management into an intelligent, autonomous system that acts like a real-time AI Site Reliability Engineer (SRE). Our goal was to build a platform that does more than detect incidents: it investigates, correlates, explains, and recommends solutions as the incident unfolds.

We wanted to create a system capable of answering critical operational questions automatically:

What failed?
Why did it fail?
Which service is responsible?
How severe is the blast radius?
What actions should be taken immediately?

What We Built

WebPulse is an AI-powered Incident War Room platform. Its capabilities and the technologies behind them are listed below.

Core Capabilities:

Real-time anomaly detection
Cross-service failure correlation
AI-driven root cause analysis
Automated incident timelines
Recommended remediation strategies
Website and infrastructure health scanning
Multi-agent operational modes:
- Incident analysis
- Monitoring
- ChatOps
- Summary reporting
- Data collection

Key Technologies:

FastAPI for orchestration
Python for modular intelligence pipelines
Mistral/Ollama LLM for AI root cause reasoning
Ngrok for public deployment
Custom correlation engine
Fallback anomaly detection
Swagger UI for testing and demonstration

How We Built It

WebPulse was architected as a modular multi-layer AI system:

Metrics / Logs / Traces
        ↓
Anomaly Detection Engine
        ↓
Correlation Engine
        ↓
AI Root Cause Analysis
        ↓
Recovery Plan + Incident Report
        ↓
Dashboard / API

System Components:

1. Anomaly Detection

We built threshold and signal-based anomaly detection capable of identifying:

High latency
Elevated error rates
Timeout failures
JWT authentication failures
Database degradation
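
A minimal sketch of threshold-based detection over a single metrics sample. The signal names (`latency_ms`, `error_rate`, `timeouts`, `jwt_failures`, `db_query_ms`) and every threshold value are illustrative assumptions, not WebPulse's actual configuration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per service.
THRESHOLDS = {
    "latency_ms": 500.0,   # high latency
    "error_rate": 0.05,    # elevated error rate (fraction of requests)
    "timeouts": 3,         # timeout failures per window
    "jwt_failures": 5,     # JWT authentication failures per window
    "db_query_ms": 200.0,  # database degradation
}


@dataclass
class Anomaly:
    service: str
    signal: str
    value: float
    threshold: float


def detect_anomalies(service: str, sample: dict) -> list[Anomaly]:
    """Return one Anomaly per signal whose value exceeds its threshold."""
    anomalies = []
    for signal, threshold in THRESHOLDS.items():
        value = sample.get(signal)  # signals missing from the sample are skipped
        if value is not None and value > threshold:
            anomalies.append(Anomaly(service, signal, value, threshold))
    return anomalies


sample = {"latency_ms": 820.0, "error_rate": 0.02, "timeouts": 4}
found = detect_anomalies("checkout", sample)
print([a.signal for a in found])  # ['latency_ms', 'timeouts']
```

Keeping the thresholds in one table makes it easy to add a signal without touching the detection loop.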

2. Correlation Engine

This engine maps service dependencies and determines probable root services by analyzing:

Logs
Trace summaries
Failure propagation patterns
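
One simple way to approximate root-service selection from a dependency map is sketched below. The `DEPENDENCIES` table, the service names, and the heuristic itself (a failing service whose dependencies are all healthy is a likely origin) are assumptions for illustration; the actual engine also weighs logs and trace summaries.

```python
# Hypothetical dependency map: each service lists the services it calls.
DEPENDENCIES = {
    "gateway": ["auth", "checkout"],
    "checkout": ["payments", "database"],
    "auth": ["database"],
    "payments": [],
    "database": [],
}


def probable_root_services(failing: set[str]) -> list[str]:
    """Return failing services none of whose dependencies are failing.

    Failures tend to propagate from a dependency up to its callers, so a
    failing service with only healthy dependencies is a likely origin.
    """
    roots = []
    for service in sorted(failing):
        deps = DEPENDENCIES.get(service, [])
        if not any(dep in failing for dep in deps):
            roots.append(service)
    return roots


print(probable_root_services({"gateway", "checkout", "database"}))  # ['database']
```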

3. AI Incident Brain

Using structured prompts with LLMs, WebPulse generates:

Root cause hypotheses
Confidence scores
Severity levels
Blast radius estimates
Recommended fixes
Recovery plans
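
A structured prompt for this step might look like the sketch below. The key names in `RESPONSE_FIELDS` mirror the outputs listed above, but the exact wording, the key spelling, and the input shape are assumptions, not the prompt WebPulse ships.

```python
import json

# Hypothetical output keys, one per item the incident brain is expected to produce.
RESPONSE_FIELDS = [
    "root_cause", "confidence", "severity",
    "blast_radius", "recommended_fixes", "recovery_plan",
]


def build_incident_prompt(anomalies: list[dict], root_candidates: list[str]) -> str:
    """Build a prompt that asks the model for a single JSON object."""
    context = json.dumps(
        {"anomalies": anomalies, "root_candidates": root_candidates},
        indent=2,
    )
    fields = ", ".join(f'"{name}"' for name in RESPONSE_FIELDS)
    return (
        "You are an SRE assistant analyzing a production incident.\n"
        f"Incident context:\n{context}\n\n"
        f"Reply with a single JSON object containing exactly these keys: {fields}.\n"
        "Do not add any text outside the JSON object."
    )


prompt = build_incident_prompt(
    [{"service": "checkout", "signal": "latency_ms", "value": 820}],
    ["database"],
)
print(prompt)
```

Asking for a fixed set of keys keeps the downstream parsing step simple, although the model can still ignore the instruction.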

4. Website Monitoring Upgrade

To make WebPulse proactive, we introduced website URL scanning:

Availability
Latency
HTTP degradation
Infrastructure recommendations
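
The scan can be sketched with the standard library alone. The health labels, the 1000 ms slowness cutoff, and the function names are hypothetical; the real `/scan-website` handler may check more than a single GET request.

```python
import time
import urllib.request
import urllib.error


def classify_health(status, latency_ms, slow_ms=1000.0):
    """Map an HTTP status (None if unreachable) and latency to a health label."""
    if status is None:
        return "down"
    if status >= 500:
        return "degraded"
    if status >= 400:
        return "client_error"
    if latency_ms > slow_ms:
        return "slow"
    return "healthy"


def scan_website(url, timeout=5.0):
    """Fetch the URL once and report status, latency, and a health label."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (OSError, ValueError):
        status = None      # unreachable, timed out, or malformed URL
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "url": url,
        "status": status,
        "latency_ms": round(latency_ms, 1),
        "health": classify_health(status, latency_ms),
    }


# Example call (requires network access):
# print(scan_website("https://example.com"))
```

Separating `classify_health` from the network call keeps the classification rules testable without hitting a live site.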

5. Multi-Agent Architecture

WebPulse supports multiple operational modes through a unified API:

/analyze
/agent
/scan-website
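
One way to serve several operational modes from a single entry point such as `/agent` is a dispatch table. The sketch below is framework-free; the mode names, handler functions, and payload shape are hypothetical, and in a FastAPI app `dispatch_agent` would be called from the route handler.

```python
# Placeholder handlers; each would call into the corresponding pipeline.
def run_incident_analysis(payload):
    return {"mode": "incident", "input_keys": sorted(payload)}


def run_monitoring(payload):
    return {"mode": "monitoring", "input_keys": sorted(payload)}


def run_chatops(payload):
    return {"mode": "chatops", "input_keys": sorted(payload)}


def run_summary(payload):
    return {"mode": "summary", "input_keys": sorted(payload)}


def run_data_collection(payload):
    return {"mode": "data_collection", "input_keys": sorted(payload)}


# Hypothetical mode names mapped to their handlers.
MODE_HANDLERS = {
    "incident": run_incident_analysis,
    "monitoring": run_monitoring,
    "chatops": run_chatops,
    "summary": run_summary,
    "data_collection": run_data_collection,
}


def dispatch_agent(mode: str, payload: dict) -> dict:
    """Route a request to the handler for its mode, or report the valid modes."""
    handler = MODE_HANDLERS.get(mode)
    if handler is None:
        return {"error": f"unknown mode: {mode}", "supported": sorted(MODE_HANDLERS)}
    return handler(payload)


print(dispatch_agent("chatops", {"message": "status of checkout?"}))
print(dispatch_agent("unknown", {}))
```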

Challenges We Faced

Integration Complexity

Working with multiple independently developed modules introduced:

Import conflicts
Runtime failures
Unstable localhost dependencies
Broken external services

Solution:

We engineered robust fallback systems to maintain stability even when teammate modules failed.

AI Output Reliability

LLM responses were inconsistent and sometimes malformed.

Solution:

We implemented:

Robust JSON extraction
Response normalization
Smart override logic
Confidence scoring
Fail-safe root cause routing
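
JSON extraction and normalization can be sketched as follows. The `DEFAULTS` keys and fallback values are assumptions chosen for illustration; the technique is to scan for the first parseable JSON object in the reply and fall back to safe defaults when none is found.

```python
import json

# Hypothetical fallback values used when the model omits or mangles a field.
DEFAULTS = {"root_cause": "unknown", "confidence": 0.3, "severity": "low"}


def extract_json(text: str):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)  # try the next opening brace
    return None


def normalize_response(text: str) -> dict:
    """Merge whatever the model returned over the defaults and clamp confidence."""
    parsed = extract_json(text) or {}
    result = dict(DEFAULTS)
    result.update({k: v for k, v in parsed.items() if k in DEFAULTS and v is not None})
    try:
        confidence = float(result["confidence"])
    except (TypeError, ValueError):
        confidence = DEFAULTS["confidence"]
    result["confidence"] = min(max(confidence, 0.0), 1.0)
    return result


raw = 'Sure, here is the analysis: {"root_cause": "database", "confidence": "0.9"} Hope that helps.'
print(normalize_response(raw))
# {'root_cause': 'database', 'confidence': 0.9, 'severity': 'low'}
```

Because every field has a default, a completely malformed reply still yields a usable, low-confidence result instead of an exception.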

Deployment Stability

Public accessibility was critical for team integration and judging.

Solution:

Ngrok public tunnels
Safe deployment orchestration
Dedicated fallback pathways

What We Learned

This project taught us that building effective AI systems is not only about intelligence but also about orchestration, reliability, and product execution.

Major Learnings:

Multi-agent system architecture
Fault-tolerant backend engineering
AI prompt design for operational reasoning
Distributed systems diagnostics
API integration at scale
Public deployment workflows
Product-focused hackathon execution

Mathematical Perspective

We modeled incident confidence scoring as:

$$ ConfidenceScore = \begin{cases} 0.9, & \text{High Confidence} \\ 0.6, & \text{Medium Confidence} \\ 0.3, & \text{Low Confidence} \end{cases} $$

Anomaly severity was approximated through:

$$ Severity = f(Latency, ErrorRate, LogSignals) $$

Here, high latency, elevated error rates, and critical log signals each increase the severity, and with it the incident priority and remediation urgency.
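
The two formulas above can be sketched in code. The confidence values come straight from the piecewise definition; the form of `f` is not specified in this write-up, so the weighted sum, the weights, and the normalization caps below are purely illustrative assumptions.

```python
# Confidence levels as given in the piecewise definition.
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}


def severity_score(latency_ms: float, error_rate: float, critical_logs: int) -> float:
    """Hypothetical f: weighted sum of three signals, each capped to [0, 1]."""
    latency_part = min(latency_ms / 2000.0, 1.0)  # assumed cap: 2 s latency
    error_part = min(error_rate / 0.25, 1.0)      # assumed cap: 25% errors
    log_part = min(critical_logs / 10.0, 1.0)     # assumed cap: 10 critical logs
    return round(0.4 * latency_part + 0.4 * error_part + 0.2 * log_part, 3)


print(CONFIDENCE_SCORES["medium"])    # 0.6
print(severity_score(0, 0.0, 0))      # 0.0
print(severity_score(2000, 0.25, 10)) # 1.0
```

Any function that rises with each of the three inputs would satisfy the description; a capped weighted sum is simply the easiest to reason about.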

Final Outcome

WebPulse evolved from a simple incident analyzer into a scalable autonomous infrastructure intelligence platform.

It demonstrates:

AI orchestration
Autonomous incident response
Real-time fault diagnosis
Website health intelligence
Operational resilience

Future Vision

WebPulse can be extended into:

Kubernetes observability
Cloud infrastructure monitoring
Enterprise DevOps tooling
Predictive failure prevention
Self-healing systems

Built With

fastapi
ngrok
python
vite

Updates

barkathnisha27 Nisha started this project — Apr 26, 2026 02:29 AM EDT
