WebPulse: Autonomous AI Incident War Room

Inspiration

Modern digital infrastructure is increasingly dependent on distributed systems, microservices, and always-on platforms. Yet when failures occur, engineering teams are often overwhelmed by fragmented alerts, scattered logs, delayed root cause identification, and costly downtime.

WebPulse was inspired by the need to turn traditional, reactive incident management into an intelligent, autonomous system that acts like a real-time AI Site Reliability Engineer (SRE). Our goal was to build a platform that goes beyond detecting incidents: it investigates them, correlates signals across services, explains the likely cause, and recommends solutions in real time.

We wanted to create a system capable of answering critical operational questions automatically:

  • What failed?
  • Why did it fail?
  • Which service is responsible?
  • How severe is the blast radius?
  • What actions should be taken immediately?

What We Built

WebPulse is an AI-powered Incident War Room platform.

Core Capabilities:

  • Real-time anomaly detection
  • Cross-service failure correlation
  • AI-driven root cause analysis
  • Automated incident timelines
  • Recommended remediation strategies
  • Website and infrastructure health scanning
  • Multi-agent operational modes:

    • Incident analysis
    • Monitoring
    • ChatOps
    • Summary reporting
    • Data collection

Key Technologies:

  • FastAPI for orchestration
  • Python for modular intelligence pipelines
  • Mistral/Ollama LLM for AI root cause reasoning
  • ngrok tunnels for public access to the running service
  • Custom correlation engine
  • Fallback anomaly detection
  • Swagger UI for testing and demonstration

How We Built It

WebPulse was architected as a modular multi-layer AI system:

Metrics / Logs / Traces
        ↓
Anomaly Detection Engine
        ↓
Correlation Engine
        ↓
AI Root Cause Analysis
        ↓
Recovery Plan + Incident Report
        ↓
Dashboard / API

System Components:

1. Anomaly Detection

We built threshold-based and signal-based anomaly detection capable of identifying:

  • High latency
  • Elevated error rates
  • Timeout failures
  • JWT authentication failures
  • Database degradation
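
A minimal sketch of how threshold-based and signal-based checks of this kind might look. The thresholds, the `ServiceSample` type, and the log keywords are illustrative assumptions, not values taken from WebPulse.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the actual WebPulse values are not stated.
LATENCY_MS_LIMIT = 800.0
ERROR_RATE_LIMIT = 0.05

# Illustrative log keywords mapped to anomaly labels.
LOG_SIGNALS = {
    "timeout": "timeout_failure",
    "jwt": "jwt_auth_failure",
    "database": "database_degradation",
}


@dataclass
class ServiceSample:
    service: str
    latency_ms: float
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    log_lines: list[str]


def detect_anomalies(sample: ServiceSample) -> list[str]:
    """Return the anomaly labels triggered by one sample."""
    found = []
    # Threshold checks on numeric metrics.
    if sample.latency_ms > LATENCY_MS_LIMIT:
        found.append("high_latency")
    if sample.error_rate > ERROR_RATE_LIMIT:
        found.append("elevated_error_rate")
    # Signal checks: case-insensitive keyword match over the log lines.
    for needle, label in LOG_SIGNALS.items():
        if any(needle in line.lower() for line in sample.log_lines):
            found.append(label)
    return found
```

For example, a sample with 1200 ms latency and a log line mentioning a JWT signature error would be labeled with both `high_latency` and `jwt_auth_failure`.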

2. Correlation Engine

This engine maps service dependencies and determines probable root services by analyzing:

  • Logs
  • Trace summaries
  • Failure propagation patterns
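
One simple way to pick probable root services from a dependency map is to keep only the failing services whose own dependencies are all healthy. The sketch below assumes a hypothetical `dependencies` mapping (service to the services it calls) and a set of failing service names; it is not WebPulse's actual correlation logic.

```python
def probable_roots(dependencies: dict[str, list[str]], failing: set[str]) -> list[str]:
    """Return failing services that have no failing direct dependency.

    A failing service whose dependency is also failing is more likely a
    victim of propagation than the origin, so it is filtered out.
    """
    roots = []
    for service in sorted(failing):
        deps = dependencies.get(service, [])
        if not any(dep in failing for dep in deps):
            roots.append(service)
    return roots


# Hypothetical topology: gateway -> auth, orders; orders -> db
deps = {"gateway": ["auth", "orders"], "orders": ["db"], "auth": [], "db": []}
print(probable_roots(deps, {"gateway", "orders", "db"}))
```

This only inspects direct dependencies, so a gap in the failing set (an intermediate service that was not reported) can yield more than one candidate root.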

3. AI Incident Brain

Using structured prompts with LLMs, WebPulse generates:

  • Root cause hypotheses
  • Confidence scores
  • Severity levels
  • Blast radius estimates
  • Recommended fixes
  • Recovery plans
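
As an illustration of what a structured prompt could look like, the sketch below asks the model for a single JSON object with one key per output listed above. The field names, allowed values, and wording are assumptions made for this example; the prompts WebPulse actually uses are not shown here.

```python
import json

# Hypothetical output schema mirroring the outputs listed above.
REPORT_FIELDS = {
    "root_cause": "string",
    "confidence": "high | medium | low",
    "severity": "critical | major | minor",
    "blast_radius": "list of affected services",
    "recommended_fixes": "list of strings",
    "recovery_plan": "ordered list of steps",
}


def build_prompt(anomalies: list[str], suspected_service: str) -> str:
    """Assemble a prompt that pins the model to a fixed JSON shape."""
    schema = json.dumps(REPORT_FIELDS, indent=2)
    return (
        "You are an SRE assistant analyzing a production incident.\n"
        f"Suspected root service: {suspected_service}\n"
        f"Detected anomalies: {', '.join(anomalies)}\n"
        "Reply with a single JSON object and nothing else, using this shape:\n"
        f"{schema}\n"
    )
```

Stating the expected shape in the prompt does not guarantee well-formed output, which is why a parsing and normalization step is still needed downstream.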

4. Website Monitoring Upgrade

To make WebPulse proactive, we introduced website URL scanning, which reports on:

  • Availability
  • Latency
  • HTTP degradation
  • Infrastructure recommendations
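
A rough sketch of a URL health check using only the Python standard library. The `scan_website` name, the verdict labels, and the 1000 ms slowness threshold are assumptions for illustration; the fetch step is injectable so the classification logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; keep the code instead


def scan_website(url: str,
                 fetch: Callable[[str], int] = fetch_status,
                 slow_ms: float = 1000.0) -> dict:
    """Classify a site as healthy, slow, degraded, or down."""
    started = time.perf_counter()
    try:
        status = fetch(url)
    except OSError:  # covers URLError, connection errors, and timeouts
        return {"url": url, "available": False, "status": None,
                "latency_ms": None, "verdict": "down"}
    latency_ms = (time.perf_counter() - started) * 1000.0
    if status >= 400:
        verdict = "degraded"   # server answered, but with an error status
    elif latency_ms > slow_ms:
        verdict = "slow"
    else:
        verdict = "healthy"
    return {"url": url, "available": True, "status": status,
            "latency_ms": round(latency_ms, 1), "verdict": verdict}
```

Calling `scan_website("https://example.com")` performs a real request; passing a stub such as `fetch=lambda url: 503` exercises the degraded path offline.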

5. Multi-Agent Architecture

WebPulse supports multiple operational modes through a unified API:

  • /analyze
  • /agent
  • /scan-website

Challenges We Faced

Integration Complexity

Working with multiple independently developed modules introduced:

  • Import conflicts
  • Runtime failures
  • Unstable localhost dependencies
  • Broken external services

Solution:

We engineered robust fallback systems to maintain stability even when teammate modules failed.
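
The general fallback pattern can be sketched as a wrapper that tries a primary function and switches to a local substitute when it raises. The `with_fallback` helper and the function names in the usage example are hypothetical.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("webpulse.fallback")


def with_fallback(primary: Callable[..., Any],
                  fallback: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `primary` so any exception routes the call to `fallback`."""
    def guarded(*args: Any, **kwargs: Any) -> Any:
        try:
            return primary(*args, **kwargs)
        except Exception as exc:  # deliberately broad: any failure triggers the fallback
            name = getattr(primary, "__name__", repr(primary))
            logger.warning("primary %s failed (%s); using fallback", name, exc)
            return fallback(*args, **kwargs)
    return guarded


# Usage with hypothetical functions: a remote detector and a local rule-based one.
def remote_detector(metrics: dict) -> str:
    raise ConnectionError("teammate service unreachable")


def local_detector(metrics: dict) -> str:
    return "high_latency" if metrics.get("latency_ms", 0) > 800 else "ok"


detect = with_fallback(remote_detector, local_detector)
print(detect({"latency_ms": 1200}))
```

Catching every exception keeps the pipeline running, at the cost of hiding bugs in the primary path, so logging each fallback is important.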


AI Output Reliability

LLM responses were inconsistent and sometimes malformed.

Solution:

We implemented:

  • Robust JSON extraction
  • Response normalization
  • Smart override logic
  • Confidence scoring
  • Fail-safe root cause routing
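
A minimal sketch of JSON extraction and normalization for LLM replies that wrap the payload in extra prose. It scans for the first position where a valid JSON object can be decoded, then fills in defaults for anything missing or malformed. The default values and the two-field report shape are assumptions chosen for this example.

```python
import json
from typing import Optional

_decoder = json.JSONDecoder()

# Hypothetical fail-safe values used when the reply is missing or unusable.
DEFAULT_REPORT = {"root_cause": "unknown", "confidence": "low"}


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None."""
    start = text.find("{")
    while start != -1:
        try:
            value, _end = _decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    return None


def normalize(raw_reply: str) -> dict:
    """Merge the parsed reply over the defaults and sanitize the confidence label."""
    report = dict(DEFAULT_REPORT)
    parsed = extract_json(raw_reply) or {}
    report.update({k: v for k, v in parsed.items() if k in DEFAULT_REPORT})
    label = str(report["confidence"]).strip().lower()
    report["confidence"] = label if label in ("high", "medium", "low") else "low"
    return report
```

A reply with no decodable object yields the default report, so downstream code always receives the same keys.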

Deployment Stability

Public accessibility was critical for team integration and judging.

Solution:

  • Ngrok public tunnels
  • Safe deployment orchestration
  • Dedicated fallback pathways

What We Learned

This project taught us that building effective AI systems is not only about intelligence but also about orchestration, reliability, and product execution.

Major Learnings:

  • Multi-agent system architecture
  • Fault-tolerant backend engineering
  • AI prompt design for operational reasoning
  • Distributed systems diagnostics
  • API integration at scale
  • Public deployment workflows
  • Product-focused hackathon execution

Mathematical Perspective

We modeled incident confidence scoring as:

$$ ConfidenceScore = \begin{cases} 0.9, & \text{High Confidence} \\ 0.6, & \text{Medium Confidence} \\ 0.3, & \text{Low Confidence} \end{cases} $$
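
The piecewise mapping above translates directly into a lookup table. In this sketch, treating an unrecognized label as low confidence is our assumption, not something the formula specifies.

```python
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}


def confidence_score(level: str) -> float:
    """Map a confidence label to its numeric score (unknown labels count as low)."""
    return CONFIDENCE_SCORES.get(level.strip().lower(), CONFIDENCE_SCORES["low"])
```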

Anomaly severity was approximated through:

$$ Severity = f(Latency, ErrorRate, LogSignals) $$

where high latency, elevated error rates, and critical logs each increase incident priority and remediation urgency.
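
The exact form of f is not specified. One possible instance is a capped weighted sum that rises with each input, sketched below. The weights (0.4, 0.4, 0.2) and the saturation points (2000 ms, 25% errors, 10 critical log lines) are purely illustrative assumptions.

```python
def severity(latency_ms: float, error_rate: float, critical_logs: int) -> float:
    """Return a score in [0, 1] that is non-decreasing in every input."""
    latency_term = min(latency_ms / 2000.0, 1.0)   # saturates at 2000 ms
    error_term = min(error_rate / 0.25, 1.0)       # saturates at 25% errors
    log_term = min(critical_logs / 10.0, 1.0)      # saturates at 10 critical lines
    return round(0.4 * latency_term + 0.4 * error_term + 0.2 * log_term, 3)
```

Capping each term keeps a single extreme metric from dominating the score, while still letting it contribute its full weight.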


Final Outcome

WebPulse evolved from a simple incident analyzer into a scalable autonomous infrastructure intelligence platform.

It demonstrates:

  • AI orchestration
  • Autonomous incident response
  • Real-time fault diagnosis
  • Website health intelligence
  • Operational resilience

Future Vision

WebPulse can be extended into:

  • Kubernetes observability
  • Cloud infrastructure monitoring
  • Enterprise DevOps tooling
  • Predictive failure prevention
  • Self-healing systems
