WebPulse: Autonomous AI Incident War Room
Inspiration
Modern digital infrastructure is increasingly dependent on distributed systems, microservices, and always-on platforms. Yet when failures occur, engineering teams are often overwhelmed by fragmented alerts, scattered logs, delayed root cause identification, and costly downtime.
WebPulse was inspired by the need to turn traditional, reactive incident management into an intelligent, autonomous system that acts like a real-time AI Site Reliability Engineer (SRE). Our goal was to build a platform that does more than detect incidents: it investigates them, correlates related failures, explains what happened, and recommends fixes.
We wanted to create a system capable of answering critical operational questions automatically:
- What failed?
- Why did it fail?
- Which service is responsible?
- How severe is the blast radius?
- What actions should be taken immediately?
What We Built
WebPulse is an AI-powered Incident War Room platform.
Core Capabilities:
- Real-time anomaly detection
- Cross-service failure correlation
- AI-driven root cause analysis
- Automated incident timelines
- Recommended remediation strategies
- Website and infrastructure health scanning
Multi-agent operational modes:
- Incident analysis
- Monitoring
- ChatOps
- Summary reporting
- Data collection
Key Technologies:
- FastAPI for orchestration
- Python for modular intelligence pipelines
- Mistral/Ollama LLM for AI root cause reasoning
- Ngrok for public deployment
- Custom correlation engine
- Fallback anomaly detection
- Swagger UI for testing and demonstration
How We Built It
WebPulse was architected as a modular multi-layer AI system:
Metrics / Logs / Traces
↓
Anomaly Detection Engine
↓
Correlation Engine
↓
AI Root Cause Analysis
↓
Recovery Plan + Incident Report
↓
Dashboard / API
System Components:
1. Anomaly Detection
We built threshold and signal-based anomaly detection capable of identifying:
- High latency
- Elevated error rates
- Timeout failures
- JWT authentication failures
- Database degradation
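As a rough illustration of the threshold and signal checks listed above, here is a minimal sketch. The metric names, threshold values, log keywords, and the `detect_anomalies` function are all assumptions made for the example, not WebPulse's actual code.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per service.
THRESHOLDS = {
    "latency_ms": 800.0,
    "error_rate": 0.05,
    "timeout_count": 3,
}

# Hypothetical log keywords mapped to the anomaly kind they signal.
LOG_SIGNALS = {
    "jwt": "jwt_auth_failure",
    "database": "database_degradation",
}


@dataclass
class Anomaly:
    kind: str
    detail: str


def detect_anomalies(metrics: dict, logs: list[str]) -> list[Anomaly]:
    """Return threshold breaches first, then log-based signals."""
    found = []
    # Threshold checks: flag any known metric above its limit.
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            found.append(Anomaly(kind=f"high_{name}", detail=f"{value} > {limit}"))
    # Signal checks: case-insensitive keyword match on each log line.
    for line in logs:
        lowered = line.lower()
        for needle, kind in LOG_SIGNALS.items():
            if needle in lowered:
                found.append(Anomaly(kind=kind, detail=line))
    return found
```

Keeping thresholds and keywords in plain dictionaries makes it easy to adjust them without touching the detection loop.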
2. Correlation Engine
This engine maps service dependencies and determines probable root services by analyzing:
- Logs
- Trace summaries
- Failure propagation patterns
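One simple way to pick probable root services from a dependency map is sketched below. It assumes a hypothetical graph format (service name mapped to the services it calls) and a heuristic chosen for illustration: a failing service whose own dependencies are all healthy is a likely root. The actual WebPulse engine also uses logs and trace summaries, which this sketch leaves out.

```python
def probable_roots(dependencies: dict[str, list[str]], failing: set[str]) -> list[str]:
    """Failing services with no failing dependency are treated as probable roots."""
    roots = []
    for service in sorted(failing):
        deps = dependencies.get(service, [])
        if not any(dep in failing for dep in deps):
            roots.append(service)
    return roots


def affected_services(dependencies: dict[str, list[str]], root: str) -> set[str]:
    """Services that depend on `root`, directly or transitively."""
    # Reverse the edges: for each service, who calls it?
    dependents: dict[str, set[str]] = {}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    seen.discard(root)  # guard against dependency cycles that lead back to the root
    return seen
```

For example, with `gateway -> auth, orders`, `auth -> db`, and `orders -> db`, a failure in `gateway`, `orders`, and `db` points at `db` as the probable root, and every other service is affected by it.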
3. AI Incident Brain
Using structured prompts with LLMs, WebPulse generates:
- Root cause hypotheses
- Confidence scores
- Severity levels
- Blast radius estimates
- Recommended fixes
- Recovery plans
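A structured prompt for this step might look like the sketch below. The field names, the wording of the instructions, and the `build_incident_prompt` helper are assumptions for illustration; the prompts actually used in WebPulse are not shown in this write-up.

```python
import json

# Hypothetical output fields, mirroring the items listed above.
REQUIRED_FIELDS = [
    "root_cause",
    "confidence",
    "severity",
    "blast_radius",
    "recommended_fixes",
    "recovery_plan",
]


def build_incident_prompt(anomalies: list[str], root_candidates: list[str]) -> str:
    """Build a prompt that asks the model for one JSON object with fixed keys."""
    schema = {field: "..." for field in REQUIRED_FIELDS}
    return (
        "You are an SRE assistant. Analyze the incident below.\n"
        f"Anomalies: {json.dumps(anomalies)}\n"
        f"Candidate root services: {json.dumps(root_candidates)}\n"
        "Respond with a single JSON object using exactly these keys:\n"
        f"{json.dumps(schema, indent=2)}\n"
    )
```

Asking for a fixed set of keys does not guarantee well-formed output, but it makes the response easier to validate afterwards.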
4. Website Monitoring Upgrade
To make WebPulse proactive, we introduced website URL scanning:
- Availability
- Latency
- HTTP degradation
- Infrastructure recommendations
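A basic availability and latency probe can be written with the Python standard library alone. The sketch below is an assumption about how such a scan could work, not the WebPulse scanner itself; the function names, the 1500 ms slowness cutoff, and the health labels are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify_scan(status: Optional[int], latency_ms: float, slow_ms: float = 1500.0) -> str:
    """Map an HTTP status (None if unreachable) and latency to a health label."""
    if status is None:
        return "down"
    if status >= 400:
        return "degraded"
    if latency_ms > slow_ms:
        return "slow"
    return "healthy"


def scan_website(url: str, timeout: float = 5.0) -> dict:
    """Fetch the URL once and report status, latency, and a health label."""
    start = time.perf_counter()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except (OSError, ValueError):
        # Connection errors, timeouts, or a malformed URL: treat as unreachable.
        status = None
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "url": url,
        "status": status,
        "latency_ms": latency_ms,
        "health": classify_scan(status, latency_ms),
    }
```

Separating `classify_scan` from the network call keeps the health rules testable without making real requests.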
5. Multi-Agent Architecture
WebPulse supports multiple operational modes through a unified API:
/analyze/agent/scan-website
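The idea of several operational modes behind one entry point can be sketched independently of any web framework. The mode names, handlers, and payload shapes below are hypothetical and do not reflect WebPulse's actual routes; in a FastAPI service, a function like `dispatch` would typically sit behind a request handler.

```python
from typing import Callable

# Registry of mode name -> handler; filled in by the @mode decorator.
HANDLERS: dict[str, Callable[[dict], dict]] = {}


def mode(name: str):
    """Register a handler function under a mode name."""
    def register(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        HANDLERS[name] = func
        return func
    return register


@mode("incident")
def handle_incident(payload: dict) -> dict:
    # Placeholder: a real handler would run detection, correlation, and analysis.
    return {"mode": "incident", "services": payload.get("services", [])}


@mode("monitor")
def handle_monitor(payload: dict) -> dict:
    # Placeholder: a real handler would scan the given URL.
    return {"mode": "monitor", "url": payload.get("url")}


def dispatch(mode_name: str, payload: dict) -> dict:
    """Route a request to the handler for its mode, or report the valid modes."""
    handler = HANDLERS.get(mode_name)
    if handler is None:
        return {"error": f"unknown mode: {mode_name}", "available": sorted(HANDLERS)}
    return handler(payload)
```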
Challenges We Faced
Integration Complexity
Working with multiple independently developed modules introduced:
- Import conflicts
- Runtime failures
- Unstable localhost dependencies
- Broken external services
Solution:
We built fallback paths so that the pipeline kept running even when a teammate's module or an external service failed.
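A common pattern for this kind of resilience is to wrap each unreliable call with a simpler local substitute. The sketch below assumes a hypothetical `with_fallback` helper and logger name; it shows the general technique rather than the code used in WebPulse.

```python
import logging

logger = logging.getLogger("webpulse.fallback")  # hypothetical logger name


def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and falls back on any exception."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception as exc:
            # Log the failure so the degraded path is visible, then continue.
            name = getattr(primary, "__name__", "callable")
            logger.warning("primary %s failed (%s); using fallback", name, exc)
            return fallback(*args, **kwargs)
    return guarded
```

Catching every exception is a deliberate trade-off here: it favors keeping a demo pipeline alive over surfacing bugs early, so the warning log matters.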
AI Output Reliability
LLM responses were inconsistent and sometimes malformed.
Solution:
We implemented:
- Robust JSON extraction
- Response normalization
- Smart override logic
- Confidence scoring
- Fail-safe root cause routing
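To illustrate JSON extraction and response normalization, the sketch below pulls the first valid JSON object out of free-form model text and fills in defaults for missing fields. The field names, default values, and function names are assumptions for the example.

```python
import json
from typing import Optional

# Hypothetical defaults used when the model omits or mangles a field.
DEFAULT_REPORT = {"root_cause": "unknown", "confidence": "low", "severity": "medium"}
VALID_CONFIDENCE = {"high", "medium", "low"}


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace; try the next one.
        start = text.find("{", start + 1)
    return None


def normalize_report(raw: Optional[dict]) -> dict:
    """Merge a parsed response over the defaults and sanitize known fields."""
    report = dict(DEFAULT_REPORT)
    if raw:
        for key in DEFAULT_REPORT:
            value = raw.get(key)
            if isinstance(value, str) and value.strip():
                report[key] = value.strip()
    report["confidence"] = report["confidence"].lower()
    if report["confidence"] not in VALID_CONFIDENCE:
        report["confidence"] = "low"
    return report
```

With this approach, a reply that wraps the JSON in extra chat text still parses, and a reply with no JSON at all yields the default report instead of an exception.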
Deployment Stability
Public accessibility was critical for team integration and judging.
Solution:
- Ngrok public tunnels
- Safe deployment orchestration
- Dedicated fallback pathways
What We Learned
This project taught us that building effective AI systems depends not only on model intelligence but also on orchestration, reliability, and product execution.
Major Learnings:
- Multi-agent system architecture
- Fault-tolerant backend engineering
- AI prompt design for operational reasoning
- Distributed systems diagnostics
- API integration at scale
- Public deployment workflows
- Product-focused hackathon execution
Mathematical Perspective
We modeled incident confidence scoring as:
$$ \text{ConfidenceScore} = \begin{cases} 0.9, & \text{High Confidence} \\ 0.6, & \text{Medium Confidence} \\ 0.3, & \text{Low Confidence} \end{cases} $$
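In code, this piecewise mapping reduces to a lookup table. The sketch below assumes the confidence label arrives as a string; treating unrecognized labels as low confidence is an assumption added for safety, not something stated above.

```python
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}


def confidence_score(label: str) -> float:
    """Map a confidence label to its numeric score (unknown labels count as low)."""
    return CONFIDENCE_SCORES.get(label.strip().lower(), CONFIDENCE_SCORES["low"])
```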
Anomaly severity was approximated as a function of three signals:
$$ \text{Severity} = f(\text{Latency}, \text{ErrorRate}, \text{LogSignals}) $$
where high latency, elevated error rates, and critical log entries each increase incident priority and remediation urgency.
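The write-up does not specify the form of f. One possible instance, shown purely as an assumption, is a weighted sum of normalized signals that grows with each input. The weights, the latency budget, the log cap, and the level cutoffs below are all hypothetical.

```python
def severity_score(
    latency_ms: float,
    error_rate: float,
    critical_logs: int,
    latency_budget_ms: float = 1000.0,  # hypothetical latency budget
    log_cap: int = 5,                   # hypothetical count at which logs saturate
) -> float:
    """Combine three signals into a score in [0, 1]; higher means more severe."""
    latency_term = min(max(latency_ms, 0.0) / latency_budget_ms, 1.0)
    error_term = min(max(error_rate, 0.0), 1.0)
    log_term = min(max(critical_logs, 0) / log_cap, 1.0)
    # Hypothetical weights: latency and errors dominate, logs add supporting evidence.
    return 0.4 * latency_term + 0.4 * error_term + 0.2 * log_term


def severity_level(score: float) -> str:
    """Bucket a score into a label using hypothetical cutoffs."""
    if score >= 0.7:
        return "critical"
    if score >= 0.4:
        return "major"
    return "minor"
```

Clamping each term keeps a single extreme reading from pushing the score past 1, while the sum still rises whenever any one signal gets worse.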
Final Outcome
WebPulse evolved from a simple incident analyzer into a scalable autonomous infrastructure intelligence platform.
It demonstrates:
- AI orchestration
- Autonomous incident response
- Real-time fault diagnosis
- Website health intelligence
- Operational resilience
Future Vision
WebPulse can be extended into:
- Kubernetes observability
- Cloud infrastructure monitoring
- Enterprise DevOps tooling
- Predictive failure prevention
- Self-healing systems
Built With
- fastapi
- ngrok
- python
- vite