Architecture Diagram

IncidentCop AI

Inspiration

In real production environments, incident investigation is manual and repetitive. When an alert fires, engineers search logs in Elasticsearch, group exceptions, inspect traces, check deployments, and decide whether to escalate. This process can take 30–60 minutes.

We wanted to reduce that time to under 2 minutes using a deterministic, rule-based system that transforms raw telemetry into structured diagnosis — without speculation or semantic reasoning.

What it does

IncidentCop AI automatically investigates production alerts using Elasticsearch data.

Given an alert_id, it:

Retrieves alert details from the alerts index
Fetches error logs within the alert time window
Counts total errors
Identifies the dominant exception
Calculates dominance percentage
Classifies root cause using strict keyword rules
Checks recent deployments
Computes time gap in minutes
Determines correlation strength
Validates failure origin using distributed traces
Calculates deterministic confidence
Triggers Jira and Slack automation if severity is HIGH or CRITICAL

Dominance percentage is calculated as:

$$ \text{dominance_percentage} = \frac{\text{dominant_count}}{\text{total_errors}} \times 100 $$

Confidence scoring model:

+2 if dominance ≥ 60%
+1 if dominance between 40–59%
+2 if trace confirms origin
+1 if no secondary pattern exceeds 30%
+1 if correlation strength is STRONG

Final confidence:

≥ 5 → HIGH
3–4 → MEDIUM
< 3 → LOW

All conclusions are numeric and rule-based.

How we built it

IncidentCop AI is a deterministic investigation pipeline built over Elasticsearch production data.

Core Data Sources

alerts index
logs index
deployments index
traces index

Investigation Workflow

Alert → Logs → Deployment → Traces → Confidence → Automation

Deployment correlation is calculated using:

$$ \text{time_gap_minutes} = \text{alert_start} - \text{deployment_timestamp} $$

Correlation strength thresholds:

≤ 30 → STRONG
31–120 → MODERATE
> 120 → WEAK

Root cause classification is strictly keyword-based:

Timeout / ConnectionRefused / ServiceUnavailable → DOWNSTREAM_DEPENDENCY_FAILURE
NullPointer / IllegalState / IndexOutOfBounds → INTERNAL_APPLICATION_ERROR
Config-related errors → CONFIGURATION_ISSUE

No semantic reasoning layer is used.