🔥 Inspiration

Production incidents keep engineers awake at night — literally. The average on-call engineer gets paged 5-10 times per week for incidents that follow the same pattern: alert fires, someone opens 6 dashboards, spends 90 minutes correlating logs and metrics, finds the bad deployment, rolls back. Repeat forever.

The 2024 State of DevOps Report shows MTTR for critical incidents averages 2.5 hours. That's not a tooling problem — it's a reasoning problem. Dashboards give you data; they don't think. I wanted to build a system that actually investigates like a senior engineer would.

🏗️ How I Built It

OpsGuard AI runs entirely on Elastic Agent Builder with a single unified Commander Agent that performs multi-perspective analysis across 4 phases:

Phase 1 — Monitor: ES|QL time-series queries using STATS, CASE, and EVAL detect CPU, memory, and error rate anomalies across all services in real time.
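A Phase 1 query might look like the sketch below. The index and field names (opsguard-metrics, cpu.pct, error.rate) and the thresholds are illustrative assumptions, not the project's actual schema:

```esql
FROM opsguard-metrics
| WHERE @timestamp > NOW() - 15 minutes
| STATS avg_cpu = AVG(cpu.pct),
        p95_errors = PERCENTILE(error.rate, 95)
  BY service.name
| EVAL cpu_status = CASE(avg_cpu > 0.9, "critical",
                         avg_cpu > 0.7, "warning",
                         "ok")
| WHERE cpu_status != "ok"
| SORT avg_cpu DESC
```

One query per signal keeps each anomaly check cheap and lets the agent run them in parallel across services.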

Phase 2 — Diagnose: Parameterized ES|QL queries correlate logs with recent deployments. Then semantic_text vector search finds similar historical incidents — zero embedding pipeline needed.
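The historical-incident lookup can be sketched as a single ES|QL query — MATCH on a semantic_text field performs a semantic search with no separate embedding step. The opsguard-incidents index and its fields are assumed names for illustration:

```esql
FROM opsguard-incidents
| WHERE MATCH(summary, "checkout service 5xx spike after deployment")
| KEEP incident.id, summary, resolution
| LIMIT 5
```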

Phase 3 — Impact: ES|QL EVAL expressions compare current revenue metrics against baselines, translating technical degradation into dollars per hour ($12,450/hr in the demo).
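In spirit, the Phase 3 math is a baseline comparison like the following sketch. The opsguard-business index, the field names, and the $18,500/hr baseline are made up for illustration:

```esql
FROM opsguard-business
| WHERE @timestamp > NOW() - 1 hour
| STATS current_revenue = SUM(order.revenue)
| EVAL baseline_revenue = 18500.0,
       loss_per_hour = baseline_revenue - current_revenue
| KEEP current_revenue, loss_per_hour
```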

Phase 4 — Act: Two Elastic Workflows execute deterministically — an incident ticket lands in opsguard-active, a formatted alert goes to opsguard-notifications, and everything is logged to opsguard-audit for compliance.

The key architectural innovation is disagreement resolution: when multiple root cause hypotheses emerge, the agent scores each by temporal correlation, evidence quality, and historical precedent — then explains its decision transparently.

Hypothesis A (87%): Bad deployment v2.4.2 → deployment at 14:32, first error at 14:34 ✓ → matches INC-2026-001 exactly ✓

Hypothesis B (42%): Database degradation → DB metrics elevated, but timing doesn't align ✗

Decision: Hypothesis A — rollback v2.4.2

🧠 What I Learned

  1. ES|QL is genuinely powerful for agentic workflows. Chaining FROM → WHERE → STATS → EVAL → CASE → SORT reads like a natural investigation process. The piped syntax maps directly to how engineers think about incident analysis.

  2. semantic_text is a game-changer. I expected to spend days setting up an embedding pipeline (model selection, chunking, inference endpoints). Instead I declared "type": "semantic_text" in the mapping and vector search just worked. That's the right abstraction.

  3. Workflows are the right primitive for agent actions. An LLM deciding what to do + a deterministic Workflow executing it = explainability + reliability. The audit trail in opsguard-audit means every action the agent takes is traceable.
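Point 2 above is easiest to appreciate concretely: the entire "embedding pipeline" collapses into one mapping declaration. In a Dev Tools sketch (index and field names assumed; semantic_text falls back to the default inference endpoint unless one is specified):

```
PUT opsguard-incidents
{
  "mappings": {
    "properties": {
      "summary":    { "type": "semantic_text" },
      "resolution": { "type": "text" }
    }
  }
}
```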

⚔️ Challenges

Agent disagreement scoring was harder than expected. Telling the agent to "pick the most likely root cause" produced inconsistent, overconfident results. The fix was defining explicit scoring bands (90-100% = strong evidence, 70-89% = good, 50-69% = moderate) and forcing temporal correlation analysis before final selection.
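The scoring bands translate naturally into the same CASE idiom used in the investigation queries. Assuming a hypothetical opsguard-hypotheses index with a numeric score field, the banding looks like:

```esql
FROM opsguard-hypotheses
| EVAL evidence_band = CASE(score >= 90, "strong evidence",
                            score >= 70, "good evidence",
                            score >= 50, "moderate evidence",
                            "insufficient - keep investigating")
| SORT score DESC
```

Making the bands explicit in the prompt, rather than letting the model improvise percentages, is what made the confidence numbers reproducible.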

Elastic Serverless index naming caused early failures — logs-* and metrics-* prefixes are reserved for built-in data streams on Serverless. Switching to opsguard-* naming resolved everything and is actually cleaner.

📊 Results

| Metric | Before | After | Δ |
|--------|--------|-------|---|
| MTTR | 2.5 hours | 1m 42s | 97% faster |
| Revenue loss / incident | $10K–50K | $500–2K | 80–95% less |
| On-call pages resolved automatically | 0% | ~80% | fully automated triage |
| Root cause accuracy | manual, variable | 87% confidence scored | transparent + explainable |
