Inspiration:
Every minute of downtime costs enterprises $9,000. On-call engineers drown in alert fatigue, spending 45+ minutes manually correlating noisy signals. We built OpsGuardian AI to be the autonomous sentinel that never sleeps: detecting anomalies, diagnosing root causes, and defending infrastructure in real time.
What it does:
OpsGuardian AI is an autonomous incident intelligence platform that:
- Detects anomalies via real-time ES|QL queries on Elasticsearch
- Correlates signals with historical incidents via Vector Search
- Reasons through root causes with explainable AI and a 5-layer Anti-Hallucination Pipeline
- Remediates automatically via Elastic Workflows
- Reduces Mean Time to Resolution by 96% (45 min → 2 min)
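To make the detection step more concrete, here is a minimal sketch of how an ES|QL anomaly query might be assembled in TypeScript. The index pattern, the field names (`log.level`, `service.name`), the threshold, and the `buildErrorSpikeQuery` helper are all illustrative assumptions, not the project's actual queries.

```typescript
// Hypothetical options; the real project's schema and thresholds are not documented here.
interface AnomalyQueryOptions {
  indexPattern: string;   // assumed index pattern, e.g. "logs-*"
  windowMinutes: number;  // look-back window in minutes
  errorThreshold: number; // minimum error count that counts as a spike
}

// Builds an ES|QL query that counts recent error logs per service and keeps
// only the services above the threshold. Values are interpolated directly,
// so this is only suitable for trusted configuration, not user input.
function buildErrorSpikeQuery(opts: AnomalyQueryOptions): string {
  if (!Number.isInteger(opts.windowMinutes) || opts.windowMinutes <= 0) {
    throw new Error("windowMinutes must be a positive integer");
  }
  return [
    `FROM ${opts.indexPattern}`,
    `| WHERE @timestamp > NOW() - ${opts.windowMinutes} minutes`,
    `| WHERE log.level == "error"`,
    `| STATS error_count = COUNT(*) BY service.name`,
    `| WHERE error_count > ${opts.errorThreshold}`,
    `| SORT error_count DESC`,
  ].join("\n");
}

const query = buildErrorSpikeQuery({
  indexPattern: "logs-*",
  windowMinutes: 5,
  errorThreshold: 100,
});
console.log(query);
```

The resulting string could be sent as the `query` field of a request to Elasticsearch's ES|QL `_query` endpoint; how OpsGuardian AI actually issues its queries is not specified in this write-up.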
How we built it:
- Elasticsearch Agent Builder as the core agent orchestration layer, backed by a reasoning model
- ES|QL for real-time anomaly detection queries
- Elasticsearch vector and hybrid search for historical incident retrieval
- Elastic Workflows for automated remediation actions
- Next.js 15 + Tailwind CSS for the dashboard
- Custom Anti-Hallucination Pipeline: Grounding Gate → Evidence Verification → Confidence Scoring → Citation Mapping
- Provider-swappable architecture: every dependency behind interfaces for zero vendor lock-in
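The four pipeline stages named above can be sketched as a chain of small functions. This is a simplified illustration under stated assumptions: the `Claim` and `Evidence` shapes, the scoring heuristic, and the `minConfidence` cutoff are hypothetical, and the project's real checks are likely more involved.

```typescript
// Hypothetical data shapes for the sketch.
interface Evidence { id: string; text: string; }
interface Claim { statement: string; evidenceIds: string[]; }
interface VerifiedClaim extends Claim { confidence: number; citations: string[]; }

// Grounding Gate: reject claims that cite no evidence at all.
function groundingGate(claims: Claim[]): Claim[] {
  return claims.filter((c) => c.evidenceIds.length > 0);
}

// Evidence Verification: keep only references to evidence that was actually
// retrieved, and drop claims left with no valid reference.
function verifyEvidence(claims: Claim[], evidence: Evidence[]): Claim[] {
  const known = new Set(evidence.map((e) => e.id));
  return claims
    .map((c) => ({ ...c, evidenceIds: c.evidenceIds.filter((id) => known.has(id)) }))
    .filter((c) => c.evidenceIds.length > 0);
}

// Confidence Scoring: toy heuristic (assumption), more evidence means higher confidence.
function scoreConfidence(claim: Claim): number {
  return Math.min(1, claim.evidenceIds.length / 3);
}

// Runs all stages; Citation Mapping attaches the supporting evidence text.
function runPipeline(claims: Claim[], evidence: Evidence[], minConfidence = 0.3): VerifiedClaim[] {
  const verified = verifyEvidence(groundingGate(claims), evidence);
  const byId = new Map(evidence.map((e): [string, Evidence] => [e.id, e]));
  return verified
    .map((c) => ({
      ...c,
      confidence: scoreConfidence(c),
      // Safe lookup: verifyEvidence already removed unknown ids.
      citations: c.evidenceIds.map((id) => `[${id}] ${byId.get(id)!.text}`),
    }))
    .filter((c) => c.confidence >= minConfidence);
}

const evidence: Evidence[] = [
  { id: "e1", text: "checkout-service error rate rose to 12%" },
  { id: "e2", text: "deploy of checkout-service at 14:02" },
];
const claims: Claim[] = [
  { statement: "A recent deploy caused the error spike", evidenceIds: ["e1", "e2"] },
  { statement: "The database is out of disk", evidenceIds: [] },
  { statement: "A DNS outage occurred", evidenceIds: ["missing"] },
];
console.log(runPipeline(claims, evidence));
```

In this sketch, only the first claim survives, because the other two either cite nothing or cite evidence that was never retrieved.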
Challenges we ran into:
- Designing an anti-hallucination pipeline that verifies AI claims against actual telemetry data
- Making the agent reasoning transparent and auditable (not a black box)
- Balancing demo reliability with real-time intelligence
- Creating a demo arc that emotionally demonstrates the chaos→resolution transformation
Accomplishments that we're proud of:
- 96% MTTR reduction demonstrated in the demo
- 5-layer Anti-Hallucination Engine that cites evidence for every AI claim
- Provider-agnostic architecture — swap any backend without code changes
- Transparent agent reasoning — every step shows the tool used and output
- Full demo mode with deterministic replay for reliable presentations
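As a rough illustration of how a provider-agnostic design and deterministic replay can fit together, the sketch below puts telemetry access behind an interface and supplies a replay implementation that returns pre-recorded frames in a fixed order. `TelemetryProvider`, `ReplayProvider`, `MetricSample`, and `findUnhealthyServices` are hypothetical names chosen for this example, not the project's actual types.

```typescript
// Hypothetical telemetry sample shape.
interface MetricSample { service: string; errorRate: number; timestamp: number; }

// Callers depend only on this interface, never on a concrete backend.
interface TelemetryProvider {
  fetchSamples(): Promise<MetricSample[]>;
}

// Deterministic replay: returns recorded frames in order and keeps
// repeating the last frame once the recording is exhausted.
class ReplayProvider implements TelemetryProvider {
  private readonly frames: MetricSample[][];
  private cursor = 0;

  constructor(frames: MetricSample[][]) {
    this.frames = frames;
  }

  async fetchSamples(): Promise<MetricSample[]> {
    const index = Math.min(this.cursor, this.frames.length - 1);
    this.cursor += 1;
    return this.frames[index] ?? [];
  }
}

// Detection logic is written against the interface, so a live provider
// could replace the replay provider without touching this function.
async function findUnhealthyServices(
  provider: TelemetryProvider,
  maxErrorRate: number,
): Promise<string[]> {
  const samples = await provider.fetchSamples();
  return samples.filter((s) => s.errorRate > maxErrorRate).map((s) => s.service);
}

const demo = new ReplayProvider([
  [{ service: "checkout", errorRate: 0.01, timestamp: 0 }],
  [{ service: "checkout", errorRate: 0.12, timestamp: 60 }],
]);

findUnhealthyServices(demo, 0.05)
  .then((first) => {
    console.log("frame 1:", first);
    return findUnhealthyServices(demo, 0.05);
  })
  .then((second) => console.log("frame 2:", second));
```

Because the replay provider always yields the same frames in the same order, a demo built on it behaves identically on every run. A live implementation of the same interface is not shown, since the write-up does not describe it.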
What we learned:
- Elasticsearch Agent Builder's power in orchestrating multi-tool reasoning workflows
- The importance of AI transparency and anti-hallucination in production systems
- How ES|QL enables real-time analytics that traditional dashboard approaches can't match
- That incident intelligence is the next evolution of observability
What's next for OpsGuardian AI:
- Real Kubernetes and AWS CloudWatch integration for live infrastructure monitoring
- Multi-agent collaboration for complex cross-service incidents
- Autonomous self-healing with approval workflows
- Custom playbook builder for team-specific incident response
- Enterprise features: SSO, multi-tenancy, compliance
Built With:
- agent-builder
- ai
- devops
- elastic-workflows
- elasticsearch
- esql
- incident-response
- next.js
- observability
- react
- tailwind-css
- typescript