Elastic CX Incident Commander

Inspiration

Support and ops teams lose too much time during incidents because context is fragmented across tickets, logs, chats, and runbooks. Most AI assistants can summarize text, but they don’t reliably complete multi-step operational workflows.
We wanted to build an agent that behaves like a real incident commander: it gathers evidence, reasons through the issue, chooses tools, and executes actions safely.

What it does

Elastic CX Incident Commander is a context-driven, multi-step agent system built with Elasticsearch Agent Builder.

It:

Ingests tickets, logs, KB/runbooks, and event streams into Elasticsearch.
Uses hybrid/vector retrieval to collect the most relevant incident context.
Runs ES|QL queries to detect patterns, timelines, and impact signals.
Produces severity classification, probable root-cause hypotheses, and recommended actions.
Executes actions (create ticket, assign owner, send team update) behind verification gates.
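
The hybrid retrieval step in the list above combines keyword and vector results. One common way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is a minimal illustration of that technique, not the project's actual retrieval code. The document IDs and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    and the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword query and a vector query.
keyword_hits = ["ticket-101", "runbook-7", "log-55"]
vector_hits = ["runbook-7", "log-88", "ticket-101"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the property a hybrid retriever needs when lexical and semantic signals disagree.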

A reviewer step also checks each action before execution, so that actions are explainable and auditable.
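
A verification gate of this kind could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `ProposedAction` type, the allowlist, the `0.7` confidence threshold, and the audit-log format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedAction:
    name: str                      # e.g. "create_ticket"
    confidence: float              # reviewer confidence between 0.0 and 1.0
    evidence_ids: list = field(default_factory=list)


# Hypothetical allowlist, based on the three actions named in the write-up.
ALLOWED_ACTIONS = {"create_ticket", "assign_owner", "send_team_update"}


def review(action, min_confidence=0.7):
    """Return (approved, reason) for a proposed action."""
    if action.name not in ALLOWED_ACTIONS:
        return False, f"action '{action.name}' is not on the allowlist"
    if not action.evidence_ids:
        return False, "no supporting evidence attached"
    if action.confidence < min_confidence:
        return False, f"confidence {action.confidence:.2f} is below {min_confidence:.2f}"
    return True, "approved"


def execute_with_gate(action, executor, audit_log):
    """Record the review decision, then run the executor only if approved."""
    approved, reason = review(action)
    audit_log.append({"action": action.name, "approved": approved, "reason": reason})
    return executor(action) if approved else None


audit_log = []
result = execute_with_gate(
    ProposedAction("create_ticket", 0.9, ["log-55"]),
    executor=lambda a: f"ran {a.name}",
    audit_log=audit_log,
)
blocked = execute_with_gate(
    ProposedAction("delete_index", 0.99, ["log-55"]),
    executor=lambda a: f"ran {a.name}",
    audit_log=audit_log,
)
```

Writing the decision to the audit log before anything runs means rejected actions leave a trace too, which is what makes the flow auditable.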

How we built it

Indexed structured and unstructured incident data in Elasticsearch.
Configured Agent Builder with tool access for:
- Search retrieval
- ES|QL analytics
- Workflow/action execution
Designed a multi-agent flow:
- Triage Agent (severity + business impact)
- Investigator Agent (evidence + root-cause clues)
- Action Agent (execution plan + automation)
- Reviewer Agent (confidence + safety validation)
Built lightweight API/UI components for demo interactions and result visualization.
Added measurable output metrics (time saved, steps reduced, confidence score).
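
The four-agent flow described above can be pictured as a simple sequential pipeline over a shared incident record. The sketch below uses plain functions as stand-ins for the agents. The severity thresholds, field names, and rules are hypothetical placeholders; in the real system these stages would call Agent Builder tools.

```python
def triage(incident):
    # Hypothetical rule: derive severity from the number of affected users.
    users = incident["affected_users"]
    incident["severity"] = "SEV1" if users >= 1000 else "SEV2" if users >= 100 else "SEV3"
    return incident


def investigate(incident):
    # Stand-in for search and ES|QL analysis: keep error-level lines as evidence.
    incident["evidence"] = [line for line in incident["logs"] if "ERROR" in line]
    return incident


def plan_actions(incident):
    # Every incident gets a ticket; the highest severity also notifies the team.
    actions = ["create_ticket"]
    if incident["severity"] == "SEV1":
        actions.append("send_team_update")
    incident["planned_actions"] = actions
    return incident


def review(incident):
    # Approve the plan only if the investigation produced some evidence.
    incident["approved"] = bool(incident["evidence"])
    return incident


PIPELINE = [triage, investigate, plan_actions, review]


def run_pipeline(incident):
    """Pass the incident record through each stage in order."""
    for stage in PIPELINE:
        incident = stage(incident)
    return incident


outcome = run_pipeline({
    "affected_users": 2500,
    "logs": ["INFO service started", "ERROR payment timeout"],
})
```

Keeping each stage as a function over the same record makes the hand-offs explicit and lets any stage be swapped or tested in isolation.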

Challenges we ran into

Noisy and conflicting logs: Different sources often suggested different root causes.
Balancing speed vs. reliability: Fully automated actions can be risky without validation.
Prompt-only behavior drift: We had to enforce tool-first execution and evidence grounding.
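
One way to enforce evidence grounding is to reject any claim that does not cite a document from the retrieved context. The following is a minimal sketch of that idea, assuming each claim carries a `cites` list of document IDs. The claim structure and sample data are hypothetical and not taken from the project.

```python
def check_grounding(claims, retrieved_ids):
    """Split claims into those fully backed by retrieved documents and the rest."""
    retrieved = set(retrieved_ids)
    grounded, ungrounded = [], []
    for claim in claims:
        cited = set(claim.get("cites", []))
        # A claim is grounded only if it cites something and every cited ID
        # is part of the retrieved context.
        if cited and cited <= retrieved:
            grounded.append(claim)
        else:
            ungrounded.append(claim)
    return grounded, ungrounded


# Hypothetical retrieved context and model claims.
retrieved_ids = ["log-55", "ticket-101", "runbook-7"]
claims = [
    {"text": "Payment API timeouts began after the latest deploy.", "cites": ["log-55"]},
    {"text": "The database is out of disk space.", "cites": []},
    {"text": "A similar outage is covered by a runbook.", "cites": ["runbook-9"]},
]

grounded, ungrounded = check_grounding(claims, retrieved_ids)
```

Claims with no citation, or with a citation to a document that was never retrieved, land in `ungrounded` and can be dropped or sent back for another retrieval pass.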

Accomplishments that we're proud of

Built a true multi-step, tool-driven workflow (not a single prompt answer).
Achieved fast incident triage with evidence-linked recommendations.
Created clear action traces that explain what was done and why.
Demonstrated practical impact with rough benchmark improvements:
- Triage time reduced from ~20 min to ~3 min
- Manual handoff steps reduced by ~35–50%
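
Taking the approximate triage figures above at face value, the relative reduction can be worked out with a small helper. The function name is illustrative, and the inputs are the rough estimates quoted above rather than measured data.

```python
def percent_reduction(before, after):
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Roughly 20 minutes down to roughly 3 minutes.
triage_reduction = percent_reduction(20, 3)
print(f"Triage time reduced by about {triage_reduction:.0f}%")
```

With those inputs the reduction comes to about 85%.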

What we learned

Reliable agent decisions depend heavily on retrieval quality.
ES|QL is powerful for time-based and operational diagnostics.
Multi-agent verification significantly improves trust in automated actions.
Practical AI agents need execution controls, not just model intelligence.
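
As an illustration of the time-based diagnostics mentioned above, the sketch below builds an ES|QL query that counts error logs per service in fixed time buckets. The index pattern `logs-*` and the fields `log.level` and `service.name` are assumptions about the data layout, not details from the project. The function only builds the query string and does not contact a cluster.

```python
def build_error_timeline_query(index_pattern="logs-*", window="1 hour", bucket="5 minutes"):
    """Build an ES|QL query counting error logs per service per time bucket.

    Arguments are interpolated directly into the query text, so they should be
    trusted constants and never raw user input.
    """
    return "\n".join([
        f"FROM {index_pattern}",
        f'| WHERE @timestamp >= NOW() - {window} AND log.level == "error"',
        f"| EVAL bucket = DATE_TRUNC({bucket}, @timestamp)",
        "| STATS errors = COUNT(*) BY bucket, service.name",
        "| SORT bucket ASC",
    ])


query = build_error_timeline_query()
print(query)
```

A sudden jump in `errors` for one service between adjacent buckets gives a candidate start time for the incident timeline.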

What's next for Elastic CX Incident Commander

Deeper integrations (Slack, Jira, PagerDuty, GitHub).
Continuous learning loop from incident outcomes and analyst feedback.
Domain packs (fintech, healthcare, DevOps, customer support).
Policy-aware action controls for enterprise governance and compliance.

Built With

agent
elastic
elasticsearch

Updates

Yash Kavaiya started this project — Feb 22, 2026 07:42 AM EST
