Inspiration
Support and ops teams lose too much time during incidents because context is fragmented across tickets, logs, chats, and runbooks. Most AI assistants can summarize text, but they don’t reliably complete multi-step operational workflows.
We wanted to build an agent that behaves like a real incident commander: gather evidence, reason through the issue, choose tools, and execute actions safely.
What it does
Elastic CX Incident Commander is a context-driven, multi-step agent system built with Elasticsearch Agent Builder.
It:
- Ingests tickets, logs, KB/runbooks, and event streams into Elasticsearch.
- Uses hybrid/vector retrieval to collect the most relevant incident context.
- Runs ES|QL queries to detect patterns, timelines, and impact signals.
- Produces severity classification, probable root-cause hypotheses, and recommended actions.
- Executes actions (create ticket, assign owner, send team update) behind verification gates, so nothing runs unchecked.
We also use a reviewer step so actions are explainable and auditable before execution.
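One way to picture the hybrid retrieval step is reciprocal rank fusion (RRF), which merges a keyword ranking and a vector ranking into a single list. This is a minimal sketch under assumptions: the write-up does not say which fusion method the project uses, and the function name, document IDs, and the rank constant of 60 are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (rank_constant + rank)) over the lists
    it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by score descending; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for one incident query (IDs are made up).
keyword_hits = ["ticket-101", "runbook-7", "log-554"]
vector_hits = ["runbook-7", "log-903", "ticket-101"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why a runbook matched by both keyword and vector search would outrank a log line found by only one of them.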
How we built it
- Indexed structured + unstructured incident data in Elasticsearch.
- Configured Agent Builder with tool access for:
  - Search retrieval
  - ES|QL analytics
  - Workflow/action execution
- Designed a multi-agent flow:
  - Triage Agent (severity + business impact)
  - Investigator Agent (evidence + root-cause clues)
  - Action Agent (execution plan + automation)
  - Reviewer Agent (confidence + safety validation)
- Built lightweight API/UI components for demo interactions and result visualization.
- Added measurable output metrics (time saved, steps reduced, confidence score).
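The four-agent handoff described above can be sketched as a plain Python pipeline. Everything here is hypothetical: the `Incident` fields, the severity threshold, the action names, and the minimum-evidence rule are placeholders chosen for illustration, and in the real system each stage would be an Agent Builder agent calling tools rather than a local function.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    summary: str
    error_rate: float  # fraction of failing requests, 0.0 to 1.0
    evidence: list = field(default_factory=list)
    severity: str = ""
    plan: list = field(default_factory=list)
    approved: bool = False
    trace: list = field(default_factory=list)  # audit trail of each stage


def triage(incident):
    # Placeholder threshold, not the project's actual severity rule.
    incident.severity = "SEV1" if incident.error_rate >= 0.25 else "SEV3"
    incident.trace.append(f"triage: {incident.severity}")
    return incident


def investigate(incident, retrieved_docs):
    incident.evidence = list(retrieved_docs)
    incident.trace.append(f"investigator: {len(incident.evidence)} evidence items")
    return incident


def plan_actions(incident):
    incident.plan = ["create_ticket", "assign_owner"]
    if incident.severity == "SEV1":
        incident.plan.append("send_team_update")
    incident.trace.append(f"action: planned {incident.plan}")
    return incident


def review(incident, min_evidence=2):
    # The reviewer blocks execution when the plan rests on too little evidence.
    incident.approved = len(incident.evidence) >= min_evidence
    verdict = "approved" if incident.approved else "blocked"
    incident.trace.append(f"reviewer: {verdict}")
    return incident


def run_pipeline(incident, retrieved_docs):
    incident = triage(incident)
    incident = investigate(incident, retrieved_docs)
    incident = plan_actions(incident)
    return review(incident)


result = run_pipeline(
    Incident(summary="Checkout errors spiking", error_rate=0.4),
    retrieved_docs=["log-554", "runbook-7"],
)
print(result.trace)
```

Keeping the trace on the incident object means every stage leaves a record, so a blocked plan can be explained after the fact.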
Challenges we ran into
- Noisy and conflicting logs: Different sources often suggested different root causes.
- Balancing speed vs reliability: Fully automated actions can be risky without validation.
- Prompt-only behavior drift: We had to enforce tool-first execution and evidence grounding.
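Tool-first execution and evidence grounding can be enforced with a small check outside the model. The sketch below is an assumption about how such a gate might look, not the project's implementation: the `Hypothesis` type, the function names, and the evidence IDs are all invented for illustration. It drops any hypothesis that cites nothing or cites evidence no tool returned, then orders the remaining, possibly conflicting, hypotheses by how much evidence supports each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    cited_ids: tuple  # IDs of evidence the agent claims to rely on


def grounded_hypotheses(hypotheses, retrieved_ids, tool_calls):
    """Keep only hypotheses backed by evidence that a tool actually returned."""
    if not tool_calls:
        # Tool-first rule: if no tool ran, nothing can be grounded.
        return []
    known = set(retrieved_ids)
    return [
        h for h in hypotheses
        if h.cited_ids and all(i in known for i in h.cited_ids)
    ]


def rank_by_support(hypotheses):
    """Order conflicting hypotheses by number of distinct supporting items."""
    return sorted(hypotheses, key=lambda h: (-len(set(h.cited_ids)), h.cause))


candidates = [
    Hypothesis("db connection pool exhausted", ("log-554", "log-903")),
    Hypothesis("bad deploy", ("log-554",)),
    Hypothesis("DNS outage", ()),                # no citation: dropped
    Hypothesis("cache stampede", ("log-000",)),  # unknown ID: dropped
]

kept = grounded_hypotheses(
    candidates,
    retrieved_ids=["log-554", "log-903"],
    tool_calls=["search", "esql"],
)
ranked = rank_by_support(kept)
print([h.cause for h in ranked])
```

Counting supporting items is a deliberately crude tiebreaker; a real system would likely weight sources by reliability before choosing between conflicting root causes.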
Accomplishments that we're proud of
- Built a true multi-step, tool-driven workflow (not a single prompt answer).
- Achieved fast incident triage with evidence-linked recommendations.
- Created clear action traces that explain what was done and why.
- Demonstrated practical impact with rough benchmark improvements:
  - Triage time reduced from ~20 min to ~3 min
  - Manual handoff steps reduced by ~35–50%
What we learned
- Agent decisions are only as reliable as the context retrieval returns, so retrieval quality matters most.
- ES|QL is powerful for time-based and operational diagnostics.
- Multi-agent verification significantly improves trust in automated actions.
- Practical AI agents need execution controls, not just model intelligence.
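As an illustration of the time-based diagnostics ES|QL makes possible, the sketch below builds a query that counts error logs per service in fixed time windows. The index pattern and field names (`logs-*`, `log.level`, `service.name`) are assumptions based on common ECS-style mappings, not the project's actual schema, and the helper function is hypothetical.

```python
def build_error_timeline_query(index_pattern="logs-*",
                               lookback="1 hour",
                               bucket="5 minutes"):
    """Return an ES|QL query string counting errors per service per window."""
    return "\n".join([
        f"FROM {index_pattern}",
        f'| WHERE @timestamp >= NOW() - {lookback} AND log.level == "error"',
        f"| STATS errors = COUNT(*) BY service.name, window = BUCKET(@timestamp, {bucket})",
        "| SORT window ASC, errors DESC",
        "| LIMIT 100",
    ])


query = build_error_timeline_query()
print(query)
```

The resulting string could be sent as the `query` field of a request to the Elasticsearch ES|QL endpoint (`POST /_query`); the rows it returns give the agent a per-service error timeline to reason over.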
What's next for Elastic CX Incident Commander
- Deeper integrations (Slack, Jira, PagerDuty, GitHub).
- Continuous learning loop from incident outcomes and analyst feedback.
- Domain packs (fintech, healthcare, DevOps, customer support).
- Policy-aware action controls for enterprise governance and compliance.
Built With
- agent
- elastic
- elasticsearch
