Incident response is one of the most stressful and inefficient workflows in modern engineering teams. When something breaks, engineers scramble across dashboards, logs, runbooks, and ticketing tools to understand what happened and what to do next. We were inspired by how much of this work is repetitive, brittle, and time-consuming, and by how well it suits an agent-based approach. We built Elastic Incident Butler to show that AI agents can move beyond chat: they can analyze data, reason over context, and take reliable action.

What it does

Elastic Incident Butler is a multi-step AI agent that automates incident triage from investigation through to action. Given a natural-language prompt such as “Investigate recent error spikes in service X”, the agent analyzes logs and metrics, identifies anomalies, searches historical incidents for similar patterns, and produces a clear, actionable summary. When its confidence is high, it automatically creates a Jira ticket and posts a structured Slack alert with the supporting evidence and recommendations, reducing both manual effort and response time.
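The decision at the end of that pipeline, acting automatically only when confidence is high, can be sketched as a simple confidence gate. This is a minimal illustration under stated assumptions, not the project's code: the `Finding` and `Plan` shapes, the `plan_actions` function, the 0.8 threshold, and the action labels are all hypothetical. In the real system the actions would be carried out by Elastic Workflows rather than returned as strings.

```python
from dataclasses import dataclass, field

# Hypothetical cut-off; the project does not publish the value it uses.
AUTO_ACTION_THRESHOLD = 0.8


@dataclass
class Finding:
    """Result of the investigation steps for one service (illustrative shape)."""
    service: str
    summary: str
    confidence: float  # assumed to be in [0.0, 1.0]
    evidence: list[str] = field(default_factory=list)


@dataclass
class Plan:
    """The report the agent produces and the follow-up actions it would trigger."""
    report: str
    actions: list[str]


def plan_actions(finding: Finding, threshold: float = AUTO_ACTION_THRESHOLD) -> Plan:
    if not 0.0 <= finding.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Build a short, evidence-backed summary regardless of confidence.
    lines = [f"[{finding.service}] {finding.summary} (confidence {finding.confidence:.2f})"]
    lines += [f"- {item}" for item in finding.evidence]
    report = "\n".join(lines)
    if finding.confidence >= threshold:
        # High confidence: hand off to ticketing and alerting (hypothetical action names).
        return Plan(report=report, actions=["create_jira_ticket", "post_slack_alert"])
    # Lower confidence: return the summary only and leave the decision to a human.
    return Plan(report=report, actions=[])
```

For example, a finding with confidence 0.92 yields both actions, while one with confidence 0.4 yields only the report. Keeping the summary unconditional means the evidence is always available, even when nothing is triggered automatically.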

How we built it

We built the agent with Elastic Agent Builder, pairing a reasoning model with Elastic’s native tools. ES|QL analyzes time-series logs and metrics to detect anomalies. Semantic Search (RAG) retrieves similar past incidents, runbooks, and resolutions. Elastic Workflows execute actions such as ticket creation and notifications in a controlled way. Agent Builder orchestrates each step, and its execution traces make every step of the workflow visible and easier to debug.
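The project does not say how its ES|QL step decides what counts as an anomaly. As one possible approach, assuming an ES|QL `STATS ... BY` aggregation has already returned error counts per fixed time bucket, a trailing-window z-score can flag buckets that jump well above the recent baseline. This is a swapped-in illustrative technique; the `flag_spikes` name, window size, and threshold are assumptions.

```python
from statistics import fmean, pstdev


def flag_spikes(counts: list[int], window: int = 5, z_threshold: float = 3.0) -> list[tuple[int, float]]:
    """Return (bucket_index, z_score) for buckets that spike above a trailing baseline.

    `counts` is assumed to hold per-bucket error counts in time order.
    Only increases are flagged, since the goal is to catch error spikes.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]  # the previous `window` buckets only
        mean = fmean(baseline)
        std = pstdev(baseline)
        if std == 0:
            # Flat baseline: treat any increase as an unbounded deviation.
            z = float("inf") if counts[i] > mean else 0.0
        else:
            z = (counts[i] - mean) / std
        if z >= z_threshold:
            spikes.append((i, z))
    return spikes
```

With the defaults, `flag_spikes([10, 12, 11, 9, 10, 11, 60, 12])` flags only bucket 6. Comparing each bucket against the buckets before it, rather than against the mean of the whole series, keeps a spike from inflating the baseline it is measured against.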

Challenges we ran into

Designing tool schemas for complex ES|QL queries required careful parameterization. Securely configuring the workflow integrations, such as Jira and Slack, also took extra effort. Finally, we had to balance speed against accuracy when querying large log datasets so that responses stayed fast.
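One way to parameterize an ES|QL tool is to expose only a few typed, bounded inputs to the model and render the query from a fixed template, validating every value before it reaches the query text. The sketch below assumes a `logs-*` index pattern with the ECS fields `service.name` and `log.level`; the tool name, the JSON-Schema-style parameter block, and the bounds are hypothetical and are not Agent Builder's own tool format.

```python
import re

# Hypothetical tool definition; names and bounds are illustrative only.
ERROR_RATE_TOOL_SCHEMA = {
    "name": "error_rate_by_service",
    "description": "Count error log events for one service in fixed time buckets.",
    "parameters": {
        "type": "object",
        "properties": {
            "service": {"type": "string", "pattern": "^[a-z0-9][a-z0-9._-]{0,62}$"},
            "lookback_minutes": {"type": "integer", "minimum": 5, "maximum": 1440},
            "bucket_minutes": {"type": "integer", "minimum": 1, "maximum": 60},
        },
        "required": ["service"],
    },
}

# Reuse the schema's pattern so validation and the advertised schema cannot drift.
SERVICE_RE = re.compile(ERROR_RATE_TOOL_SCHEMA["parameters"]["properties"]["service"]["pattern"])


def render_error_rate_query(service: str, lookback_minutes: int = 60, bucket_minutes: int = 5) -> str:
    """Validate the tool inputs, then fill a fixed ES|QL template."""
    if not SERVICE_RE.fullmatch(service):
        raise ValueError(f"invalid service name: {service!r}")
    if not 5 <= lookback_minutes <= 1440:
        raise ValueError("lookback_minutes must be between 5 and 1440")
    if not 1 <= bucket_minutes <= min(60, lookback_minutes):
        raise ValueError("bucket_minutes must be between 1 and min(60, lookback_minutes)")
    return (
        "FROM logs-*\n"
        f'| WHERE service.name == "{service}" AND log.level == "error"'
        f" AND @timestamp > NOW() - {lookback_minutes} minutes\n"
        f"| STATS errors = COUNT(*) BY bucket = DATE_TRUNC({bucket_minutes} minutes, @timestamp)\n"
        "| SORT bucket"
    )
```

The strict `service` pattern rejects quotes and spaces, so an input like `x" OR true` raises `ValueError` instead of changing the query. Where the Elasticsearch version in use supports ES|QL query parameters, passing values separately from the query text is a sturdier alternative to rendering them into the string.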

Accomplishments that we're proud of

- Reduced incident triage time by ~70% in testing
- A fully automated investigation-to-action workflow
- Grounded, explainable outputs backed by real data
- A clean, modular agent design built on Elastic-native tools

What we learned

We learned that successful agents depend more on tool orchestration and grounding than on prompt complexity. Elastic Agent Builder made it practical to build reliable, explainable agents that act rather than merely respond.

What's next for Elastic Incident Butler – Auto Incident Triage & Resolution

Next, we plan to add auto-remediation playbooks, confidence-based human approvals, multi-agent peer review, and deeper integrations with CI/CD and on-call systems, moving from triage to full incident resolution.
