Inspiration

Modern production environments generate massive volumes of logs across many services, making incident triage slow, manual, and inconsistent. Operations and compliance teams often struggle to quickly understand incident impact, determine severity, and decide whether regulatory escalation is required. At the same time, over-automation in high-stakes environments can introduce new risks if actions are taken without sufficient evidence or human oversight.

We were inspired to explore how Elastic Agent Builder could be used to automate this messy, high-pressure workflow using real operational data and controlled actions, not prompt-based assumptions, while still preserving human authority and auditability.


What it does

AY Elastic Incident Triage Agent is a multi-step AI agent that analyzes production incidents using Elasticsearch logs and manages the incident lifecycle in a safe, auditable way. The agent retrieves recent ERROR and WARN events, correlates them by incident ID, identifies affected services and event categories, and classifies incident severity and compliance risk based strictly on observed data.
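
The correlation and classification step can be sketched roughly as follows. This is a minimal illustration, not the agent's actual logic: the event field names (level, incident_id, service, category), the severity thresholds, and the set of regulated categories are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentSummary:
    incident_id: str
    services: set = field(default_factory=set)
    categories: set = field(default_factory=set)
    error_count: int = 0
    warn_count: int = 0


def summarize(events):
    """Group ERROR/WARN events by incident ID and record what was observed."""
    summaries = {}
    for event in events:
        level = event.get("level")
        incident_id = event.get("incident_id")
        if level not in ("ERROR", "WARN") or not incident_id:
            continue  # skip other levels and events with no incident ID
        summary = summaries.setdefault(incident_id, IncidentSummary(incident_id))
        summary.services.add(event.get("service", "unknown"))
        summary.categories.add(event.get("category", "unknown"))
        if level == "ERROR":
            summary.error_count += 1
        else:
            summary.warn_count += 1
    return summaries


def classify_severity(summary):
    """Hypothetical thresholds, chosen only for illustration."""
    if summary.error_count >= 5 or len(summary.services) >= 3:
        return "HIGH"
    if summary.error_count >= 1:
        return "MEDIUM"
    return "LOW"


def needs_compliance_review(summary, regulated_categories):
    """Flag the incident if any observed category is in the regulated set."""
    return bool(summary.categories & set(regulated_categories))
```

Because both functions read only the summarized events, a classification can always be traced back to the log lines that produced it.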

Beyond analysis, the agent integrates with MCP (Model Context Protocol) to create, reference, close, and reopen incident tickets. Sensitive actions such as closing or reopening incidents are protected by an explicit human CONFIRM gate, ensuring operators retain control when evidence is incomplete or ambiguous. The result is a fast, consistent, and production-safe incident triage workflow.
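
A confirmation gate of this kind might look like the sketch below. The action names, the exception type, and the run_action helper are hypothetical; only the idea of requiring an explicit CONFIRM from the operator before closing or reopening comes from the project itself.

```python
# Actions that must never run without explicit operator approval (assumed names).
PRIVILEGED_ACTIONS = {"close_incident", "reopen_incident"}


class ConfirmationRequired(Exception):
    """Raised when a privileged action is requested without operator approval."""


def run_action(action, incident_id, operator_reply=None):
    """Describe the executed action, or refuse if approval is missing."""
    if action in PRIVILEGED_ACTIONS and operator_reply != "CONFIRM":
        raise ConfirmationRequired(
            f"{action} on {incident_id} needs the operator to reply CONFIRM"
        )
    return f"{action} executed for {incident_id}"
```

Non-privileged actions such as creating a ticket pass straight through, while a close or reopen request fails loudly until the operator's reply is exactly CONFIRM.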


How we built it

We built the agent using Elastic Agent Builder with ES|QL-powered tools for log retrieval and aggregation. The agent follows a deterministic workflow: retrieving telemetry, assessing freshness, summarizing incidents by ID, and deriving severity, root cause, and compliance impact from evidence.
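
As a rough illustration of the retrieval and freshness steps, the sketch below pairs an ES|QL aggregation with a simple staleness check. The index pattern, field names, 15-minute window, and the is_fresh helper are assumptions for the example, not the project's actual tool definitions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative ES|QL query: index pattern and field names are assumed.
RECENT_INCIDENTS_QUERY = """
FROM logs-*
| WHERE log.level IN ("ERROR", "WARN") AND @timestamp >= NOW() - 15 minutes
| STATS event_count = COUNT(*), last_seen = MAX(@timestamp) BY incident_id
| SORT event_count DESC
| LIMIT 20
"""


def is_fresh(last_seen, now, max_age=timedelta(minutes=15)):
    """Return True if the newest event is recent enough to draw conclusions from."""
    return now - last_seen <= max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = is_fresh(now - timedelta(minutes=5), now)   # within the window
stale = is_fresh(now - timedelta(minutes=30), now)   # older than the window
```

Checking freshness before summarizing lets the agent say that its telemetry is stale instead of classifying an incident from outdated evidence.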

We integrated MCP as the authoritative control plane for incident state, allowing the agent to list, create, close, and reopen incidents through structured tools. A lightweight incident portal was built to demonstrate this integration and show how AI agents can safely take real operational actions beyond analysis alone.
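
The shape of such a control plane can be sketched with an in-memory stand-in. The IncidentStore class and its method names are hypothetical and do not reflect the portal's real interface; the sketch only shows the list, create, close, and reopen operations with a single authoritative record per incident.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    incident_id: str
    status: str = "open"


class IncidentStore:
    """In-memory stand-in for the incident control plane (hypothetical)."""

    def __init__(self):
        self._tickets = {}

    def list_tickets(self):
        return list(self._tickets.values())

    def create(self, incident_id):
        # Reuse the existing ticket instead of opening a duplicate.
        if incident_id not in self._tickets:
            self._tickets[incident_id] = Ticket(incident_id)
        return self._tickets[incident_id]

    def close(self, incident_id):
        ticket = self._tickets[incident_id]  # KeyError if the ticket is unknown
        ticket.status = "closed"
        return ticket

    def reopen(self, incident_id):
        ticket = self._tickets[incident_id]  # KeyError if the ticket is unknown
        ticket.status = "open"
        return ticket
```

Keeping ticket state in one place means the agent looks up the current status before acting rather than inferring it from the conversation.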


Challenges we ran into

One challenge was designing ES|QL queries that were expressive yet schema-safe, especially when aggregating incidents without introducing assumptions. Another challenge was enforcing clear boundaries between automation and human authority, ensuring the agent could assist decisively without acting unsafely in regulated scenarios. This led to the introduction of an explicit confirmation mechanism for privileged actions.
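
One way to keep queries schema-safe is to validate field names against a known list before building the query text. The sketch below assumes a hypothetical field list and helper name; it is not how the project's tools are defined.

```python
# Fields assumed to exist in the log schema (illustrative only).
KNOWN_FIELDS = {
    "@timestamp",
    "log.level",
    "service.name",
    "incident_id",
    "event.category",
}


def build_stats_query(index, group_by):
    """Build an ES|QL aggregation, rejecting fields outside the known schema."""
    unknown = [name for name in group_by if name not in KNOWN_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    by_clause = ", ".join(group_by)
    return (
        f"FROM {index}\n"
        '| WHERE log.level IN ("ERROR", "WARN")\n'
        f"| STATS event_count = COUNT(*) BY {by_clause}"
    )
```

Failing fast on an unknown field turns a silent schema assumption into a visible error before any query runs.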


Accomplishments that we're proud of

We built a fully tool-driven agent that grounds every conclusion in Elasticsearch data and treats MCP as the single source of truth for incident state. The agent avoids duplicate tickets, accounts for telemetry freshness, and enforces explicit human authorization for sensitive actions. The result is an explainable, audit-ready system suitable for real production use.


What we learned

We learned that Elastic Agent Builder is well suited to building controlled, production-grade AI agents. Tool-based reasoning forces clearer logic, better data modeling, and safer automation than prompt-only approaches, especially in compliance-sensitive environments.


What's next for AY Elastic Incident Triage Agent

Next, we plan to integrate with real enterprise ticketing systems, add time-series anomaly detection, and explore multi-agent workflows where one agent performs operational triage while another independently validates compliance and remediation decisions.
