Problem Solved

Production incidents cost engineering teams hours of manual investigation — querying dashboards, correlating logs across services, forming hypotheses, and verifying fixes. SENTINEL automates this entire lifecycle by deploying five specialized AI agents that work through your actual Elasticsearch data autonomously.

What We Built

SENTINEL deploys five real Agent Builder agents to your Kibana instance, each with a distinct role: Planner (decomposes incidents into investigation tasks), Investigator (runs targeted ES|QL queries to isolate root causes), Correlator (maps blast radius across services), Remediator (generates executable remediation commands with risk assessments), and Verifier (validates fixes with data-driven queries and closes the incident). The agents chain sequentially, each passing structured findings to the next.

Beyond the pipeline, SENTINEL includes an AI Tool Creator — describe a monitoring tool in plain English and it generates a fully-typed, deployment-ready Agent Builder tool definition with valid ES|QL queries and parameter schemas. There's also a Natural Language → ES|QL query composer, a Monaco-powered ES|QL workbench, and a geographic node health map pulling live data from /_nodes/stats.

Agent Builder Features Used

  • Agent CRUD API (POST /api/agent_builder/agents) for one-click deployment of all five agents
  • SSE streaming via POST /api/agent_builder/converse/async — parsing reasoning, tool_call, tool_progress, message_chunk, and message_complete events in real-time
  • All five core tools: execute_esql, search, list_indices, get_index_mapping, get_document_by_id
  • Tool CRUD API for the AI Tool Creator's deploy-to-Kibana workflow
  • Conversation management and connector selection for multi-model support

Features We Liked

The SSE streaming protocol is excellent. Getting real-time reasoning events lets us show the agent's chain-of-thought live — users can watch the agent think, see which tools it calls, and follow the investigation as it happens. That transparency builds trust in a way that batch responses never could.

The execute_esql tool is incredibly powerful for incident investigation. Agents can run complex analytical queries across any index pattern without us having to build custom integrations for each data source.

Challenge

The search tool can run analytical queries but cannot execute write operations like _delete_by_query. We solved this by having the Remediator output exact ES API commands in a structured format, then building executable command cards in the UI that parse those commands from the reasoning trace and let operators run them with one click, complete with risk badges (LOW/MED/HIGH) and inline result display.
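The command-card parser can be sketched as a small pure function. The exact structured format is our own convention, not anything Agent Builder prescribes — here we assume, for illustration, that the Remediator emits each command as a `CMD:` line followed by a `RISK:` line:

```typescript
// Hypothetical sketch: the CMD:/RISK: line format is our own convention for
// the Remediator's output, chosen so the UI can parse it deterministically.
type Risk = "LOW" | "MED" | "HIGH";

interface RemediationCommand {
  method: string;
  path: string;
  body?: string;
  risk: Risk;
}

function parseRemediationCommands(reasoning: string): RemediationCommand[] {
  const commands: RemediationCommand[] = [];
  const lines = reasoning.split("\n");
  for (let i = 0; i < lines.length; i++) {
    // Expected shape: "CMD: POST /some-index/_delete_by_query {...json...}"
    const m = lines[i].match(/^CMD:\s+(GET|POST|PUT|DELETE)\s+(\S+)\s*(.*)$/);
    if (!m) continue;
    const riskMatch = lines[i + 1]?.match(/^RISK:\s+(LOW|MED|HIGH)$/);
    commands.push({
      method: m[1],
      path: m[2],
      body: m[3] || undefined,
      // Fail safe: treat unlabeled commands as HIGH risk.
      risk: (riskMatch?.[1] as Risk) ?? "HIGH",
    });
  }
  return commands;
}
```

Defaulting an unlabeled command to HIGH keeps the one-click execution path conservative: the operator always sees the scariest badge unless the agent explicitly said otherwise.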

Inspiration

On-call engineers spend hours manually correlating logs, metrics, and traces across distributed systems during production incidents. Mean Time To Resolution (MTTR) for P1 incidents routinely exceeds 45 minutes, and most of that is human toil: querying dashboards, forming hypotheses, checking dependencies, and verifying fixes. We asked: what if a pipeline of specialized AI agents could do all of that autonomously, using the same Elasticsearch data engineers already rely on? Elastic Agent Builder gave us the foundation to make that real.

What it does

SENTINEL is an autonomous incident intelligence platform that deploys five specialized AI agents to your Elastic Agent Builder instance. Each agent has a distinct role in the incident response lifecycle:

  • Planner — decomposes the incident into an investigation plan with specific ES|QL queries
  • Investigator — executes those queries against your live Elasticsearch data to isolate the root cause
  • Correlator — maps blast radius across dependent services and indices
  • Remediator — generates executable remediation commands with risk assessments and rollback procedures
  • Verifier — validates the fix with data-driven verification queries and closes the incident

Beyond the pipeline, SENTINEL includes a Natural Language → ES|QL query composer, an AI Tool Creator that generates deployment-ready Agent Builder tool definitions from plain English, a Monaco-powered ES|QL workbench, geographic node health mapping, and a real-time agent orchestration dashboard that streams each agent's chain-of-thought reasoning live.

How we built it

The core pipeline (useLiveIncidentRunner) scans the user's Elasticsearch cluster using three fallback strategies (_cat/indices, ES|QL FROM * METADATA _index, raw _cat), synthesizes a contextual incident from whatever data it finds, then runs all five agents sequentially via POST /api/agent_builder/converse/async. Each agent's SSE stream is parsed in real-time — reasoning, tool_call, tool_progress, message_chunk, and message_complete events are mapped to live UI state.
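The sequential handoff at the heart of the pipeline can be sketched like this. The `converse` callback stands in for our wrapper around POST /api/agent_builder/converse/async, and the agent ids and findings format are illustrative, not the real deployed ids:

```typescript
// Sketch of the sequential agent handoff. `converse` is assumed to wrap the
// async converse endpoint and resolve with the agent's final message; the
// agent ids below are illustrative placeholders.
type Converse = (agentId: string, input: string) => Promise<string>;

const PIPELINE = ["planner", "investigator", "correlator", "remediator", "verifier"];

async function runPipeline(incident: string, converse: Converse): Promise<string[]> {
  const findings: string[] = [];
  let context = `Incident: ${incident}`;
  for (const agentId of PIPELINE) {
    // Each agent sees the original incident plus every upstream agent's
    // structured findings, so context accumulates as the pipeline advances.
    const result = await converse(agentId, context);
    findings.push(result);
    context += `\n\n[${agentId}] ${result}`;
  }
  return findings;
}
```

Accumulating the full upstream context (rather than only the previous agent's output) is what lets the Verifier cross-check the Remediator's commands against the Investigator's original evidence.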

The AI Tool Creator uses an existing deployed agent as a reasoning backend — it sends a structured system prompt that enforces valid ES|QL syntax, correct parameter types (keyword, long, date, etc.), and the Agent Builder tool schema. The output is parsed, validated, and rendered in a review panel with Monaco editing before one-click deployment.
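The validation step can be sketched as a lint pass over the generated definition. The `GeneratedTool` shape below is a simplified stand-in for the real Agent Builder tool schema, and the checks (lowercase ids, FROM-first queries, `?name` parameter placeholders) are assumptions about our own prompt contract rather than hard platform rules:

```typescript
// Illustrative lint pass over a generated tool definition. The shape and the
// allowed parameter types are simplified assumptions for this sketch.
const ALLOWED_PARAM_TYPES = new Set([
  "keyword", "text", "long", "integer", "double", "date", "boolean",
]);

interface GeneratedTool {
  id: string;
  description: string;
  query: string; // ES|QL source
  params: Record<string, { type: string }>;
}

function validateTool(tool: GeneratedTool): string[] {
  const errors: string[] = [];
  if (!/^[a-z0-9_-]+$/.test(tool.id)) errors.push(`invalid tool id: ${tool.id}`);
  // Our prompt contract asks for index queries, so we expect a FROM source.
  if (!tool.query.trim().toUpperCase().startsWith("FROM"))
    errors.push("ES|QL query must start with FROM");
  for (const [name, p] of Object.entries(tool.params)) {
    if (!ALLOWED_PARAM_TYPES.has(p.type)) errors.push(`param ${name}: unknown type ${p.type}`);
    // Named ES|QL parameters appear as ?name in the query text.
    if (!tool.query.includes(`?${name}`)) errors.push(`param ${name} unused in query`);
  }
  return errors;
}
```

Returning a list of errors (instead of throwing on the first) lets the review panel surface every problem at once before the operator edits and redeploys.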

We made sure SENTINEL works with any Elasticsearch cluster and any data, with no hardcoded index names.

Challenges we ran into

The SSE streaming protocol was the biggest technical challenge. Agent Builder's event stream includes multiple event types (reasoning, tool_call, tool_progress, tool_result, message_chunk, message_complete) that arrive in unpredictable order with partial chunks. Building a reliable parser that buffers correctly, maps events to the right UI state, and handles abort/reconnect took significant iteration.

Another challenge: the platform.core.search tool can run analytical queries but cannot execute write operations like _delete_by_query. We redesigned the Remediator agent to output exact Elasticsearch API commands in a structured format, and built an executable command card system in the UI that parses those commands from the reasoning trace and lets operators execute them with one click, complete with risk badges and result display.

Making the pipeline work generically (no hardcoded indices) required three fallback cluster scanning strategies and dynamic incident synthesis based on whatever data patterns exist in the user's cluster.
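The fallback chain itself is simple: try each discovery strategy in order and take the first that yields usable indices. In this sketch the strategy functions are placeholders for the _cat/indices call and the `FROM * METADATA _index` ES|QL query; the system-index filter is our own choice:

```typescript
// Sketch of the fallback index-discovery chain. Each Strategy is a placeholder
// for one of the real scanning calls (_cat/indices, ES|QL FROM * METADATA
// _index, raw _cat); filtering dot-prefixed system indices is our own policy.
type Strategy = () => Promise<string[]>;

async function discoverIndices(strategies: Strategy[]): Promise<string[]> {
  for (const strategy of strategies) {
    try {
      const indices = await strategy();
      // Skip system indices so the synthesized incident targets user data.
      const userIndices = indices.filter((name) => !name.startsWith("."));
      if (userIndices.length > 0) return userIndices;
    } catch {
      // A failed strategy (permissions, API differences across versions)
      // falls through to the next one.
    }
  }
  return [];
}
```

Treating a thrown error and an empty result identically is what makes the chain robust across clusters where, say, _cat is blocked but ES|QL is not.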

Accomplishments that we're proud of

  • Five real Agent Builder agents deployed and orchestrated autonomously — not a simulation
  • The AI Tool Creator: natural language → fully-typed, deployment-ready ES|QL tool definition in one shot
  • Executable command cards that parse ES API commands from agent reasoning and let you run them inline
  • Zero hardcoded index names — works with any Elasticsearch cluster out of the box
  • Real-time streaming of agent chain-of-thought reasoning with tool call visualization

What we learned

Agent orchestration is fundamentally a prompt engineering challenge. Each agent in the pipeline needs carefully scoped instructions — too broad and they hallucinate irrelevant queries, too narrow and they miss important patterns. The handoff context between agents (what the Planner found → what the Investigator should query) matters as much as the individual prompts.

We also learned that ES|QL is remarkably powerful for incident investigation when paired with an AI reasoning layer. Queries like FROM * METADATA _index | STATS doc_count = COUNT(*) BY _index give agents a complete cluster overview in one call, and the BUCKET(@timestamp, N minutes) function makes time-series anomaly detection straightforward.

What's next for SENTINEL

  • Parallel agent execution — run Investigator and Correlator simultaneously instead of sequentially
  • Feedback loops — let the Verifier trigger a re-run of the Remediator if verification fails
  • Custom agent pipelines — drag-and-drop pipeline builder where users define their own agent roles and handoff logic
  • Alerting integration — connect to Elastic Alerting rules to trigger the pipeline automatically on real incidents
  • Multi-cluster support — orchestrate across multiple Elasticsearch deployments from a single SENTINEL instance
