Inspiration/Problem

When production incidents strike, DevOps and SRE teams face a critical time crunch. Engineers spend 30-60 minutes manually searching through thousands of log entries, correlating events across microservices, and piecing together what went wrong. This manual investigation process delays resolution, extends downtime, and frustrates both engineers and users.

The Solution

LogSleuth is an AI-powered incident investigation agent built with Elastic Agent Builder and Elasticsearch. It automates the entire investigation workflow, from initial alert to root cause analysis, reducing mean time to resolution (MTTR) from 30+ minutes to under 5 minutes.

How It Works

LogSleuth follows a structured investigation process:

  1. Search - Queries Elasticsearch for error logs matching the incident description
  2. Analyze - Identifies error patterns and detects anomaly spikes
  3. Correlate - Traces requests across services using distributed trace IDs
  4. Synthesize - Generates a comprehensive report with root cause, affected services, timeline, and remediation suggestions
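
The four steps above can be sketched end to end in plain Python. This is a minimal illustration, not LogSleuth's actual code: it runs over a hypothetical in-memory list instead of Elasticsearch, the function names are made up for the example, and the spike rule (a minute with more than twice the median error count) is an assumed heuristic.

```python
from collections import Counter
from statistics import median

# Hypothetical stand-in for an Elasticsearch index of ECS-style logs:
# (timestamp, service, level, trace id, message)
RAW = [
    ("2024-05-01T10:00:05Z", "gateway", "info", "t1", "GET /checkout"),
    ("2024-05-01T10:00:40Z", "payments", "error", "t1", "timeout calling db"),
    ("2024-05-01T10:01:10Z", "payments", "error", "t2", "timeout calling db"),
    ("2024-05-01T10:02:01Z", "payments", "error", "t3", "timeout calling db"),
    ("2024-05-01T10:02:02Z", "checkout", "error", "t3", "payment failed: upstream timeout"),
    ("2024-05-01T10:02:15Z", "payments", "error", "t4", "timeout calling db"),
    ("2024-05-01T10:02:30Z", "payments", "error", "t5", "timeout calling db"),
    ("2024-05-01T10:02:31Z", "gateway", "error", "t5", "502 from checkout"),
]
FIELDS = ("@timestamp", "service.name", "log.level", "trace.id", "message")
LOGS = [dict(zip(FIELDS, row)) for row in RAW]


def search(logs, keyword):
    """Step 1: error logs whose message mentions the keyword."""
    kw = keyword.lower()
    return [r for r in logs if r["log.level"] == "error" and kw in r["message"].lower()]


def analyze(errors, factor=2):
    """Step 2: errors per minute, plus minutes above factor * median (assumed rule)."""
    per_minute = Counter(r["@timestamp"][:16] for r in errors)  # truncate to YYYY-MM-DDTHH:MM
    if not per_minute:
        return per_minute, []
    baseline = median(per_minute.values())
    spikes = sorted(m for m, n in per_minute.items() if n > factor * baseline)
    return per_minute, spikes


def correlate(logs, errors):
    """Step 3: every log line that shares a trace ID with a matched error."""
    trace_ids = {r["trace.id"] for r in errors}
    return [r for r in logs if r["trace.id"] in trace_ids]


def synthesize(errors, spikes, related):
    """Step 4: a small structured report; the earliest error is only a root-cause candidate."""
    earliest = min(errors, key=lambda r: r["@timestamp"]) if errors else None
    return {
        "error_count": len(errors),
        "spikes": spikes,
        "affected_services": sorted({r["service.name"] for r in related}),
        "earliest_error": earliest,
    }


def investigate(logs, keyword):
    errors = search(logs, keyword)
    _, spikes = analyze(errors)
    related = correlate(logs, errors)
    return synthesize(errors, spikes, related)


report = investigate(LOGS, "timeout")
print(report["spikes"], report["affected_services"])
```

In the real agent each step would be a tool call chosen by the LLM rather than a fixed function chain, so the order and number of calls can vary per incident.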

Key Features

  • 6 Custom Tools: search_logs, get_error_frequency, find_correlated_logs, find_error_traces, search_past_incidents, save_investigation
  • Multi-Service Correlation: Traces errors across microservices using trace IDs
  • Knowledge Base: Saves investigations for future reference and pattern matching
  • Interactive Dashboard: Streamlit-based UI with real-time metrics and visualizations
  • CLI Interface: Full command-line access for terminal-based workflows
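
To make the multi-service correlation idea concrete, here is a rough sketch of grouping log records by trace ID and finding where a request first failed. The helper names (`correlate_by_trace`, `first_failure`) and the sample records are hypothetical, and the real tools run as ES|QL queries rather than in Python.

```python
from collections import defaultdict

# Hypothetical sample records using flattened ECS-style field names.
SAMPLE = [
    {"@timestamp": "2024-05-01T10:02:02Z", "service.name": "checkout",
     "log.level": "error", "trace.id": "t3", "message": "payment failed"},
    {"@timestamp": "2024-05-01T10:02:00Z", "service.name": "gateway",
     "log.level": "info", "trace.id": "t3", "message": "POST /checkout"},
    {"@timestamp": "2024-05-01T10:02:01Z", "service.name": "payments",
     "log.level": "error", "trace.id": "t3", "message": "timeout calling db"},
    {"@timestamp": "2024-05-01T10:03:00Z", "service.name": "gateway",
     "log.level": "info", "message": "health check"},
]


def correlate_by_trace(logs):
    """Group records by trace.id and order each group chronologically."""
    traces = defaultdict(list)
    for record in logs:
        trace_id = record.get("trace.id")
        if trace_id:  # records without a trace ID cannot be correlated
            traces[trace_id].append(record)
    for events in traces.values():
        # ISO 8601 UTC timestamps in one format sort correctly as strings.
        events.sort(key=lambda r: r["@timestamp"])
    return dict(traces)


def first_failure(events):
    """Return the service that logged the earliest error in a trace, or None."""
    for record in events:
        if record["log.level"] == "error":
            return record["service.name"]
    return None


traces = correlate_by_trace(SAMPLE)
path = [r["service.name"] for r in traces["t3"]]
print(path, first_failure(traces["t3"]))
```

The earliest error in a trace is only a starting point for root cause analysis; a downstream service can fail first simply because it logs sooner.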

Technical Implementation

  • Built on Elastic Agent Builder with ES|QL-powered tools
  • ECS-compatible log schema for standardized data
  • Elasticsearch for high-performance log search and aggregations
  • Streamlit dashboard with Plotly visualizations
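
As an illustration of an ES|QL-powered tool, the sketch below pairs a parameterized query with a description and builds a request body for it. Everything in it is an assumption for the example: the dictionary layout is not the Agent Builder schema, the `logs-*` index pattern and the query text are guesses at what a tool like get_error_frequency might run, and the `params` shape follows my understanding of named parameters in the ES|QL query API. Nothing is sent to a cluster.

```python
import re

# Hypothetical tool definition; the keys are illustrative only.
ERROR_FREQUENCY_TOOL = {
    "name": "get_error_frequency",
    "description": "Count error logs per 5-minute bucket for one service.",
    "query": (
        "FROM logs-* "
        '| WHERE log.level == "error" AND service.name == ?service '
        "| STATS errors = COUNT(*) BY bucket = BUCKET(@timestamp, 5 minutes) "
        "| SORT bucket"
    ),
    "params": {"service": str},
}


def build_request(tool, **args):
    """Validate arguments against the declared parameters and build a request body.

    Values travel separately from the query text, so user input is never
    spliced into the ES|QL string.
    """
    declared = tool["params"]
    placeholders = set(re.findall(r"\?(\w+)", tool["query"]))
    if placeholders != set(declared):
        raise ValueError("query placeholders and declared params differ")
    if set(args) != set(declared):
        raise TypeError(f"expected arguments: {sorted(declared)}")
    for name, expected_type in declared.items():
        if not isinstance(args[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
    return {
        "query": tool["query"],
        "params": [{name: args[name]} for name in declared],
    }


body = build_request(ERROR_FREQUENCY_TOOL, service="payments")
print(body["params"])
```

Keeping the description next to the query matters because the description is what the LLM reads when deciding which tool to call.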

Impact

LogSleuth transforms incident response from a manual, time-consuming process into an automated, intelligent workflow. By leveraging Elasticsearch's search capabilities through Agent Builder, teams can resolve incidents faster, reduce downtime, and focus on prevention rather than firefighting.


Features Used

  • Elastic Agent Builder (custom agent + tools)
  • Elasticsearch (data storage, search, aggregations)
  • ES|QL queries
  • ECS-compatible log schema
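
For readers unfamiliar with ECS, a log document in that shape looks roughly like the sample below. The field names (`@timestamp`, `log.level`, `service.name`, `trace.id`, `message`, `error.message`) are standard ECS fields; the values, the `REQUIRED_FIELDS` list, and the helper functions are assumptions made for this sketch rather than part of LogSleuth.

```python
# Sample document with standard ECS field names and made-up values.
SAMPLE_DOC = {
    "@timestamp": "2024-05-01T10:02:01Z",
    "log": {"level": "error"},
    "service": {"name": "payments"},
    "trace": {"id": "t3"},
    "message": "timeout calling db",
    "error": {"message": "connection timed out"},
}

# Assumed minimum set of fields the investigation tools would rely on.
REQUIRED_FIELDS = ("@timestamp", "log.level", "service.name", "trace.id", "message")


def get_field(doc, dotted):
    """Resolve a dotted ECS field name such as 'service.name' in a nested dict."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def missing_fields(doc, required=REQUIRED_FIELDS):
    """List the required fields that are absent from the document."""
    return [name for name in required if get_field(doc, name) is None]


print(missing_fields(SAMPLE_DOC))
```

A check like this is useful before indexing, since a log line without `trace.id` cannot take part in cross-service correlation.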

Challenges & Learnings

  1. Challenge: Designing ES|QL queries that work as reusable agent tools
     Learning: Parameterized queries with clear descriptions help the LLM select the right tool

  2. Challenge: Correlating logs across distributed services
     Learning: Trace IDs are essential; the tool design must support iterative investigation

  3. Challenge: Making the agent's reasoning transparent
     Learning: Structured output formats (timelines, tables) make findings actionable
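
As an example of the structured output mentioned in the third learning, the sketch below renders a list of findings as a plain-text timeline table. The event keys (`time`, `service`, `level`, `message`) and the `render_timeline` helper are hypothetical; the agent's real report format may differ.

```python
# Hypothetical findings, deliberately out of order.
EVENTS = [
    {"time": "10:02:02", "service": "checkout", "level": "error", "message": "payment failed"},
    {"time": "10:02:01", "service": "payments", "level": "error", "message": "timeout calling db"},
]


def render_timeline(events):
    """Render events as a fixed-width plain-text table, sorted by time."""
    headers = ("Time", "Service", "Level", "Message")
    rows = [
        (e["time"], e["service"], e["level"], e["message"])
        for e in sorted(events, key=lambda e: e["time"])
    ]
    # Column width is the longest cell in that column, header included.
    widths = [max(len(str(cell)) for cell in column) for column in zip(headers, *rows)]

    def fmt(row):
        return " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths)).rstrip()

    separator = "-+-".join("-" * w for w in widths)
    return "\n".join([fmt(headers), separator, *map(fmt, rows)])


print(render_timeline(EVENTS))
```

A fixed layout like this lets an on-call engineer scan the order of events at a glance instead of reading a free-form paragraph.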

What I Liked

  1. Agent Builder's Tool Framework: Converting ES|QL queries into callable tools is elegant and powerful
  2. Elasticsearch Performance: Sub-second query responses even with complex aggregations
  3. Flexibility: The agent can handle open-ended incident descriptions and adapt its investigation approach

I also wrote an article about the project: Building LogSleuth
