LogSleuth

Inspiration/Problem

When production incidents strike, DevOps and SRE teams face a critical time crunch. Engineers spend 30-60 minutes manually searching through thousands of log entries, correlating events across microservices, and piecing together what went wrong. This manual investigation process delays resolution, extends downtime, and frustrates both engineers and users.

The Solution

LogSleuth is an AI-powered incident investigation agent built with Elastic Agent Builder and Elasticsearch. It automates the entire investigation workflow, from initial alert to root cause analysis, cutting investigation time from 30+ minutes to under 5 minutes and lowering mean time to resolution (MTTR) accordingly.

How It Works

LogSleuth follows a structured investigation process:

Search - Queries Elasticsearch for error logs matching the incident description
Analyze - Identifies error patterns and detects anomaly spikes
Correlate - Traces requests across services using distributed trace IDs
Synthesize - Generates a comprehensive report with root cause, affected services, timeline, and remediation suggestions
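
The four steps above can be sketched as a small pipeline. This is an illustrative sketch only, not LogSleuth's actual code: it assumes a hypothetical in-memory list of log records in place of Elasticsearch, and every function name and field name here (search, analyze, correlate, synthesize, ts, trace_id) is invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory logs standing in for Elasticsearch documents.
LOGS = [
    {"ts": "2024-01-01T10:00:05", "service": "checkout", "level": "error",
     "trace_id": "t1", "message": "payment timeout"},
    {"ts": "2024-01-01T10:00:07", "service": "payments", "level": "error",
     "trace_id": "t1", "message": "db connection refused"},
    {"ts": "2024-01-01T10:00:09", "service": "checkout", "level": "info",
     "trace_id": "t2", "message": "order placed"},
    {"ts": "2024-01-01T10:00:40", "service": "checkout", "level": "error",
     "trace_id": "t3", "message": "payment timeout"},
]


def search(logs, keyword):
    """Search: error logs whose message mentions the keyword."""
    return [l for l in logs if l["level"] == "error" and keyword in l["message"]]


def analyze(hits):
    """Analyze: count how often each error message repeats."""
    return Counter(h["message"] for h in hits)


def correlate(logs, hits):
    """Correlate: collect every log that shares a trace ID with a hit."""
    trace_ids = {h["trace_id"] for h in hits}
    by_trace = defaultdict(list)
    for l in logs:
        if l["trace_id"] in trace_ids:
            by_trace[l["trace_id"]].append(l)
    return dict(by_trace)


def synthesize(patterns, traces):
    """Synthesize: build a report with top error, services, and timeline."""
    timeline = sorted(
        (l for group in traces.values() for l in group), key=lambda l: l["ts"]
    )
    return {
        "top_error": patterns.most_common(1)[0][0] if patterns else None,
        "affected_services": sorted({l["service"] for l in timeline}),
        "timeline": [(l["ts"], l["service"], l["message"]) for l in timeline],
    }


def investigate(logs, keyword):
    hits = search(logs, keyword)
    return synthesize(analyze(hits), correlate(logs, hits))


report = investigate(LOGS, "timeout")
```

In this toy data, correlating on trace ID t1 surfaces the payments error even though its message never mentions a timeout, which is the kind of cross-service link the Correlate step is meant to expose.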

Key Features

6 Custom Tools: search_logs, get_error_frequency, find_correlated_logs, find_error_traces, search_past_incidents, save_investigation
Multi-Service Correlation: Traces errors across microservices using trace IDs
Knowledge Base: Saves investigations for future reference and pattern matching
Interactive Dashboard: Streamlit-based UI with real-time metrics and visualizations
CLI Interface: Full command-line access for terminal-based workflows
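
To illustrate the knowledge base idea, here is a minimal sketch of saving an investigation and matching a new incident against past ones. It borrows the tool names save_investigation and search_past_incidents from the list above, but the bodies are hypothetical stand-ins: they assume an in-memory list and simple word overlap, whereas the real tools store and query data in Elasticsearch.

```python
import re

# Hypothetical in-memory store; the real tools persist to Elasticsearch.
_PAST_INCIDENTS = []


def _tokens(text):
    """Lowercase word set used for naive overlap matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def save_investigation(summary, root_cause):
    """Record a finished investigation for later lookup."""
    _PAST_INCIDENTS.append({"summary": summary, "root_cause": root_cause})


def search_past_incidents(description, min_overlap=2):
    """Return past incidents sharing at least min_overlap words, best first."""
    query = _tokens(description)
    scored = []
    for incident in _PAST_INCIDENTS:
        overlap = len(query & _tokens(incident["summary"]))
        if overlap >= min_overlap:
            scored.append((overlap, incident))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in scored]


save_investigation("checkout payment timeout spike", "payments db pool exhausted")
save_investigation("login latency after deploy", "cache misconfiguration")
matches = search_past_incidents("payment timeout in checkout service")
```

Word overlap is only a placeholder for whatever relevance scoring the search backend provides; the point is that a saved root cause becomes a candidate explanation for a similar new incident.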

Technical Implementation

Built on Elastic Agent Builder with ES|QL-powered tools
ECS-compatible log schema for standardized data
Elasticsearch for high-performance log search and aggregations
Streamlit dashboard with Plotly visualizations
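
As a rough illustration of what an ECS-compatible log document can look like, the sketch below maps a flat application record onto a few standard ECS fields (@timestamp, log.level, service.name, trace.id, message). The input record shape and the to_ecs helper are assumptions made for this example; the project's actual schema may include more fields.

```python
from datetime import datetime, timezone


def to_ecs(record):
    """Map a flat log record (hypothetical shape) to ECS-style nested fields."""
    return {
        "@timestamp": record["time"].astimezone(timezone.utc).isoformat(),
        "log": {"level": record["level"].lower()},
        "service": {"name": record["service"]},
        "trace": {"id": record["trace_id"]},
        "message": record["msg"],
    }


doc = to_ecs({
    "time": datetime(2024, 1, 1, 10, 0, 5, tzinfo=timezone.utc),
    "level": "ERROR",
    "service": "checkout",
    "trace_id": "t1",
    "msg": "payment timeout",
})
```

Keeping service.name and trace.id in the same place for every service is what lets a single query filter or group across all of them.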

Impact

LogSleuth transforms incident response from a manual, time-consuming process into an automated, intelligent workflow. By leveraging Elasticsearch's search capabilities through Agent Builder, teams can resolve incidents faster, reduce downtime, and focus on prevention rather than firefighting.

Features Used

Elastic Agent Builder (custom agent + tools)
Elasticsearch (data storage, search, aggregations)
ES|QL queries
ECS-compatible log schema

Challenges & Learnings

Challenge: Designing ES|QL queries that work as reusable agent tools. Learning: Parameterized queries with clear descriptions help the LLM select the right tool.
Challenge: Correlating logs across distributed services. Learning: Trace IDs are essential; the tool design must support iterative investigation.
Challenge: Making the agent's reasoning transparent. Learning: Structured output formats (timelines, tables) make findings actionable.

What I Liked

Agent Builder's Tool Framework: Converting ES|QL queries into callable tools is elegant and powerful
Elasticsearch Performance: Sub-second query responses even with complex aggregations
Flexibility: The agent can handle open-ended incident descriptions and adapt its investigation approach

I wrote an article about it. Read it here: Building LogSleuth