Inspiration/Problem
When production incidents strike, DevOps and SRE teams face a critical time crunch. Engineers spend 30-60 minutes manually searching through thousands of log entries, correlating events across microservices, and piecing together what went wrong. This manual investigation process delays resolution, extends downtime, and frustrates both engineers and users.
The Solution
LogSleuth is an AI-powered incident investigation agent built with Elastic Agent Builder and Elasticsearch. It automates the entire investigation workflow, from initial alert to root cause analysis, reducing mean time to resolution (MTTR) from 30+ minutes to under 5 minutes.
How It Works
LogSleuth follows a structured investigation process:
- Search: Queries Elasticsearch for error logs matching the incident description
- Analyze: Identifies error patterns and detects anomaly spikes
- Correlate: Traces requests across services using distributed trace IDs
- Synthesize: Generates a comprehensive report with root cause, affected services, timeline, and remediation suggestions
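The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the real agent runs ES|QL tools against Elasticsearch, whereas here a small in-memory list stands in for the index, and the field names (`ts`, `service`, `level`, `message`, `trace_id`) and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical in-memory logs standing in for an Elasticsearch index.
LOGS = [
    {"ts": "2024-05-01T10:00:05", "service": "checkout", "level": "error",
     "message": "payment timeout", "trace_id": "t1"},
    {"ts": "2024-05-01T10:00:07", "service": "payments", "level": "error",
     "message": "db connection refused", "trace_id": "t1"},
    {"ts": "2024-05-01T10:00:40", "service": "checkout", "level": "error",
     "message": "payment timeout", "trace_id": "t2"},
    {"ts": "2024-05-01T10:00:41", "service": "payments", "level": "error",
     "message": "db connection refused", "trace_id": "t2"},
    {"ts": "2024-05-01T09:58:00", "service": "search", "level": "info",
     "message": "reindex complete", "trace_id": "t0"},
]


def search(logs, keyword):
    """Step 1: keep error logs whose message mentions the keyword."""
    return [log for log in logs
            if log["level"] == "error" and keyword in log["message"]]


def analyze(hits):
    """Step 2: count repeated messages and errors per minute."""
    patterns = Counter(hit["message"] for hit in hits)
    per_minute = Counter(hit["ts"][:16] for hit in hits)  # truncate to minute
    return patterns, per_minute


def correlate(logs, hits):
    """Step 3: collect every log that shares a trace ID with a hit."""
    trace_ids = {hit["trace_id"] for hit in hits}
    by_trace = defaultdict(list)
    for log in logs:
        if log["trace_id"] in trace_ids:
            by_trace[log["trace_id"]].append(log)
    return by_trace


def synthesize(patterns, per_minute, by_trace):
    """Step 4: assemble a structured report from the earlier steps."""
    related = [log for group in by_trace.values() for log in group]
    timeline = sorted(related, key=lambda log: datetime.fromisoformat(log["ts"]))
    return {
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
        "peak_minute": per_minute.most_common(1)[0][0] if per_minute else None,
        "affected_services": sorted({log["service"] for log in related}),
        "timeline": [(log["ts"], log["service"], log["message"]) for log in timeline],
    }


def investigate(logs, keyword):
    """Run the four steps in order for one incident keyword."""
    hits = search(logs, keyword)
    patterns, per_minute = analyze(hits)
    by_trace = correlate(logs, hits)
    return synthesize(patterns, per_minute, by_trace)


report = investigate(LOGS, "timeout")
print(report["affected_services"])
```

Searching for "timeout" only matches the checkout errors, but the correlation step pulls in the payments errors that share the same trace IDs, which is why the downstream service shows up in the report.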
Key Features
- 6 Custom Tools: search_logs, get_error_frequency, find_correlated_logs, find_error_traces, search_past_incidents, save_investigation
- Multi-Service Correlation: Traces errors across microservices using trace IDs
- Knowledge Base: Saves investigations for future reference and pattern matching
- Interactive Dashboard: Streamlit-based UI with real-time metrics and visualizations
- CLI Interface: Full command-line access for terminal-based workflows
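To make the knowledge base feature concrete, here is a minimal sketch of saving investigations and matching a new incident against past ones. The method names mirror the `save_investigation` and `search_past_incidents` tools listed above, but the in-memory storage, the `KnowledgeBase` class, and the token-overlap (Jaccard) scoring are assumptions for illustration; the project itself stores investigations in Elasticsearch.

```python
import re


def tokenize(text):
    """Lowercase a description and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KnowledgeBase:
    """Hypothetical in-memory stand-in for the saved-investigations index."""

    def __init__(self):
        self._incidents = []

    def save_investigation(self, summary, root_cause):
        self._incidents.append({"summary": summary, "root_cause": root_cause})

    def search_past_incidents(self, description, min_score=0.2):
        """Return past incidents ranked by Jaccard overlap with the description."""
        query = tokenize(description)
        scored = []
        for incident in self._incidents:
            tokens = tokenize(incident["summary"])
            union = query | tokens
            score = len(query & tokens) / len(union) if union else 0.0
            if score >= min_score:
                scored.append((score, incident))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [incident for _, incident in scored]


kb = KnowledgeBase()
kb.save_investigation("checkout payment timeout errors",
                      "payments db connection pool exhausted")
kb.save_investigation("search latency after reindex", "cold caches")

matches = kb.search_past_incidents("payment timeout in checkout")
print(matches[0]["root_cause"])
```

A full-text or semantic search in Elasticsearch would replace the Jaccard score in practice; the point here is only the save-then-match loop.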
Technical Implementation
- Built on Elastic Agent Builder with ES|QL-powered tools
- ECS-compatible log schema for standardized data
- Elasticsearch for high-performance log search and aggregations
- Streamlit dashboard with Plotly visualizations
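As a rough illustration of how an ECS-style document and an ES|QL aggregation fit together, the sketch below builds one log document and a request body in the shape accepted by Elasticsearch's ES|QL query endpoint (`POST /_query`). The `logs-*` index pattern, the helper names, and the exact query are assumptions, not the project's actual tool definitions; nothing is sent over the network.

```python
import json


def ecs_log(timestamp, service, level, message, trace_id):
    """Build a log document using ECS field names (nested JSON objects)."""
    return {
        "@timestamp": timestamp,
        "message": message,
        "log": {"level": level},
        "service": {"name": service},
        "trace": {"id": trace_id},
    }


# Assumed query for an error-frequency tool: errors per minute per service.
ERROR_FREQUENCY_QUERY = """
FROM logs-*
| WHERE log.level == "error"
    AND @timestamp >= TO_DATETIME(?start)
    AND @timestamp <= TO_DATETIME(?end)
| EVAL minute = DATE_TRUNC(1 minute, @timestamp)
| STATS error_count = COUNT(*) BY minute, service.name
| SORT minute ASC
""".strip()


def esql_request_body(query, **params):
    """Pair a query with named parameters, one single-key object per parameter."""
    return {"query": query, "params": [{name: value} for name, value in params.items()]}


doc = ecs_log("2024-05-01T10:00:05Z", "checkout", "error", "payment timeout", "t1")
body = esql_request_body(
    ERROR_FREQUENCY_QUERY,
    start="2024-05-01T10:00:00Z",
    end="2024-05-01T10:05:00Z",
)
print(json.dumps(body["params"]))
```

Keeping the time window as named parameters (`?start`, `?end`) rather than interpolating strings means the same query text can be reused for every incident.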
Impact
LogSleuth transforms incident response from a manual, time-consuming process into an automated, intelligent workflow. By leveraging Elasticsearch's search capabilities through Agent Builder, teams can resolve incidents faster, reduce downtime, and focus on prevention rather than firefighting.
Features Used
- Elastic Agent Builder (custom agent + tools)
- Elasticsearch (data storage, search, aggregations)
- ES|QL queries
- ECS-compatible log schema
Challenges & Learnings
- Challenge: Designing ES|QL queries that work as reusable agent tools.
  Learning: Parameterized queries with clear descriptions help the LLM select the right tool.
- Challenge: Correlating logs across distributed services.
  Learning: Trace IDs are essential; the tool design must support iterative investigation.
- Challenge: Making the agent's reasoning transparent.
  Learning: Structured output formats (timelines, tables) make findings actionable.
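The first learning, parameterized queries with clear descriptions, can be illustrated with a small tool specification. The `ToolSpec` class below is hypothetical and is not Agent Builder's actual schema; it only shows the idea of pairing a reusable ES|QL query with a description and explicitly named parameters, and rejecting calls that do not match. The `logs-*` index pattern and the query text are likewise assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one reusable, parameterized query tool."""
    name: str
    description: str  # what the LLM reads when choosing a tool
    query: str
    params: dict = field(default_factory=dict)  # parameter name -> description

    def bind(self, **values):
        """Check that exactly the declared parameters are supplied, then pair them with the query."""
        missing = sorted(set(self.params) - set(values))
        unknown = sorted(set(values) - set(self.params))
        if missing or unknown:
            raise ValueError(f"{self.name}: missing={missing} unknown={unknown}")
        return {
            "query": self.query,
            "params": [{name: values[name]} for name in self.params],
        }


FIND_CORRELATED_LOGS = ToolSpec(
    name="find_correlated_logs",
    description="Return all logs that share a trace ID, oldest first, "
                "to follow one request across services.",
    query=(
        "FROM logs-* "
        "| WHERE trace.id == ?trace_id "
        "| SORT @timestamp ASC "
        "| KEEP @timestamp, service.name, log.level, message "
        "| LIMIT 200"
    ),
    params={"trace_id": "Trace ID taken from an error log found earlier"},
)

bound = FIND_CORRELATED_LOGS.bind(trace_id="t1")
print(bound["params"])
```

Because each call takes a single trace ID, the agent can invoke the tool repeatedly as it discovers new IDs, which fits the iterative investigation style described in the second learning.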
What I Liked
- Agent Builder's Tool Framework: Converting ES|QL queries into callable tools is elegant and powerful
- Elasticsearch Performance: Sub-second query responses even with complex aggregations
- Flexibility: The agent can handle open-ended incident descriptions and adapt its investigation approach
I also wrote an article about the project: Building LogSleuth
Built With
- ecs
- elastic-agent-builder
- elasticsearch
- plotly
- python
- rich
- streamlit