Inspiration

Modern cloud systems produce massive volumes of logs, metrics, and deployment events. When incidents occur, engineers must manually search through multiple dashboards and log streams to identify the root cause. This process is slow, stressful, and often requires multiple engineers. We wanted to explore how context-driven AI agents could automate incident investigation using Elasticsearch as the retrieval and reasoning backbone.

What it does

LogSage is a multi-step AI incident investigation agent that automatically analyzes production incidents. When an alert occurs, the agent retrieves relevant logs from Elasticsearch, runs analytics queries, correlates deployments with error spikes, and determines the most likely root cause. The system then produces a clear explanation and suggests corrective actions such as restarting services, rolling back deployments, or creating incident tickets. Instead of manually searching logs, engineers can trigger LogSage and receive an investigation report within seconds.

How we built it

LogSage uses Elastic Agent Builder to orchestrate a reasoning agent connected to multiple tools. The agent has access to:
- Search Tool – retrieves relevant logs from Elasticsearch
- ES|QL Tool – analyzes patterns in logs and metrics
- Workflow Tool – executes operational tasks like rollback or ticket creation

A React dashboard allows users to view incidents, trigger investigations, and review the agent’s reasoning steps. The backend orchestrates agent workflows using Node.js APIs connected to Elasticsearch indexes containing logs, metrics, and deployment events.

Challenges we ran into

One challenge was designing a reliable multi-step reasoning flow that balances search retrieval and analytical queries. We experimented with different ES|QL queries to identify meaningful patterns such as sudden spikes in errors after deployments. Another challenge was structuring logs in a way that allowed the agent to correlate events across time and services.
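
To make the "error spike after a deployment" pattern concrete, here is a minimal TypeScript sketch of one way such a correlation could work. It is not LogSage's actual code: the `ErrorBucket`, `Deployment`, and `Suspect` shapes, the function name `findSuspectDeployments`, and the default window and threshold are all assumptions made for illustration. It assumes per-minute error counts have already been fetched, for example from an ES|QL aggregation that groups errors by time bucket and service.

```typescript
// Hypothetical data shapes; the project does not publish its schema.
interface ErrorBucket {
  service: string;
  minute: number; // bucket start, epoch milliseconds
  errors: number; // error log count in that bucket
}

interface Deployment {
  service: string;
  version: string;
  deployedAt: number; // epoch milliseconds
}

interface Suspect {
  deployment: Deployment;
  before: number; // mean errors per bucket before the deploy
  after: number; // mean errors per bucket after the deploy
  ratio: number; // smoothed after/before ratio
}

// Arithmetic mean; an empty window counts as zero errors.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flags deployments whose service shows a jump in error rate
// in the window right after the deploy, compared with the window before.
// Note: buckets with zero errors are usually absent from aggregation
// results, so the means here are taken over the buckets that exist.
function findSuspectDeployments(
  buckets: ErrorBucket[],
  deployments: Deployment[],
  windowMs: number = 10 * 60000, // assumed default: 10 minutes
  minRatio: number = 3, // assumed default: 3x increase
): Suspect[] {
  const suspects: Suspect[] = [];
  for (const d of deployments) {
    const sameService = buckets.filter((b) => b.service === d.service);
    const before = mean(
      sameService
        .filter((b) => b.minute >= d.deployedAt - windowMs && b.minute < d.deployedAt)
        .map((b) => b.errors),
    );
    const after = mean(
      sameService
        .filter((b) => b.minute >= d.deployedAt && b.minute < d.deployedAt + windowMs)
        .map((b) => b.errors),
    );
    // +1 smoothing avoids division by zero when the service was quiet.
    const ratio = (after + 1) / (before + 1);
    if (ratio >= minRatio) {
      suspects.push({ deployment: d, before, after, ratio });
    }
  }
  // Strongest spike first, so the top entry is the likeliest root cause.
  return suspects.sort((a, b) => b.ratio - a.ratio);
}
```

A before/after ratio is only one possible heuristic. It is cheap to compute and easy for an agent to explain in a report, but it can misfire on low-traffic services, which is why the threshold and window are left as parameters.
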
Accomplishments we’re proud of

We built a working context-driven AI operations agent that demonstrates how LLM reasoning combined with Elasticsearch retrieval can dramatically reduce incident investigation time. LogSage transforms raw observability data into actionable insights in seconds.

What we learned

We learned how powerful Elasticsearch becomes when used as the memory and context engine for AI agents. Hybrid search and ES|QL allow agents to reason over operational data in ways that traditional dashboards cannot.

What’s next

Future versions of LogSage will include:
- anomaly detection
- automatic remediation workflows
- integration with Slack and GitHub
- collaborative multi-agent verification

Built With
- all