LogSage

Error spike timeline showing distinct incident bursts across the week — each spike is a separate failure scenario detected by LogSage
Per-service error breakdown and average latency trend showing which services are most affected and how degradation progresses over time
AI-generated incident reports with verdict and confidence score, plus escalation audit trail showing which incidents triggered Slack alerts
Slack Block Kit escalation alert — root cause, affected services, and recommendation delivered automatically when confidence ≥ 70
LogSage Investigator agent responding to an investigation query — using ES|QL tools to detect anomalies directly from Elasticsearch

Inspiration

Every engineering team knows the pain: it's 2 AM, an alert fires, and an on-call engineer spends the next hour manually correlating logs across five different services to figure out if it's real. Alert fatigue is real, and the cost of that manual triage — in time, sleep, and missed incidents — is enormous.

I wanted to build a system where AI agents do the investigation, not the engineer.

What It Does

LogSage is a 4-phase multi-agent pipeline built on Elastic Agent Builder:

Investigator Agent — detects error spikes and latency anomalies across microservices using ES|QL time-series queries
Analyst Agent — correlates events across services, establishes a failure sequence, and generates a root cause hypothesis
Validator Agent — independently re-queries the data to verify the hypothesis and assigns a Verdict (CONFIRMED / PARTIALLY CONFIRMED / UNCONFIRMED) with a Confidence Score (0–100)
Auto-Escalation — if Confidence ≥ 70, automatically writes an audit record to Elasticsearch and sends a rich Slack Block Kit notification with root cause, affected services, and recommendation

Every incident report is persisted to Elasticsearch. A 6-panel Kibana dashboard provides full observability across raw logs, AI-generated incidents, and escalation history.

How I Built It

The three agents are created in Kibana Agent Builder with ES|QL and Search tools. A Python orchestrator (orchestrate.py) calls each agent sequentially via the A2A (Agent-to-Agent) JSON-RPC 2.0 endpoint, passing context between phases automatically. The Elastic Managed LLM handles all reasoning — no external API keys or token budgets needed.

A separate setup_dashboard.py script provisions all Kibana data views, Lens visualizations, and the dashboard automatically, making the project fully reproducible from a single command.

Challenges

The A2A endpoint occasionally returned 502/503 errors under load, which required implementing exponential backoff retry logic (up to 5 attempts) to make the pipeline reliable.

The Kibana Cases API is not available on Elasticsearch Serverless projects (only Security/Observability types), which led me to build a custom logsage-actions Elasticsearch index as a more flexible and queryable audit trail.

What I Learned

ES|QL is remarkably powerful for time-series anomaly detection — queries that would require complex multi-stage pipelines elsewhere are concise and fast. The Elastic Managed LLM made iteration extremely fast with zero configuration. Most importantly, the A2A protocol makes it straightforward to chain agents into real workflows rather than isolated chat sessions.

Built With

a2a-protocol
elastic-agent-builder
elastic-managed-llm
elasticsearch
es|ql
incoming
kibana
python
slack
webhooks

Updates

Sidal Bingöl started this project — Feb 27, 2026 02:58 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.