Inspiration

Every SRE has lived this nightmare: the pager goes off at 2 AM, dashboards light up red, and you're manually drilling through traces, cross-referencing deployment logs, and mentally correlating timestamps — all before you can even hypothesize what went wrong. We timed ourselves: 20-30 minutes of skilled human effort per incident, every single time.

We asked ourselves: what if Elastic Agent Builder could do this entire workflow autonomously? Not just detect the anomaly, but investigate it, correlate it with deployments, suggest a fix, apply it in a sandbox, and prove the fix works — all before a human even opens their laptop?

That's AutoTrace.

What it does

AutoTrace is an open-source SDK + orchestration layer that transforms Elastic Agent Builder into an autonomous incident investigator for OpenTelemetry-instrumented distributed systems. It provisions 6 custom ES|QL tools in Agent Builder and runs a 7-agent LangGraph pipeline that:

  1. DETECT — Queries p95 latency and error rates per service via custom ES|QL tools
  2. INVESTIGATE — Drills into slow spans, reconstructs trace chains using MCP tool calls
  3. CORRELATE — Checks a deployment events index for temporal correlation with the anomaly onset
  4. SYNTHESIZE — Produces a ranked root cause hypothesis with confidence level (via Gemini)
  5. FIX — Classifies the root cause into 5 fix profiles (connection reuse, slow query, pool exhaustion, blocking I/O, legacy fault) and generates targeted environment overrides
  6. APPLY — Restarts the faulty service with the targeted fix via Docker Compose
  7. VERIFY — Measures before/after p95 latency and computes real improvement percentage

Total: ~30 seconds, zero human effort. Resolutions are stored in autotrace-resolutions with full before/after metrics, so every fix is auditable and replayable.
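
To make the VERIFY step concrete, here is a minimal sketch of how before/after p95 latency and the improvement percentage could be measured and recorded. The URLs, sample counts, and document fields are illustrative assumptions, not the SDK's actual schema.

```python
# Minimal sketch of the verify idea: measure p95 latency before and after a fix and
# record the delta. URLs, sample counts, and the document schema are assumptions.
import statistics
import time

import httpx
from elasticsearch import Elasticsearch

def measure_p95_ms(url: str, samples: int = 50) -> float:
    """Issue real HTTP requests and return the p95 latency in milliseconds."""
    latencies = []
    with httpx.Client(timeout=10) as client:
        for _ in range(samples):
            start = time.perf_counter()
            client.get(url)
            latencies.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(latencies, n=100)[94]  # 95th percentile cut point

before = measure_p95_ms("http://localhost:8001/checkout")   # hypothetical endpoint
# ... apply the targeted fix and restart the service via docker compose ...
after = measure_p95_ms("http://localhost:8001/checkout")

improvement_pct = (before - after) / before * 100

Elasticsearch("http://localhost:9200").index(
    index="autotrace-resolutions",
    document={"p95_before_ms": before, "p95_after_ms": after,
              "improvement_pct": round(improvement_pct, 1)},
)
```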

It also includes a Code Pipeline — an LLM-powered static analysis engine that scans any Python codebase for real bugs (async/await misuse, connection leaks, blocking I/O in async handlers), generates fixes, runs tests, and creates GitHub PRs — all orchestrated by LangGraph.
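
To give a flavor of the bug classes it looks for, here is a hypothetical example of blocking I/O inside an async FastAPI handler alongside the non-blocking rewrite; the service, endpoints, and URLs are invented for illustration.

```python
# Hypothetical example of one bug class the Code Pipeline flags: blocking I/O in an
# async handler. The endpoints and URLs are invented for illustration.
import httpx
import requests
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders")                     # hypothetical endpoint
async def list_orders():
    # BUG: requests.get() blocks the event loop, stalling every other in-flight request
    resp = requests.get("http://inventory:8000/stock")
    return resp.json()

shared_client = httpx.AsyncClient()     # long-lived client, so connections are reused

@app.get("/orders_fixed")               # hypothetical endpoint showing the generated fix
async def list_orders_fixed():
    # FIX: await an async HTTP client instead of blocking the event loop
    resp = await shared_client.get("http://inventory:8000/stock")
    return resp.json()
```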

How we built it

We built AutoTrace as a Python SDK that deeply integrates with the Elastic stack:

  • Elastic Agent Builder — We programmatically provision 6 custom ES|QL tools and an investigation agent via the Kibana REST API. Each tool is a parameterized ES|QL query against OTel trace data (traces-generic-default) and deployment events.

  • MCP (Model Context Protocol) — We implemented a full JSON-RPC 2.0 MCP client that connects to the Kibana Agent Builder MCP endpoint. The investigation agent uses this to make real tool calls — detect_latency_anomaly, analyze_slow_spans, latency_timeline, etc. (a minimal wire-level call sketch follows this list).

  • LangGraph — Our v2 pipeline is a state graph with 7 specialized nodes. Each agent has a clearly defined role and passes structured state to the next. The pipeline supports streaming progress to the terminal and Slack, plus human-in-the-loop via Slack messages or a local file (a simplified state-graph sketch follows this list).

  • Gemini — Used for the Planner (strategy selection), Synthesis (report summarization), and the Code Pipeline (bug analysis and fix generation), all via langchain-google-genai.

  • Demo System — 5 FastAPI microservices with OTel auto-instrumentation and real, config-driven bugs inspired by actual production incidents (from encode/httpx#2139, asyncio best practices docs). These aren't toy faults — they're patterns that cause real outages.

  • Sandbox Validation — Real HTTP latency measurements using httpx, real docker compose restarts, and real improvement percentage calculations. Nothing is mocked.
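
For a sense of what those MCP tool calls look like on the wire, here is a minimal JSON-RPC 2.0 tools/call sketch against the Kibana MCP endpoint. The endpoint path, auth header, and tool arguments are illustrative assumptions; note the underscore form of the tool name, which reflects the renaming quirk described under Challenges below.

```python
# Minimal sketch of an MCP tools/call over JSON-RPC 2.0.
# The endpoint path, auth header, and arguments are assumptions, not AutoTrace's exact values.
import httpx

KIBANA_MCP_URL = "http://localhost:5601/api/agent_builder/mcp"  # assumed endpoint path
HEADERS = {"Authorization": "ApiKey <redacted>", "Content-Type": "application/json"}

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        # Agent Builder registers the tool with dots replaced by underscores
        "name": "autotrace_detect_latency_anomaly",
        "arguments": {"since": "now-30m"},   # illustrative parameter
    },
}

response = httpx.post(KIBANA_MCP_URL, json=request, headers=HEADERS, timeout=30)
print(response.json())
```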
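
The v2 state graph has roughly the following shape. This is a simplified sketch: the state fields and node bodies are placeholders standing in for the real agents, not the actual implementation.

```python
# Simplified sketch of a 7-node LangGraph pipeline; node bodies are placeholders.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PipelineState(TypedDict, total=False):
    anomaly: dict
    findings: dict
    deployment: dict
    hypothesis: str
    fix: dict
    applied: bool
    verification: dict

def detect(state: PipelineState) -> dict:        # queries p95 latency / error-rate tools
    return {"anomaly": {"service": "service-b", "p95_ms": 1840}}  # placeholder output

# investigate/correlate/synthesize/fix/apply/verify are sketched the same way
def investigate(state): return {"findings": {}}
def correlate(state): return {"deployment": {}}
def synthesize(state): return {"hypothesis": "connection reuse disabled"}
def fix(state): return {"fix": {"HTTP_CLIENT_REUSE": "true"}}
def apply_fix(state): return {"applied": True}
def verify(state): return {"verification": {"improvement_pct": 88.6}}

graph = StateGraph(PipelineState)
for name, node in [("detect", detect), ("investigate", investigate),
                   ("correlate", correlate), ("synthesize", synthesize),
                   ("fix", fix), ("apply", apply_fix), ("verify", verify)]:
    graph.add_node(name, node)

graph.set_entry_point("detect")
for a, b in [("detect", "investigate"), ("investigate", "correlate"),
             ("correlate", "synthesize"), ("synthesize", "fix"),
             ("fix", "apply"), ("apply", "verify")]:
    graph.add_edge(a, b)
graph.add_edge("verify", END)

pipeline = graph.compile()
result = pipeline.invoke({})   # runs detect → ... → verify, threading state through
```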

Challenges we ran into

  • MCP Protocol Quirks — Agent Builder replaces dots with underscores in tool names during MCP registration (autotrace.detect_latency_anomaly → autotrace_detect_latency_anomaly). We had to build a bidirectional name mapping layer to handle this seamlessly.

  • LLM Tool Call Reliability — Gemini sometimes returns no tool calls or hallucinates tool names. We built fallback logic: if the LLM doesn't call tools, we execute a deterministic tool sequence and build a structured report from the raw results.

  • ES|QL Parameter Binding — Getting parameterized ES|QL queries to work correctly with the Agent Builder tool framework required careful handling of ?since timestamp parameters and now-30m defaults (an illustrative query follows this list).

  • Async Bug Accuracy — Service E alone has 12+ potential concurrent-access patterns. Ensuring the LLM code scanner catches real bugs (not false positives) required extensive prompt engineering and structured output parsing.

  • End-to-End Validation — The hardest part was making the sandbox validation genuinely meaningful. We needed real latency deltas (not simulated ones) to prove fixes work, which meant carefully orchestrating Docker restarts, warmup periods, and measurement windows.
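
Here is a rough sketch of what a parameterized latency-detection query can look like. The index name traces-generic-default comes from the setup above, but the field names and exact ES|QL are assumptions for illustration, not the query AutoTrace actually provisions.

```python
# Illustrative parameterized ES|QL for a p95-latency-per-service tool.
# Field names assume OTel trace mappings; the real provisioned query may differ.
DETECT_LATENCY_ANOMALY_ESQL = """
FROM traces-generic-default
| WHERE @timestamp >= ?since
| STATS p95_duration = PERCENTILE(duration, 95) BY service.name
| SORT p95_duration DESC
"""

DEFAULT_PARAMS = {"since": "now-30m"}  # default window when the caller passes nothing
```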

Accomplishments that we're proud of

  • Zero-mock architecture — Every API call is real. autotrace verify runs 8 live checks against Kibana, Elasticsearch, MCP, and Docker. No test doubles, no simulated responses.

  • Targeted fixes, not shotgun remediation — The Fix Agent doesn't just restart services. It classifies the root cause and applies only the relevant env override (e.g., HTTP_CLIENT_REUSE=true for connection reuse issues), so you can see exactly what fixed the problem (a sketch of this mapping follows the list).

  • Full audit trail — Three Elasticsearch indices store everything: investigation reasoning (autotrace-findings), resolution traces with before/after metrics (autotrace-resolutions), and pipeline traceability events (autotrace-agent-traces).

  • Two independent pipelines — The Trace/RCA pipeline and Code pipeline are cleanly separated. The code pipeline works on any local Python project — it doesn't need Elastic at all.

  • autotrace setup → autotrace investigate — Two commands to go from zero to a fully autonomous investigation. The entire toolkit is provisioned programmatically.
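
To illustrate the targeted-fix idea, the Fix Agent's classification can be modeled as a mapping from fix profile to environment override. Apart from HTTP_CLIENT_REUSE, every variable name below is a made-up placeholder rather than one of AutoTrace's real overrides.

```python
# Sketch of a fix-profile → env-override mapping. Only HTTP_CLIENT_REUSE comes from the
# write-up above; every other variable name here is a made-up placeholder.
FIX_PROFILES = {
    "connection_reuse": {"HTTP_CLIENT_REUSE": "true"},
    "slow_query":       {"DB_QUERY_TIMEOUT_MS": "500"},       # placeholder
    "pool_exhaustion":  {"DB_POOL_SIZE": "20"},                # placeholder
    "blocking_io":      {"USE_ASYNC_CLIENT": "true"},          # placeholder
    "legacy_fault":     {"LEGACY_FAULT_ENABLED": "false"},     # placeholder
}

def overrides_for(root_cause: str) -> dict:
    """Return only the env override relevant to the classified root cause."""
    return FIX_PROFILES.get(root_cause, {})
```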

What we learned

  • Agent Builder's MCP server is a powerful primitive — once you provision custom ES|QL tools, you can build arbitrarily complex investigation workflows on top of it.

  • LangGraph's state graph model maps perfectly to incident response workflows where each step depends on the previous step's output.

  • Real config-driven bugs (connection reuse disabled, tiny connection pools, blocking I/O in async handlers) are far more valuable for demos than synthetic sleep() faults — they produce realistic trace patterns that the agent can genuinely diagnose.

  • The gap between "detecting an anomaly" and "proving you fixed it" is huge. Sandbox validation with real latency measurements is what turns a detection tool into a resolution tool.
