Inspiration

On-call engineers spend the majority of incident response time doing the same thing every time: querying logs to find where errors are concentrated, pulling up a runbook, deciding if it's safe to act. This is mechanical reasoning, not creative engineering — yet it still wakes people up at 2am. PagerPilot exists because the data to diagnose most incidents is already sitting in Elasticsearch. The question was whether an agent could do the reasoning and act safely, without a human in the loop for the routine cases.

The broader inspiration: most AIOps tools on the market pull your data out of your observability stack into a separate platform. We wanted to prove the inverse: bring the intelligence to where the data already lives.

What it does

PagerPilot is an autonomous incident response agent built entirely on Elastic Agent Builder. When an alert fires:

  1. A Workflow receives the webhook and invokes the PagerPilot agent
  2. The agent runs three ES|QL queries against live log data: regional error concentration, latency anomaly detection, and error code fingerprinting
  3. It retrieves the relevant Standard Operating Procedure from a semantic search index using hybrid retrieval (BM25 + ELSER)
  4. It produces a RemediationPlan with a deterministic confidence score and structured evidence trail
  5. The Workflow's safety gate evaluates the plan: if confidence ≥ 0.90 and error count thresholds are met, it auto-remediates via a mock ops API; otherwise it sends an approval request to Slack with the full evidence summary
  6. The incident closes. Every decision — queries run, SOP cited, action taken — is auditable in the reasoning trace.
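
The safety gate in step 5 can be sketched in a few lines of Python. This is an illustration of the routing logic only, not the Workflow definition itself: the 0.90 confidence threshold comes from the description above, while the `RemediationPlan` fields, the `MIN_ERROR_COUNT` value, and the return labels are hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.90  # from the write-up
MIN_ERROR_COUNT = 100        # hypothetical value; the real threshold is not stated


@dataclass
class RemediationPlan:
    """Hypothetical shape of the plan the agent hands to the Workflow."""
    action: str
    confidence: float
    error_count: int
    sop_id: str


def route(plan: RemediationPlan) -> str:
    """Return which branch of the gate a plan should take."""
    if plan.confidence >= CONFIDENCE_THRESHOLD and plan.error_count >= MIN_ERROR_COUNT:
        return "auto_remediate"
    # Anything below either threshold goes to a human with the evidence attached.
    return "request_approval"
```

Both conditions must hold for the automatic path, so a high-confidence plan on a handful of errors still goes to Slack for approval.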

In the demo: a DB_CONNECTION_TIMEOUT outbreak on the checkout service goes from alert to connection pool restart in under 90 seconds, with no human intervention.

How we built it

  • Elastic Cloud (serverless): hosts all indexes and runs Agent Builder and Workflows
  • logs-service-prod index: synthetic operational logs generated by generate_traffic.py in Normal or Chaos mode using the Elasticsearch Bulk API
  • sops-wiki index: runbooks mapped with semantic_text fields, backed by ELSER for sparse embedding on ingest
  • Agent Builder: two tools configured via the Kibana UI and API — esql_analytics (three parameterized ES|QL templates) and sop_retriever (RRF hybrid retrieval using the semantic query type)
  • Elastic Workflows: webhook trigger → agent invocation → confidence-gated switch → mock HTTP remediation or Slack connector
  • Kibana: Discover for live log monitoring, Agent chat for trace inspection, Lens dashboard for MTTR metrics
  • All infrastructure is Elastic-native. No external orchestration frameworks, no LangChain, no custom agent loop.
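
To make the intent-to-template idea behind esql_analytics concrete, here is a minimal Python sketch of three fixed ES|QL templates parameterized by service and time window. The field names (`service`, `level`, `region`, `latency_ms`, `error_code`), the intent keys, and the `build_request` helper are assumptions for illustration; the request body follows the named-parameter form of the ES|QL query API as we understand it, and the actual tool is configured in Agent Builder rather than in Python.

```python
# Three fixed templates; only ?service and ?start vary between calls.
ESQL_TEMPLATES = {
    "regional_error_concentration": (
        "FROM logs-service-prod\n"
        '| WHERE service == ?service AND level == "ERROR" AND @timestamp >= ?start\n'
        "| STATS errors = COUNT(*) BY region\n"
        "| SORT errors DESC"
    ),
    "latency_anomaly": (
        "FROM logs-service-prod\n"
        "| WHERE service == ?service AND @timestamp >= ?start\n"
        "| STATS p95_ms = PERCENTILE(latency_ms, 95) BY region\n"
        "| SORT p95_ms DESC"
    ),
    "error_code_fingerprint": (
        "FROM logs-service-prod\n"
        '| WHERE service == ?service AND level == "ERROR" AND @timestamp >= ?start\n'
        "| STATS hits = COUNT(*) BY error_code\n"
        "| SORT hits DESC\n"
        "| LIMIT 5"
    ),
}


def build_request(intent: str, service: str, start: str) -> dict:
    """Map an intent to its fixed template and bind the two parameters."""
    if intent not in ESQL_TEMPLATES:
        raise ValueError(f"unknown intent: {intent}")
    return {
        "query": ESQL_TEMPLATES[intent],
        # Named parameters are passed separately, never spliced into the query text.
        "params": [{"service": service}, {"start": start}],
    }
```

Because the query text never changes, every query the agent can run is known in advance and can be reviewed like any other code.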

Challenges we ran into

  1. Deterministic confidence without LLM self-reporting: LLMs are bad at calibrating their own confidence. We built a formula that computes confidence from query results — regional error concentration (45%), latency anomaly magnitude (35%), and SOP retrieval score (20%) — so the safety gate has a real signal, not a hallucinated probability.

  2. ES|QL template design vs free-form generation: Early prototypes tried to have the LLM generate ES|QL queries from natural language. This was brittle and unpredictable under test. Switching to intent-to-template mapping (three fixed queries, parameterized by service and time window) made the system reliable and the queries auditable.

  3. semantic_text query type: In our testing, a standard match query on a semantic_text field silently bypassed the semantic model and fell back to lexical matching. The pattern that worked is the semantic query type, wrapped in an RRF retriever for hybrid scoring. This was not obvious from surface-level documentation.
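
The hybrid pattern from challenge 3 looks roughly like the search body below, built here as a Python dict. It is a sketch under assumptions: the field names `content` (plain text, scored with BM25) and `content_semantic` (the semantic_text field) are hypothetical, as are the `rank_window_size` and `size` values.

```python
def build_sop_search(query_text: str,
                     text_field: str = "content",
                     semantic_field: str = "content_semantic",
                     size: int = 3) -> dict:
    """Build a search body that fuses a lexical and a semantic leg with RRF."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Lexical leg: ordinary BM25 match on the text field.
                    {"standard": {"query": {"match": {text_field: query_text}}}},
                    # Semantic leg: the semantic query type on the semantic_text field.
                    {"standard": {"query": {"semantic": {
                        "field": semantic_field,
                        "query": query_text,
                    }}}},
                ],
                "rank_window_size": 50,
            }
        },
        "size": size,
    }


body = build_sop_search("DB_CONNECTION_TIMEOUT on checkout service")
```

The resulting `body` would be sent to the `_search` endpoint of the sops-wiki index. RRF combines the two ranked lists by rank position, so the lexical and semantic scores do not need to be on the same scale.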

Accomplishments that we're proud of

  • The agent never guesses. Every root-cause hypothesis is backed by ES|QL query output, and the evidence is included verbatim in the remediation plan. Judges and operators can inspect exactly what data led to each decision.
  • The confidence gate is grounded in statistics, not vibes. confidence: 0.92 means something specific: 97% of errors concentrated in one region + latency 52x above baseline + SOP found.
  • The full loop — alert to remediation — runs in under 90 seconds in the demo, with the logs visibly recovering.
  • No external dependencies beyond Elastic Cloud, a Python script, and the Slack connector used for approvals. The entire stack is reproducible in under 10 minutes from a fresh Elastic Cloud trial.
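
As a rough illustration of how a deterministic confidence score can be computed from query results, the sketch below applies the 45/35/20 weights described under Challenges. The weights are from this write-up; how each signal is normalized to the 0..1 range is not specified, so the latency scaling (`latency_cap`) and the clamping are assumptions, and the function name is hypothetical.

```python
WEIGHTS = {"concentration": 0.45, "latency": 0.35, "sop": 0.20}


def clamp01(value: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def confidence(concentration: float, latency_ratio: float, sop_score: float,
               latency_cap: float = 10.0) -> float:
    """Weighted sum of three signals, each normalized to [0, 1].

    concentration: share of errors in the top region (0..1)
    latency_ratio: observed latency divided by baseline (1.0 means normal)
    sop_score:     normalized SOP retrieval score (0..1)
    latency_cap:   assumed ratio at which the latency signal saturates
    """
    latency_signal = clamp01((latency_ratio - 1.0) / (latency_cap - 1.0))
    return (WEIGHTS["concentration"] * clamp01(concentration)
            + WEIGHTS["latency"] * latency_signal
            + WEIGHTS["sop"] * clamp01(sop_score))
```

Under these assumptions, 97% concentration, latency at 52x baseline, and a hypothetical normalized SOP score of 0.67 give 0.4365 + 0.35 + 0.134 = 0.9205, which clears a 0.90 gate. No term depends on the LLM's own estimate of how sure it is.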

What we learned

  • Elastic Agent Builder's bidirectional relationship between agents and workflows (agents trigger workflows, workflows invoke agents) is more powerful than it looks on paper. It's the right primitive for safety-gated autonomous action.
  • ES|QL is genuinely underused for incident analytics. The ability to do PERCENTILE, STATS BY, and EVAL inline, directly on log indexes, eliminates an entire class of custom alerting logic.
  • semantic_text with ELSER changes the economics of runbook retrieval. Ingesting a Markdown file and getting high-quality semantic search without a separate embedding pipeline is a real operational simplification.
  • Scope discipline is the main determinant of whether a hackathon project demos well. One incident class, one action, one decision branch, and it looks sharp. Three incident classes with partial implementations look like a prototype.

What's next for PagerPilot

  • Multi-service coverage: extend ES|QL templates and SOP corpus to cover memory leak, pod crash loop, and certificate expiry incident classes
  • Feedback loop: after auto-remediation, run a validation query 5 minutes later to confirm error rate dropped; if not, re-escalate automatically
  • Human approval via Kibana: replace Slack approval with an in-platform approval workflow so the entire incident lifecycle stays in Elastic
  • Trace indexing: write the full reasoning_evidence chain to a dedicated incidents-audit index so post-incident reviews can query the agent's reasoning history with ES|QL
  • Live alert integration: replace the webhook trigger with a native Elastic alert rule trigger, closing the loop from detection to resolution entirely within the platform
