QuantumState — Project Story

The Problem

At 3 AM, an SRE wakes up to a slow climb on a memory graph. The anomaly has to be noticed, investigated, traced to a cause, matched to a runbook, and remediated while the clock runs. By the time the first fix lands, 60 minutes of production impact have already passed.

The 3AM Problem

Automating that workflow with AI usually means integration hell: LangChain, an external vector store, third-party LLM API keys, and a custom orchestration layer. The result is often more fragile than the manual process it was meant to replace, and it ships sensitive production telemetry outside the cluster with every call.

QuantumState is a demonstration of what becomes possible when you build that intelligence natively inside Elastic. Four Agent Builder agents handle the complete incident lifecycle — detection through verification — without a human in the loop.

The Agent Swarm

Cassandra detects anomalies using parameterised ES|QL queries against rolling baselines. Archaeologist correlates error logs and deployment events, then uses ELSER to search historical incidents by meaning. Surgeon retrieves the relevant runbook semantically and, when confidence clears 0.8, calls the Kibana Workflow tool directly to trigger remediation. Guardian verifies recovery and either closes the incident with a calculated MTTR or escalates with full context attached.
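Surgeon's confidence gate can be sketched in a few lines. This is a minimal illustration, not the project's implementation: only the 0.8 threshold comes from the description above, while `RunbookMatch`, `should_remediate`, and the choice to treat a score of exactly 0.8 as passing are assumptions.

```python
from dataclasses import dataclass

# 0.8 is the threshold named in the text; every other name here is hypothetical.
CONFIDENCE_THRESHOLD = 0.8


@dataclass(frozen=True)
class RunbookMatch:
    runbook_id: str    # hypothetical identifier of the retrieved runbook
    confidence: float  # score in [0, 1] attached to the match


def should_remediate(match: RunbookMatch,
                     threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Return True only when the match confidence reaches the threshold.

    Treating a score of exactly 0.8 as passing is an assumption.
    """
    if not 0.0 <= match.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {match.confidence}")
    return match.confidence >= threshold
```

Keeping the gate as a pure function makes the threshold easy to test and tune separately from whatever triggers the workflow.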

Surgeon decides. The MCP Runner executes — polling Elasticsearch for approved remediation actions, performing the fix, and recording the result back into the cluster.
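The poll, execute, record cycle might look roughly like the sketch below. It uses an in-memory list as a stand-in for the Elasticsearch index, and the `Action` fields, the status values, and the `run_once` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    action_id: str
    kind: str                 # hypothetical action type, e.g. "restart_service"
    status: str = "approved"  # assumed lifecycle: approved -> succeeded | failed
    result: str = ""


def run_once(store: List[Action],
             handlers: Dict[str, Callable[[Action], str]]) -> int:
    """Process every approved action once and record the outcome in place.

    `store` stands in for the index the runner would poll.
    Returns the number of actions handled in this pass.
    """
    handled = 0
    for action in store:
        if action.status != "approved":
            continue  # already processed, or not yet approved
        try:
            handler = handlers.get(action.kind)
            if handler is None:
                raise LookupError(f"no handler for {action.kind!r}")
            action.result = handler(action)
            action.status = "succeeded"
        except Exception as exc:
            # Record the failure instead of dropping it, so it can be verified later.
            action.result = str(exc)
            action.status = "failed"
        handled += 1
    return handled
```

A real runner would call something like `run_once` on a timer and write the updated status back to the cluster. Because only `approved` actions are picked up, a second pass over the same store is a no-op.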

All four agents and their tools are built natively inside Elastic using Agent Builder — no external LLM APIs, no data leaving the cluster.

QuantumState Architecture

What Stood Out

Two things stood out. ELSER made a concrete, measurable difference — semantic incident recall without synonym lists, manual tagging, or brittle field mappings. And Agent Builder and the Kibana API made building and shipping agents seamless — everything lives exactly where the logs and metrics already are.

Challenges

The real challenges were in orchestration. Getting four agents to hand off cleanly — each agent's output becoming the next agent's structured context — required careful state management at every seam. A dropped field anywhere in the chain breaks everything downstream.
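One way to catch a dropped field at the seam rather than several agents later is to validate each handoff before passing it on. The sketch below assumes a hypothetical schema: the field names and the `validate_handoff` helper are illustrative, and only the agent names come from the project.

```python
from typing import Any, Dict, List

# Hypothetical field names; the real handoff schema is not given in the text.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "archaeologist": ["incident_id", "service", "anomaly_type"],
    "surgeon": ["incident_id", "service", "root_cause"],
    "guardian": ["incident_id", "service", "action_taken"],
}


def missing_fields(receiver: str, payload: Dict[str, Any]) -> List[str]:
    """List required fields that are absent or empty for the receiving agent."""
    return [name for name in REQUIRED_FIELDS[receiver]
            if payload.get(name) in (None, "")]


def validate_handoff(receiver: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Fail loudly at the seam instead of letting a gap propagate downstream."""
    missing = missing_fields(receiver, payload)
    if missing:
        raise ValueError(
            f"handoff to {receiver} is missing: {', '.join(missing)}"
        )
    return payload
```

For example, `validate_handoff("surgeon", {"incident_id": "inc-1", "service": "checkout"})` raises a `ValueError` naming `root_cause`, which points straight at the agent that dropped it.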

Choosing the right interval for the automatic pipeline was less obvious than it sounds. Too frequent and you flood the system with redundant runs; too infrequent and the detection advantage disappears. The cooldown logic needed several iterations to feel right in practice.
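A cooldown of this kind can be sketched as a small per-key timer. The class below is an assumption about the general shape, not the project's logic: the `Cooldown` name, keying by service, and the 300-second window in the usage note are all illustrative.

```python
from typing import Dict


class Cooldown:
    """Suppress repeat pipeline runs for the same key within a time window."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self._last_run: Dict[str, float] = {}

    def should_run(self, key: str, now: float) -> bool:
        """Return True and start a new window if the key is not cooling down.

        `now` is passed in (for example from time.monotonic()) so the logic
        stays deterministic and easy to test.
        """
        last = self._last_run.get(key)
        if last is not None and now - last < self.seconds:
            return False  # still inside the window: skip this redundant run
        self._last_run[key] = now
        return True
```

With `Cooldown(300)`, a run for `"checkout"` at t=0 is allowed, a second at t=120 is skipped, and one at t=300 is allowed again. A skipped run does not extend the window, so a noisy service cannot suppress its own pipeline indefinitely.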

Setting up the Kibana Workflow was its own challenge. The dependency chain isn't obvious until you've hit every silent failure: ELSER deployed before indices, indices before tool definitions, workflow ID in place before Surgeon can call it.
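A preflight check that walks the dependency chain in order turns those silent failures into one explicit message. The sketch below is hypothetical throughout: the step names and the `first_unmet` helper are illustrative, the checks are plain callables rather than real Elastic or Kibana API calls, and placing the workflow ID last is an assumption (the text only says it must exist before Surgeon calls it).

```python
from typing import Callable, Dict, Optional

# Order taken from the description above; the step names are hypothetical.
SETUP_ORDER = [
    "elser_deployed",
    "indices_created",
    "tools_defined",
    "workflow_id_set",
]


def first_unmet(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first setup step whose check is missing or fails.

    Returns None when every step in SETUP_ORDER passes. Stopping at the
    first failure matters because later steps depend on earlier ones.
    """
    for step in SETUP_ORDER:
        check = checks.get(step)
        if check is None or not check():
            return step
    return None
```

In practice each callable would wrap a real lookup against the cluster, and the setup script would stop and report the returned step name instead of continuing.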
