MARS: Multi-Agent Research Synthesizer
Inspiration
Every engineer who has been paged at 2 AM knows the drill. Latency is spiking. You open five dashboards simultaneously: metrics, logs, deployments, Slack, runbooks. You're cross-referencing timestamps manually, trying to figure out if that deploy 20 minutes ago caused this. It takes 30-60 minutes to get a confident answer, and by then the incident has already cost you.
The frustrating part is that all the evidence exists. It's in Elasticsearch. It just needs to be retrieved, cross-referenced, and reconciled systematically. That's a job for agents, not humans.
What We Built
MARS (Multi-Agent Research Synthesizer) is a multi-step AI agent system that takes any operational question and returns a fully sourced, conflict-resolved answer in under 60 seconds.
The core insight is that different evidence sources have different reliability. Raw ES|QL data from time-series metrics is ground truth. Internal documentation is useful context. External web sources provide corroboration. When these sources contradict each other (and they often do), MARS automatically resolves the conflict using a trust hierarchy: ES|QL data always wins.
The 5-Agent Pipeline
1. Planner Agent: Calls Elastic Agent Builder (mars-research-synthesizer), which fires four custom tools: mars.spike_detector, mars.deploy_lookup, mars.doc_search, and mars.runbook_search. Claude Opus 4.5 reasons over the results and returns a structured narrative.
2. ES|QL Verifier: Executes parameterized ES|QL templates against time-series metrics and deployment data. Writes high-confidence claims (90-95%) to the shared Claim Ledger.
3. Retrieval Agent: Runs BM25 hybrid search over internal incident tickets and runbooks. Writes medium-confidence claims (70-88%) to the Claim Ledger.
4. Web Scout: Uses Tavily to search the web for external corroboration of internal findings. Writes low-confidence claims (62-68%) to the Claim Ledger.
5. Reviewer Agent: Reads all claims from the Claim Ledger, detects contradictions between sources, resolves them deterministically, and fires targeted follow-up ES|QL queries for any weak evidence.
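At a high level, the hand-off between these agents is a sequential loop over a shared ledger. A minimal sketch of that shape (the class and function names here are illustrative, not MARS's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A structured evidence record destined for the Claim Ledger."""
    agent: str
    source_type: str              # "esql" | "internal_doc" | "web"
    text: str
    confidence: float
    status: str = "open"
    conflicts_with: list = field(default_factory=list)

def run_pipeline(question, agents, ledger):
    """Run each agent in sequence; agents share state only via the ledger."""
    for agent in agents:
        # Each agent may read earlier claims from the ledger, never call a peer.
        ledger.extend(agent.investigate(question, ledger))
    return ledger
```

Because every agent only appends to (and reads from) the ledger, swapping one agent out never touches the others.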
The Claim Ledger
The architectural centrepiece is the Claim Ledger: an Elasticsearch index where every agent writes structured evidence records. Each claim has source_type, confidence, status, and conflicts_with fields. Agents are fully decoupled and communicate only through this shared artifact. The Reviewer reads everything and decides what to trust.
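A ledger record might look like the document below. The source_type, confidence, status, and conflicts_with fields come from the description above; the concrete values and the index name are illustrative:

```python
import json

# Illustrative claim document as an agent might index it into the ledger.
claim = {
    "agent": "esql_verifier",                    # who wrote the claim
    "source_type": "esql",                       # esql | internal_doc | web
    "text": "Latency spike began at 14:25 UTC",
    "confidence": 0.93,                          # ES|QL evidence: high confidence
    "status": "verified",
    "conflicts_with": [],                        # filled in by the Reviewer
}

# With elasticsearch-py this would be roughly:
#   es.index(index="mars-claim-ledger", document=claim)
print(json.dumps(claim, indent=2))
```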
Planted Contradictions
To demonstrate conflict detection, the demo data contains two deliberate contradictions:
- Incident ticket INC-2041 records the spike starting at 14:45 UTC. ES|QL data shows 14:25 UTC. ES|QL wins.
- Runbook RB-0034 states the DB pool maximum is 50 connections. ES|QL data shows 100. ES|QL wins.
Both are caught automatically, highlighted in the Evidence Heatmap with red pulsing cells, and the resolution reasoning is displayed in full.
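Resolution itself needs no model call: rank the conflicting claims by source type and let the most trusted one win. A minimal sketch (field names follow the Claim Ledger description; the resolve logic is illustrative):

```python
# Lower rank = more trusted. ES|QL always outranks docs, which outrank the web.
TRUST_RANK = {"esql": 0, "internal_doc": 1, "web": 2}

def resolve(conflicting):
    """Pick a winner deterministically and mark the losers superseded."""
    ordered = sorted(
        conflicting,
        key=lambda c: (TRUST_RANK[c["source_type"]], -c["confidence"]),
    )
    winner, losers = ordered[0], ordered[1:]
    winner["status"] = "verified"
    for c in losers:
        c["status"] = "superseded"
        c["conflicts_with"].append(winner["id"])
    return winner
```

For the INC-2041 contradiction, the 14:45 UTC claim (source_type internal_doc) loses to the 14:25 UTC ES|QL claim regardless of their relative confidences.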
Multi-Source Toggle
MARS works across three data sources, switchable live from the UI:
- Demo Data - synthetic incident with 335k+ documents and planted contradictions
- Sample Web Logs - real Kibana sample nginx data (14k requests, real 404/503 errors)
- Sample eCommerce - real Kibana sample order data (4.6k transactions, revenue trends)
Each source uses a dedicated Agent Builder agent with ES|QL tools pointing at the correct indices.
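The toggle can be little more than a lookup from source key to agent id and target indices. The Kibana sample index names below are the standard ones; the two extra agent ids are hypothetical (only mars-research-synthesizer is named above):

```python
# Source key -> which Agent Builder agent to call, and which indices its tools target.
SOURCES = {
    "demo": {
        "agent": "mars-research-synthesizer",
        "indices": ["mars-metrics", "mars-deployments"],   # illustrative index names
    },
    "web_logs": {
        "agent": "mars-web-logs-agent",                    # hypothetical agent id
        "indices": ["kibana_sample_data_logs"],
    },
    "ecommerce": {
        "agent": "mars-ecommerce-agent",                   # hypothetical agent id
        "indices": ["kibana_sample_data_ecommerce"],
    },
}

def select_source(key):
    """Resolve the active source, failing loudly on an unknown key."""
    try:
        return SOURCES[key]
    except KeyError:
        raise ValueError(f"unknown data source: {key!r}") from None
```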
Challenges
Conflict detection without an LLM - The Reviewer agent runs entirely in Python with no LLM calls. Writing deterministic rules that catch timestamp contradictions and numeric value disagreements across free-text claims required careful pattern matching and was harder than expected.
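The timestamp and numeric checks boil down to unit-anchored regexes over the free-text claims. A minimal sketch of the idea (patterns illustrative, not MARS's actual rules):

```python
import re

# A claim "contains a UTC time" if it matches HH:MM UTC.
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*UTC\b")
# Numeric checks are anchored to a unit word so unrelated numbers don't collide.
POOL_RE = re.compile(r"\b(\d+)\s+connections\b")

def _disagree(pattern, text_a, text_b):
    """True only when BOTH texts match and the captured values differ."""
    a, b = pattern.search(text_a), pattern.search(text_b)
    return bool(a and b and a.groups() != b.groups())

def timestamps_disagree(text_a, text_b):
    return _disagree(TIME_RE, text_a, text_b)

def pool_sizes_disagree(text_a, text_b):
    return _disagree(POOL_RE, text_a, text_b)
```

Requiring a match on both sides keeps the rules deterministic: a claim that simply doesn't mention a time can never be flagged against one that does.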
Real-time heatmap - Making claims appear row by row as each agent writes them required background threading in FastAPI, status polling, and careful handling of the Chart.js canvas lifecycle to avoid "canvas already in use" errors.
Elastic Cloud ES|QL differences - Several ES|QL patterns that work locally don't work on Elastic Cloud Serverless. COUNT(CASE WHEN ...) isn't supported. Shard/replica settings cause errors. Learning these differences cost significant time.
Auto follow-up loop termination - The follow-up query system needed strict termination conditions to prevent infinite loops: max 3 iterations per claim, duplicate query detection, and skipping claims that were already resolved by conflict detection.
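The three guards compose naturally into one bounded loop. A sketch under the same assumptions as before (dict-shaped claims; build_query and run_query stand in for MARS's template and ES|QL layers):

```python
MAX_FOLLOWUPS = 3  # hard cap per claim

def strengthen(claim, build_query, run_query):
    """Fire follow-up ES|QL queries for a weak claim, with strict termination."""
    seen = set()
    for attempt in range(MAX_FOLLOWUPS):
        if claim["status"] in ("verified", "superseded"):
            return claim                  # already settled by conflict resolution
        query = build_query(claim, attempt)
        if query in seen:
            return claim                  # duplicate query: no new evidence to gain
        seen.add(query)
        if run_query(query):              # any corroborating rows?
            claim["confidence"] = min(0.95, claim["confidence"] + 0.1)
            claim["status"] = "verified"
    return claim
```

Each guard maps to one termination condition from the paragraph above: the range cap, the seen set, and the status check.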
What We Learned
- Elastic Agent Builder is genuinely powerful for reasoning over structured data - Claude Opus's analysis of ES|QL results is remarkably accurate
- The trust hierarchy approach to multi-source conflict resolution is more robust than asking an LLM to arbitrate
- ES|QL is excellent for time-series incident investigation - the ability to bucket, filter, and aggregate in a single query is exactly what incident analysis needs
- Decoupling agents through a shared ledger (rather than function calls) makes the system dramatically easier to debug and extend
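As a concrete illustration of the ES|QL point above, a single query can filter, bucket, and aggregate the incident window at once. Index and field names and the timestamps are illustrative, and the syntax should be checked against your ES|QL version:

```python
# Illustrative ES|QL: p95 latency in 5-minute buckets across the incident hour.
ESQL_SPIKE_QUERY = """
FROM mars-metrics
| WHERE @timestamp >= "2025-06-01T14:00:00Z" AND @timestamp < "2025-06-01T15:00:00Z"
| STATS p95_latency = PERCENTILE(latency_ms, 95) BY window = BUCKET(@timestamp, 5 minutes)
| SORT window
"""
```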
Built With
- Elasticsearch Cloud Serverless: ES|QL, Hybrid Search, 335k+ documents
- Elastic Agent Builder: 4 custom tools, Claude Opus 4.5 reasoning
- Python 3.12 + FastAPI: agent orchestration, pipeline API
- Tavily: web search corroboration
- Chart.js + marked.js: real-time Evidence Heatmap UI