Inspiration

  • High-stakes domains (research, law, journalism, policy) break down when models “sound right” but fabricate citations or overstate what sources actually say.
  • Traditional RAG improves retrieval, but it does not verify that the model's claims accurately reflect the retrieved text.
  • We wanted a single pipeline that: debates a claim from both sides, forces every argument to be sourced, and then audits those sources automatically.

What it does

The Devil’s Advocate & The Jury is an evidence-validation debate system:

  • Input: a topic, hypothesis, or policy statement, plus a selected knowledge base (medical, legal, academic, news, etc.)
  • Debate: two agents argue opposing sides using only documents retrieved from Elasticsearch:

    • A1 (Proponent): builds the strongest evidence-backed case for the claim
    • A2 (Devil’s Advocate): rebuts, probes for gaps, surfaces counter-evidence and bias
  • Verification: a Jury agent re-checks every citation by fetching the cited documents and validating whether the quoted/claimed support is real.

  • Output: a transparent verdict report:

    • winner (or synthesis if both converge)
    • score breakdown (logic, evidence quality, rebuttal strength, citation accuracy)
    • citation audit trail (supported / weakly supported / unsupported / missing)

Core guarantee: no-hallucination contract — misquoted or invented citations are detected and penalized.

How we built it

Elasticsearch Indices (Ground Truth Store)

  • Designed a knowledge-base index optimized for retrieval + audit:

    • Hybrid search: BM25 + semantic retrieval
    • Chunked documents: each source split into passage chunks for precise citation
    • Stored metadata for provenance (title, author, publication date, source URL, domain, license, etc.)
  • Added citation-friendly fields so agents can reference exact material:

    • text, title, authors, year, source, url
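A minimal sketch of what such an index mapping could look like, written as a Python dict. The citation fields (text, title, authors, year, source, url) come from the list above; doc_id, chunk_id, and text_sparse are hypothetical names added for illustration, and the field types are assumptions rather than the project's actual mapping.

```python
# Hypothetical mapping for one passage chunk per Elasticsearch document.
# doc_id, chunk_id, and text_sparse are assumed names, not from the project.
KB_MAPPING = {
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},    # assumed: ID of the parent source
            "chunk_id": {"type": "keyword"},  # assumed: ID of the passage chunk
            "text": {"type": "text"},         # chunk body, searched with BM25
            "text_sparse": {"type": "sparse_vector"},  # assumed: ELSER token weights
            "title": {"type": "text"},
            "authors": {"type": "keyword"},
            "year": {"type": "integer"},
            "source": {"type": "keyword"},
            "url": {"type": "keyword"},
        }
    }
}

# Fields an agent needs in order to write a citation.
CITATION_FIELDS = ["text", "title", "authors", "year", "source", "url"]


def missing_citation_fields(mapping):
    """Return the citation fields that the mapping does not define."""
    props = mapping["mappings"]["properties"]
    return [name for name in CITATION_FIELDS if name not in props]
```

Keeping the IDs as keyword fields means a later audit step can look up a cited chunk by exact match instead of by relevance.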

ELSER (Semantic Retrieval)

  • Enabled ELSER-powered semantic search so agents can retrieve relevant evidence even when user phrasing doesn’t match keywords.
  • Used hybrid ranking to keep “exact legal/statutory wording” strong while still catching semantic paraphrases.
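The writeup does not say how the BM25 and ELSER result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below in plain Python as an assumption, not as the project's confirmed method. The chunk IDs and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk ranked well by both retrievers rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Illustrative result lists (made-up chunk IDs).
bm25_hits = ["c3", "c1", "c7"]    # keyword matches, e.g. exact statutory wording
elser_hits = ["c1", "c9", "c3"]   # semantic matches, e.g. paraphrases
fused = reciprocal_rank_fusion([bm25_hits, elser_hits])
```

Here c1 and c3 appear in both lists and therefore outrank c9 and c7, which each appear in only one.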

Elastic Agent Builder (Debaters + Jury)

  • Implemented three personas as distinct agents:

    • A1: evidence-first constructive case builder
    • A2: adversarial rebuttal + bias/assumption detector
    • Jury: citation auditor + multi-lens scorer (Empiricist / Philosopher / Economist / Historian / Humanist / Religious Scholar)
  • Exposed a single constrained tool to A1/A2:

    • search_knowledge_base(query, top_k, filters) returning chunks + IDs
  • Enforced a structured citation format that embeds IDs:

    • [Source: title | doc_id=... | chunk_id=... | year=...]
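Because the citation format embeds IDs, citations can be pulled out of an argument mechanically before any audit runs. The sketch below assumes the four-field layout shown above, IDs without spaces, and a four-digit year; the sample sentence and its IDs are invented for illustration.

```python
import re

# Matches: [Source: title | doc_id=... | chunk_id=... | year=....]
CITATION_RE = re.compile(
    r"\[Source:\s*(?P<title>[^|\]]+?)\s*"
    r"\|\s*doc_id=(?P<doc_id>[^|\]\s]+)\s*"
    r"\|\s*chunk_id=(?P<chunk_id>[^|\]\s]+)\s*"
    r"\|\s*year=(?P<year>\d{4})\s*\]"
)


def extract_citations(text):
    """Return one dict per well-formed citation found in the text."""
    return [m.groupdict() for m in CITATION_RE.finditer(text)]


# Made-up example argument with one citation.
sample = "Claim text [Source: Example Title | doc_id=d12 | chunk_id=c4 | year=2019]."
citations = extract_citations(sample)
```

A citation that does not match the pattern is simply not extracted, so a stricter pipeline might also count malformed bracketed references and report them.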

Elastic Workflows (Orchestration)

  • Built a turn-based workflow that chains calls without a custom application server:
  1. Normalize user prompt → generate retrieval queries
  2. A1 retrieval → argument with citations
  3. A2 reads A1 → retrieval → rebuttal with citations
  4. Repeat for N turns or stop on convergence/entropy
  5. Jury runs verification + scoring + final report
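The control flow of these five steps can be sketched in a few lines of Python. This only illustrates the turn order; the project runs it in Elastic Workflows, not in application code. The function names, the callable signatures, and the converged hook are all hypothetical, and the agents here are stubs.

```python
def normalize(prompt):
    """Step 1 (simplified): collapse whitespace in the user prompt."""
    return " ".join(prompt.split())


def run_debate(prompt, proponent, advocate, jury, max_turns=3, converged=None):
    """Alternate A1 and A2 for up to max_turns, then hand off to the Jury."""
    claim = normalize(prompt)
    transcript = []
    for turn in range(1, max_turns + 1):
        # Step 2: A1 retrieves evidence and argues with citations.
        transcript.append({"turn": turn, "agent": "A1",
                           "text": proponent(claim, transcript)})
        # Step 3: A2 reads the transcript so far and rebuts with citations.
        transcript.append({"turn": turn, "agent": "A2",
                           "text": advocate(claim, transcript)})
        # Step 4: optional early stop (the stopping rule is an assumption).
        if converged is not None and converged(transcript):
            break
    # Step 5: the Jury verifies, scores, and builds the final report.
    return jury(claim, transcript)


# Stub agents, for illustration only.
report = run_debate(
    "  Remote work   raises productivity ",
    proponent=lambda claim, t: f"A1 argues for: {claim}",
    advocate=lambda claim, t: f"A2 rebuts: {t[-1]['text']}",
    jury=lambda claim, t: {"claim": claim, "entries": len(t)},
    max_turns=2,
)
```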

Jury Verification with ES|QL (Citation Audit)

  • For each cited doc_id/chunk_id, Jury re-fetches the authoritative text using ES|QL (or equivalent retrieval step).
  • Validates support using a layered approach:

    • Existence check: cited IDs must exist
    • Quote/claim alignment: compare claimed support to retrieved chunk text
    • Semantic consistency: similarity check to detect paraphrase misrepresentation
  • Applies penalties:

    • fabricated/missing source → heavy citation accuracy penalty
    • weak or non-supporting quote → partial penalty + flagged in audit
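The layered check and the penalties above might be sketched as follows. Several things here are assumptions: an in-memory dict stands in for the ES|QL re-fetch, token overlap stands in for the semantic similarity check, and the penalty weights and the 0.5 threshold are placeholders. The status labels are the ones used in the audit trail.

```python
import re

# Placeholder penalty weights (assumed, not the project's values).
PENALTIES = {"supported": 0.0, "weakly supported": 0.25,
             "unsupported": 0.5, "missing": 1.0}


def _tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def _norm(s):
    return " ".join(s.lower().split())


def audit_citation(citation, store, overlap_threshold=0.5):
    """Classify one citation against the stored chunk text."""
    # Layer 1, existence check: the cited IDs must resolve to a chunk.
    chunk = store.get((citation["doc_id"], citation["chunk_id"]))
    if chunk is None:
        return "missing"
    # Layer 2, quote alignment: a verbatim quote must appear in the chunk.
    quote = citation.get("quote", "")
    if quote and _norm(quote) in _norm(chunk):
        return "supported"
    # Layer 3, consistency: crude token overlap as a stand-in for similarity.
    claim_tokens = _tokens(citation.get("claim", ""))
    if not claim_tokens:
        return "unsupported"
    overlap = len(claim_tokens & _tokens(chunk)) / len(claim_tokens)
    return "weakly supported" if overlap >= overlap_threshold else "unsupported"


def citation_accuracy(citations, store):
    """Return 1 minus the mean penalty, so 1.0 means every citation held up."""
    if not citations:
        return 1.0
    total = sum(PENALTIES[audit_citation(c, store)] for c in citations)
    return 1.0 - total / len(citations)


# Made-up chunk and citations, for illustration only.
store = {("d1", "c1"): "The pilot program reduced average wait times by twelve minutes."}
c_ok = {"doc_id": "d1", "chunk_id": "c1", "quote": "reduced average wait times"}
c_weak = {"doc_id": "d1", "chunk_id": "c1", "quote": "cut wait times in half",
          "claim": "the program reduced wait times"}
c_fake = {"doc_id": "d9", "chunk_id": "c1", "quote": "anything"}
```

In this example c_weak misquotes the chunk but its claim still overlaps with it, so it is flagged rather than rejected outright, while c_fake fails the existence check.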

Challenges we ran into

  • Citation precision: getting agents to cite the exact chunk that supports the exact claim, not just “something nearby.”
  • Chunking tradeoffs: chunks that are too small lose context; chunks that are too large reduce retrieval precision and make audits ambiguous.
  • Hallucination-resistant prompting: ensuring agents never “fill gaps” when retrieval is thin—especially in adversarial debate.
  • Bias and source quality: “more citations” isn’t better if they’re low-quality or ideologically skewed.
  • Scoring fairness: balancing evidence volume vs. evidence strength, and penalizing unsupported confidence.
  • Latency/throughput: multi-turn debate + per-citation verification can be expensive without batching and caching.
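For the latency point, one way to batch and cache per-citation verification is to deduplicate the cited (doc_id, chunk_id) pairs and fetch only the ones not seen before. This is a sketch under assumptions: fetch_many is a hypothetical callable that takes a list of keys and returns a dict of the chunks it found, and the cache is a plain dict.

```python
def batch_fetch(citations, fetch_many, cache):
    """Fetch every cited chunk with at most one backend call.

    Repeated citations of the same chunk are collapsed, and chunks
    (or confirmed misses) already in the cache are not fetched again.
    """
    keys = {(c["doc_id"], c["chunk_id"]) for c in citations}
    to_fetch = sorted(k for k in keys if k not in cache)
    if to_fetch:
        found = fetch_many(to_fetch)
        for key in to_fetch:
            # Cache misses as None so fabricated IDs are not re-queried.
            cache[key] = found.get(key)
    return {key: cache[key] for key in keys}


# Illustrative backend that records how it was called (made-up data).
calls = []
backend = {("d1", "c1"): "chunk one text"}


def fake_fetch_many(keys):
    calls.append(list(keys))
    return {k: backend[k] for k in keys if k in backend}


cache = {}
cited = [{"doc_id": "d1", "chunk_id": "c1"},
         {"doc_id": "d1", "chunk_id": "c1"},   # same chunk cited twice
         {"doc_id": "d9", "chunk_id": "c9"}]   # ID that does not exist
first = batch_fetch(cited, fake_fetch_many, cache)
second = batch_fetch(cited, fake_fetch_many, cache)
```

The second call is served entirely from the cache, which matters when both debaters cite the same chunks across several turns.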

Accomplishments that we're proud of

  • End-to-end pipeline: debate → verification → scored verdict in a single orchestrated workflow.
  • Hard anti-hallucination mechanism: citations are not just included—they’re audited against the original indexed text.
  • Configurable Jury lenses: the same debate can be evaluated through the Empiricist, Philosopher, Economist, Historian, Humanist, and Religious Scholar lenses.
  • Transparent outputs: users can see exactly which claims were supported, weakly supported, or rejected.
  • Hybrid retrieval with ELSER: improved recall for semantically related evidence while preserving keyword fidelity for technical domains.

What we learned

  • Verification is not a UX feature—it’s the product: without an audit layer, RAG still allows confident misrepresentation.
  • Good retrieval depends as much on index design and chunking as it does on embeddings.
  • Multi-agent debate increases coverage, but without guardrails it can amplify confident errors—so post-hoc verification must be first-class.
  • Scoring systems need to reward “correct uncertainty” and penalize “unsupported certainty,” not just rhetorical strength.

What's next for Devil's Advocate and Jury

  • Stronger citation alignment: highlight exact supporting spans (offset-level), not just chunk-level references.
  • Source quality weighting: automatic signals for peer-review status, jurisdictional authority, publisher reputation, recency, and conflicts of interest.
  • Bias reporting module: structured disclosure (who funded, editorial stance, selection effects) rather than a vague “bias flag.”
  • Active ingestion: on-demand crawling/upload + immediate indexing when a debate lacks the coverage it needs.
  • User-facing UI in Kibana: interactive verdict report with clickable citations, side-by-side quote comparison, and per-claim audit status.
  • Zotero/Mendeley integration: one-click knowledge base population and citation export.
  • Multilingual support: multilingual embeddings + cross-language retrieval for global policy and legal research.
  • Voice briefing mode: debate and verdict delivered as an audio summary with distinct agent voices.
