Inspiration
- High-stakes domains (research, law, journalism, policy) break down when models “sound right” but fabricate citations or overstate what sources actually say.
- Traditional RAG helps retrieval, but it doesn’t verify that the model’s claims accurately reflect the retrieved text.
- We wanted a single pipeline that: debates a claim from both sides, forces every argument to be sourced, and then audits those sources automatically.
What it does
The Devil’s Advocate & The Jury is an evidence-validation debate system:
- Input: a topic, hypothesis, or policy statement + selected knowledge base (medical, legal, academic, news, etc.)
- Debate: two agents argue opposing sides using only documents retrieved from Elasticsearch:
  - A1 (Proponent): builds the strongest evidence-backed case for the claim
  - A2 (Devil’s Advocate): rebuts, probes for gaps, surfaces counter-evidence and bias
- Verification: a Jury agent re-checks every citation by fetching the cited documents and validating whether the quoted/claimed support is real.
- Output: a transparent verdict report:
  - winner (or synthesis if both converge)
  - score breakdown (logic, evidence quality, rebuttal strength, citation accuracy)
  - citation audit trail (supported / weakly supported / unsupported / missing)
Core guarantee: no-hallucination contract — misquoted or invented citations are detected and penalized.
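To make the shape of the verdict report concrete, here is a minimal Python sketch. The class and field names (`VerdictReport`, `CitationAudit`, `ScoreBreakdown`, `AuditStatus`) are assumptions for illustration, not the project's actual schema. Only the four score dimensions and the four audit statuses come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AuditStatus(Enum):
    """The four audit outcomes listed in the verdict report."""
    SUPPORTED = "supported"
    WEAKLY_SUPPORTED = "weakly supported"
    UNSUPPORTED = "unsupported"
    MISSING = "missing"


@dataclass
class CitationAudit:
    doc_id: str
    chunk_id: str
    claim: str
    status: AuditStatus


@dataclass
class ScoreBreakdown:
    logic: float
    evidence_quality: float
    rebuttal_strength: float
    citation_accuracy: float


@dataclass
class VerdictReport:
    winner: Optional[str]      # hypothetical convention: "A1", "A2", or None on convergence
    synthesis: Optional[str]   # filled in when both sides converge
    scores: dict[str, ScoreBreakdown] = field(default_factory=dict)
    audit_trail: list[CitationAudit] = field(default_factory=list)

    def flagged(self) -> list[CitationAudit]:
        """Return every citation that was not fully supported."""
        return [a for a in self.audit_trail if a.status is not AuditStatus.SUPPORTED]
```

A report built this way lets a reader filter the audit trail down to the citations that need attention by calling `flagged()`.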
How we built it
Elasticsearch Indices (Ground Truth Store)
Designed a knowledge-base index optimized for retrieval + audit:
- Hybrid search: BM25 + semantic retrieval
- Chunked documents: each source split into passage chunks for precise citation
- Stored metadata for provenance (title, author, publication date, source URL, domain, license, etc.)
Added citation-friendly fields so agents can reference exact material:
text, title, authors, year, source, url
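A chunk-level index along these lines might be declared roughly as follows. This is a sketch under assumptions: the `doc_id`, `chunk_id`, and `text_sparse` field names and all of the type choices are hypothetical, and only the six citation fields are taken from the list above.

```python
# Hypothetical mapping for a chunk-level knowledge-base index.
# Each indexed document is one passage chunk of a larger source.
KB_CHUNK_MAPPING = {
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},            # assumed: ID of the parent source
            "chunk_id": {"type": "keyword"},          # assumed: ID of this passage
            "text": {"type": "text"},                 # BM25-searchable chunk text
            "text_sparse": {"type": "sparse_vector"}, # assumed field for ELSER token weights
            "title": {"type": "text"},
            "authors": {"type": "keyword"},
            "year": {"type": "integer"},
            "source": {"type": "keyword"},
            "url": {"type": "keyword"},
        }
    }
}

# The citation-friendly fields named in the write-up.
CITATION_FIELDS = ["text", "title", "authors", "year", "source", "url"]
```

Keeping `doc_id` and `chunk_id` as exact-match `keyword` fields is what makes a later audit lookup unambiguous.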
ELSER (Semantic Retrieval)
- Enabled ELSER-powered semantic search so agents can retrieve relevant evidence even when user phrasing doesn’t match keywords.
- Used hybrid ranking to keep “exact legal/statutory wording” strong while still catching semantic paraphrases.
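The write-up does not say how the keyword and semantic result lists are combined. One common option, which Elasticsearch also offers, is reciprocal rank fusion (RRF). The pure-Python sketch below shows the idea on two hypothetical hit lists. The function name, the chunk IDs, and the constant `k=60` are illustrative assumptions, not the project's configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) to the score of every ID it contains.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID so ties are deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]      # hypothetical keyword (BM25) ranking
semantic_hits = ["c1", "c9", "c3"]  # hypothetical ELSER ranking
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, while a chunk found by only one retriever still survives in the fused list.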
Elastic Agent Builder (Debaters + Jury)
Implemented three personas as distinct agents:
- A1: evidence-first constructive case builder
- A2: adversarial rebuttal + bias/assumption detector
- Jury: citation auditor + multi-lens scorer (Empiricist / Philosopher / Economist / Historian / Humanist / Religious Scholar)
Exposed a single constrained tool to A1/A2:
search_knowledge_base(query, top_k, filters), returning chunks + IDs
Enforced a structured citation format that embeds IDs:
[Source: title | doc_id=... | chunk_id=... | year=...]
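Because the citation format embeds IDs, a downstream step can extract citations mechanically. The following sketch parses the format shown above with a regular expression. The `Citation` type and `extract_citations` helper are hypothetical names, and the sketch assumes titles contain no `|` or `]` characters and that years have four digits.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    title: str
    doc_id: str
    chunk_id: str
    year: int


# Matches: [Source: title | doc_id=... | chunk_id=... | year=...]
CITATION_RE = re.compile(
    r"\[Source:\s*(?P<title>[^|\]]+?)\s*"
    r"\|\s*doc_id=(?P<doc_id>[^|\]\s]+)\s*"
    r"\|\s*chunk_id=(?P<chunk_id>[^|\]\s]+)\s*"
    r"\|\s*year=(?P<year>\d{4})\s*\]"
)


def extract_citations(argument: str) -> list[Citation]:
    """Return every well-formed citation found in an agent's argument text."""
    return [
        Citation(m["title"], m["doc_id"], m["chunk_id"], int(m["year"]))
        for m in CITATION_RE.finditer(argument)
    ]


sample = "Statins reduce risk [Source: Heart Study | doc_id=d42 | chunk_id=c7 | year=2019]."
print(extract_citations(sample))
```

A citation that fails to parse can be treated as malformed before any lookup is attempted, which keeps the audit step simple.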
Elastic Workflows (Orchestration)
- Built a turn-based workflow that chains calls without a custom application server:
  - Normalize user prompt → generate retrieval queries
  - A1 retrieval → argument with citations
  - A2 reads A1 → retrieval → rebuttal with citations
  - Repeat for N turns or stop on convergence/entropy
  - Jury runs verification + scoring + final report
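The control flow of the steps above can be sketched in plain Python. This is not how Elastic Workflows is configured. It is only an illustration of the turn order, with the agents passed in as ordinary callables. `run_debate`, the stub agents, and the `converged` hook are hypothetical names.

```python
def run_debate(claim, proponent, advocate, jury, max_turns=3, converged=None):
    """Alternate A1 and A2 for up to max_turns rounds, then hand off to the Jury.

    proponent/advocate: callables (claim, transcript) -> argument text
    jury: callable (claim, transcript) -> final report
    converged: optional callable (transcript) -> bool for early stopping
    """
    claim = claim.strip()  # stand-in for the prompt normalization step
    transcript = []
    for _ in range(max_turns):
        transcript.append(proponent(claim, transcript))  # A1 argues with citations
        transcript.append(advocate(claim, transcript))   # A2 reads A1, then rebuts
        if converged is not None and converged(transcript):
            break
    return jury(claim, transcript)  # verification, scoring, final report


# Stub agents so the sketch runs without any model or search backend.
def stub_proponent(claim, transcript):
    return f"A1 round {len(transcript) // 2 + 1}: case for '{claim}'"


def stub_advocate(claim, transcript):
    return f"A2 round {len(transcript) // 2 + 1}: rebuttal of '{transcript[-1]}'"


def stub_jury(claim, transcript):
    return {"claim": claim, "turns": len(transcript)}


report = run_debate(" Remote work raises productivity ", stub_proponent,
                    stub_advocate, stub_jury, max_turns=2)
```

Passing the full transcript to each agent is one simple way to let A2 read A1's latest argument before retrieving its own evidence.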
Jury Verification with ES|QL (Citation Audit)
- For each cited doc_id/chunk_id, the Jury re-fetches the authoritative text using ES|QL (or an equivalent retrieval step).
Validates support using a layered approach:
- Existence check: cited IDs must exist
- Quote/claim alignment: compare claimed support to retrieved chunk text
- Semantic consistency: similarity check to detect paraphrase misrepresentation
Applies penalties:
- fabricated/missing source → heavy citation accuracy penalty
- weak or non-supporting quote → partial penalty + flagged in audit
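The layered check and the penalties can be illustrated with a small Python sketch. Several things here are assumptions rather than the project's implementation: the index and field names in the ES|QL request, the use of `difflib` string similarity as a cheap stand-in for the semantic consistency check, the two thresholds, and the numeric penalty sizes (the write-up only says "heavy" and "partial").

```python
from difflib import SequenceMatcher

# Illustrative penalty sizes per audit status (assumed values).
PENALTIES = {
    "missing": 1.0,           # fabricated or missing source: heavy penalty
    "unsupported": 0.5,       # non-supporting quote: partial penalty
    "weakly supported": 0.25, # weak quote: smaller partial penalty
    "supported": 0.0,
}


def esql_fetch_request(index, doc_id, chunk_id):
    """Build an ES|QL request body that re-fetches one cited chunk.

    Uses positional `?` parameters so the IDs are not spliced into the query text.
    """
    query = f"FROM {index} | WHERE doc_id == ? AND chunk_id == ? | KEEP text | LIMIT 1"
    return {"query": query, "params": [doc_id, chunk_id]}


def audit_citation(quote, chunk_text, strong=0.8, weak=0.5):
    """Classify one citation given the re-fetched chunk text (None if not found)."""
    if chunk_text is None:                   # 1. existence check
        return "missing"
    if quote.lower() in chunk_text.lower():  # 2. quote/claim alignment (verbatim)
        return "supported"
    # 3. similarity check; a real system would use embeddings instead of difflib
    ratio = SequenceMatcher(None, quote.lower(), chunk_text.lower()).ratio()
    if ratio >= strong:
        return "supported"
    if ratio >= weak:
        return "weakly supported"
    return "unsupported"


def citation_accuracy(statuses):
    """Score in [0, 1]: 1 minus the mean penalty over all audited citations."""
    if not statuses:
        return 0.0
    return max(0.0, 1.0 - sum(PENALTIES[s] for s in statuses) / len(statuses))
```

For example, one verbatim quote plus one citation whose IDs do not exist yields statuses `["supported", "missing"]` and a citation accuracy of 0.5 under these assumed penalties.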
Challenges we ran into
- Citation precision: getting agents to cite the exact chunk that supports the exact claim, not just “something nearby.”
- Chunking tradeoffs: chunks too small lose context; too large reduce retrieval accuracy and make audits ambiguous.
- Hallucination-resistant prompting: ensuring agents never “fill gaps” when retrieval is thin—especially in adversarial debate.
- Bias and source quality: “more citations” isn’t better if they’re low-quality or ideologically skewed.
- Scoring fairness: balancing evidence volume vs. evidence strength, and penalizing unsupported confidence.
- Latency/throughput: multi-turn debate + per-citation verification can be expensive without batching and caching.
Accomplishments that we're proud of
- End-to-end pipeline: debate → verification → scored verdict in a single orchestrated workflow.
- Hard anti-hallucination mechanism: citations are not just included—they’re audited against the original indexed text.
- Configurable Jury lenses: the same debate can be evaluated through empirical, philosophical, economic, historical, humanist, and religious frames.
- Transparent outputs: users can see exactly which claims were supported, weakly supported, or rejected.
- Hybrid retrieval with ELSER: improved recall for semantically related evidence while preserving keyword fidelity for technical domains.
What we learned
- Verification is not a UX feature—it’s the product: without an audit layer, RAG still allows confident misrepresentation.
- Good retrieval depends as much on index design and chunking as it does on embeddings.
- Multi-agent debate increases coverage, but without guardrails it can amplify confident errors—so post-hoc verification must be first-class.
- Scoring systems need to reward “correct uncertainty” and penalize “unsupported certainty,” not just rhetorical strength.
What's next for Devil's Advocate and Jury
- Stronger citation alignment: highlight exact supporting spans (offset-level), not just chunk-level references.
- Source quality weighting: automatic signals for peer-review status, jurisdictional authority, publisher reputation, recency, and conflicts of interest.
- Bias reporting module: structured disclosure (who funded, editorial stance, selection effects) rather than a vague “bias flag.”
- Active ingestion: on-demand crawling/upload + immediate indexing when debate needs missing coverage.
- User-facing UI in Kibana: interactive verdict report with clickable citations, side-by-side quote comparison, and per-claim audit status.
- Zotero/Mendeley integration: one-click knowledge base population and citation export.
- Multilingual support: multilingual embeddings + cross-language retrieval for global policy and legal research.
- Voice briefing mode: debate and verdict delivered as an audio summary with distinct agent voices.
Built With
- agent
- agent-builder
- elasticsearch
- elser
- indices
- workflows