Inspiration

LLMs are rapidly becoming default study tools, tutoring assistants, and content generators in education. When they confabulate, students often have no way to tell. The standard mitigation is "check the sources," but that assumes the student already knows enough to evaluate the answer, which defeats the purpose of asking in the first place.

We wanted to attack the problem from the model side: is there a signal inside the model's hidden states that distinguishes confabulated answers from factual ones? If so, that signal could be used to flag unreliable outputs before they reach a student.

We also suspected that the benchmarks commonly used to evaluate hallucination detectors were themselves misleading. On raw HaluEval QA, a classifier that looks only at response length achieves ~0.97 AUC, meaning a detector can score well without learning anything about truthfulness. If evaluation tools are built on broken benchmarks, they provide false confidence rather than real safety.
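To make the shortcut concrete, here is a minimal sketch (not the project's audit code) of how one would measure the length shortcut: score each answer by its token count and compute ROC AUC via the rank-based (Mann-Whitney) formulation. The length values below are toy stand-ins, not HaluEval data.

```python
# Sketch: how well does answer length alone separate the two classes?
# AUC here is computed via the Mann-Whitney formulation:
# AUC = P(score_pos > score_neg), with ties counted as 0.5.

def auc_from_scores(pos_scores, neg_scores):
    """Rank-based ROC AUC for one scalar feature."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy stand-in: hallucinated answers tend to run longer in raw HaluEval QA.
hallucinated_lengths = [42, 51, 38, 60, 47]   # "positive" class = hallucinated
factual_lengths = [12, 18, 9, 25, 15]

print(auc_from_scores(hallucinated_lengths, factual_lengths))  # → 1.0 on this toy data
```

When a single nuisance feature like this separates the classes almost perfectly, any detector correlated with it inherits the score for free.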

What it does

Confab extracts hidden-state update vectors from an LLM's forward pass and computes Update Direction Coherence (UDC): the average cosine similarity between consecutive layer-to-layer update vectors across the layer stack. The core idea is that when a model processes a factual answer, neighboring updates tend to point in similar directions, and when it confabulates, those directions diverge.
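The metric for a single token can be sketched in a few lines of numpy (names and shapes here are our assumptions for illustration, not the udc_engine.py API): stack the per-layer hidden states, difference them to get update vectors, and average the cosine similarity of consecutive pairs.

```python
import numpy as np

# UDC for ONE token. `hidden` has shape (num_layers + 1, dim): the hidden
# state after the embedding layer and after each transformer block.
def udc_token(hidden: np.ndarray) -> float:
    deltas = np.diff(hidden, axis=0)            # layer-to-layer update vectors
    a, b = deltas[:-1], deltas[1:]              # consecutive pairs of updates
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return float(cos.mean())

# A coherent trajectory (all updates point the same way) scores near 1 ...
coherent = np.cumsum(np.ones((8, 4)), axis=0)
# ... while random, conflicting updates score near 0 in expectation.
rng = np.random.default_rng(0)
noisy = np.cumsum(rng.standard_normal((8, 4)), axis=0)

print(udc_token(coherent))  # close to 1.0
print(udc_token(noisy))     # much lower
```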

We pair this with BENCH-2, a length-controlled subset of HaluEval QA where factual and hallucinated answers for each question differ by at most 2 tokens, eliminating the length shortcut entirely.
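The filtering rule itself is simple; here is a hedged sketch (the field names and whitespace tokenizer are our assumptions, not the actual HaluEval schema or the project's tokenizer):

```python
# Keep a question pair only if the factual and hallucinated answers differ
# by at most `max_diff` tokens, so length cannot act as a shortcut feature.
def bench2_filter(pairs, tokenize=str.split, max_diff=2):
    return [
        p for p in pairs
        if abs(len(tokenize(p["right_answer"]))
               - len(tokenize(p["hallucinated_answer"]))) <= max_diff
    ]

pairs = [
    {"right_answer": "Paris",
     "hallucinated_answer": "Lyon"},            # diff 0 tokens: kept
    {"right_answer": "1969",
     "hallucinated_answer": "It happened sometime in the early 1970s"},  # diff 6: dropped
]
print(len(bench2_filter(pairs)))  # → 1
```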

On Gemma 4 E2B:

  • udc_median_tok AUC on BENCH-2: 0.7429
  • Partial AUC after length control: 0.7363
  • BENCH-2 length baseline: 0.5632
  • TruthfulQA AUC: 0.5008 (expected, since TruthfulQA tests internalized misconceptions rather than internal conflict)

The near-chance TruthfulQA result defines an important scope boundary: UDC detects confabulation (the model fighting itself), not misconception (the model smoothly producing a false belief it has internalized). For educational applications, this means the method is best suited for catching cases where the model generates plausible-sounding answers it does not internally support, which is exactly the failure mode most likely to mislead students.

The demo app includes curated factual vs. confabulated examples with token-level UDC heatmaps, focused hidden-state geometry visualizations, and optional Gemini-powered natural-language explanations. A separate benchmark audit tab presents the shortcut analysis, BENCH-2 results, and TruthfulQA scope boundary.

How we built it

The core engine (udc_engine.py) hooks into a HuggingFace Transformers forward pass to extract per-layer hidden states, compute layer-to-layer update vectors $\delta^{(l)} = h^{(l+1)} - h^{(l)}$, and calculate cosine similarities between consecutive updates. Per-token UDC scores are aggregated via the median across answer tokens.
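The vectorized version of that step, from a full sequence to the udc_median_tok score, can be sketched as follows (shapes and names are our assumptions; the real engine consumes a HuggingFace forward pass with output_hidden_states=True, which yields num_layers + 1 arrays of shape (seq_len, dim) per sequence):

```python
import numpy as np

def udc_per_token(hidden_states):
    """Per-token UDC from a list of (seq_len, dim) per-layer hidden states."""
    h = np.stack(hidden_states)                 # (L+1, seq_len, dim)
    d = np.diff(h, axis=0)                      # (L, seq_len, dim) update vectors
    a, b = d[:-1], d[1:]                        # consecutive update pairs
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )                                           # (L-1, seq_len) cosine sims
    return cos.mean(axis=0)                     # one UDC score per token

def udc_median_tok(hidden_states, answer_slice):
    """Aggregate per-token scores over the answer span via the median."""
    return float(np.median(udc_per_token(hidden_states)[answer_slice]))

# Synthetic stand-in for outputs.hidden_states: 12 layers + embeddings,
# sequence of 10 tokens, hidden dim 16; answer tokens occupy positions 6-9.
rng = np.random.default_rng(1)
fake_states = [rng.standard_normal((10, 16)) for _ in range(13)]
print(udc_median_tok(fake_states, slice(6, 10)))
```

Restricting the median to answer tokens keeps prompt and question tokens from diluting the signal.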

We ran all Gemma experiments on Google Colab (L4 GPUs). The pipeline goes: smoke test, BENCH-2 benchmark run, calibration threshold fitting, feature sweep across all metrics, demo case generation, and 3D geometry precomputation.

The demo app is built in Streamlit with two tabs: a curated demo tab with token-level heatmaps and focused hidden-state geometry visualizations, and a benchmark audit tab presenting the shortcut analysis and controlled results. An optional Gemini API layer provides natural-language explanations of the UDC scores.

Challenges

The biggest challenge was the benchmark itself. Raw HaluEval looked like it validated everything, but that validation was hollow. Building BENCH-2 and accepting the lower (but real) numbers forced a much more honest framing.

An earlier metric we explored, TLE (an endpoint-style statistic comparing update norms at early vs. late layers), turned out to be weaker than UDC under controlled evaluation. Pivoting away from TLE mid-hackathon was a necessary design correction.

We also attempted to push UDC into broader product-style settings (administrative and clinical validation workflows). Those results were weak and inconsistent, which forced us to narrow the claim to what the data actually supports: a confabulation monitor under controlled evaluation, not a universal truth detector.

Finally, running the full live model flow was too heavy for a laptop demo, so we precomputed all Gemma artifacts and built the app around curated demo cases with the benchmark evidence in a separate audit tab.

What's next for Confab

The immediate priority is broader controlled evaluation: testing UDC across more models and more categories of error, including factual retrieval failures, reasoning mistakes, and context-grounding errors. The goal is to map out exactly where the hidden-state signal holds up and where it breaks down, with the same shortcut-control discipline we applied here.
