VERIFAI

Inspiration

Medical AI systems are getting stronger, but many still behave like black boxes in the moments that matter most. In clinical settings, that is a trust problem, not just a UX problem. VERIFAI was inspired by a simple idea: diagnosis should look more like a multidisciplinary case conference than a single model output.

Instead of asking one model for one answer, we asked: what if independent specialist agents could critique each other, reference evidence, express uncertainty, and invite human correction before finalizing a decision? That question became VERIFAI.

What it does

VERIFAI is a multi-agent diagnostic workflow for chest imaging that combines:

Visual reasoning from radiology models
Chefer-based explainability heatmaps for visual attribution
Structured pathology labeling
Clinical context from patient history
Literature-backed evidence retrieval
Adversarial safety critique and debate

The system produces:

A proposed diagnosis with calibrated confidence
A transparent execution trace (agent-by-agent)
Evidence packets linking image findings, history, and literature
Human-in-the-loop review with feedback-driven reruns

Rather than presenting confidence as a static probability, VERIFAI updates confidence as evidence quality and inter-agent agreement evolve.

Agent roles (one-line each)

Radiologist: Generates primary findings and impression from chest imaging while producing uncertainty-aware visual analysis.
CheXbert: Converts free-text/visual findings into structured pathology labels for downstream consistency checks.
Historian: Retrieves and summarizes patient-context signals from longitudinal clinical records and FHIR history.
Literature: Pulls and synthesizes relevant medical evidence to support or challenge candidate diagnoses.
Critic: Stress-tests the current diagnosis for overconfidence, missing differentials, and safety risks.
Debate: Reconciles disagreements between agents and drives consensus-building before finalization.
Validator: Applies final safety and consistency checks before the diagnosis is presented for approval.
Feedback: Injects clinician corrections and routes the workflow into a feedback-aware rerun path.

How we built it

We built VERIFAI as a modular orchestration stack with explicit interfaces between reasoning components.

Workflow engine: graph-based execution with resumable sessions
Agent layer: radiologist, chexbert, historian, literature, critic, debate, validator
Backend: API routes for start, stream, status, and human-review resume
Observability: structured logging, trace persistence, and session-level auditability
Frontend: live execution feed, audit trail, safety view, and review controls
Explainability: Chefer heatmap implementation to visualize region-level evidence behind visual predictions
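To make the "resumable sessions" and "human-review resume" ideas concrete, here is a minimal sketch in plain Python. It is not the project's LangGraph code. The `Session` class, the `run` and `resume` functions, and the `needs_review` flag are all hypothetical names chosen for illustration, and the pipeline is simplified to a linear sequence rather than a full graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical agent signature: reads the shared state, returns updates to merge in.
Agent = Callable[[Dict], Dict]
Pipeline = List[Tuple[str, Agent]]


@dataclass
class Session:
    """Everything needed to pause a run and pick it up again later."""
    state: Dict = field(default_factory=dict)
    next_step: int = 0                      # index of the next agent to run
    trace: List[str] = field(default_factory=list)
    awaiting_review: bool = False


def run(session: Session, pipeline: Pipeline) -> Session:
    """Run agents in order until the end, or until one asks for human review."""
    while session.next_step < len(pipeline):
        name, agent = pipeline[session.next_step]
        session.state.update(agent(session.state))
        session.trace.append(name)
        session.next_step += 1
        # An agent pauses the workflow by setting this (assumed) flag.
        if session.state.pop("needs_review", False):
            session.awaiting_review = True
            return session
    session.awaiting_review = False
    return session


def resume(session: Session, pipeline: Pipeline, approved: bool,
           feedback: Optional[str] = None) -> Session:
    """Continue after review; a rejection stores feedback and reruns from the start."""
    session.awaiting_review = False
    if not approved:
        session.state["feedback"] = feedback
        session.next_step = 0
    return run(session, pipeline)
```

Because the session keeps its own `next_step` and `trace`, it could be serialized between the pause and the resume call, which is the property a status or resume API route would rely on.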

A core design choice was uncertainty-aware reasoning. We model system confidence as a function of uncertainty, evidence strength, and agreement:

$$ C_{final} = \sigma\left(\alpha E + \beta A - \gamma U\right) $$

where:

C_final is the calibrated final confidence
E is the aggregated evidence quality
A is inter-agent alignment (consensus)
U is system uncertainty
σ(·) is a calibration transform
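The confidence formula above can be sketched in a few lines of Python. This is an illustration only: it assumes σ is the logistic sigmoid (the write-up only calls it a calibration transform), and the default weights of 1.0 for `alpha`, `beta`, and `gamma` are placeholders, not values from the project.

```python
import math


def final_confidence(evidence: float, agreement: float, uncertainty: float,
                     alpha: float = 1.0, beta: float = 1.0,
                     gamma: float = 1.0) -> float:
    """C_final = sigma(alpha*E + beta*A - gamma*U).

    Assumption: sigma is the logistic sigmoid, which maps any real score to (0, 1).
    """
    score = alpha * evidence + beta * agreement - gamma * uncertainty
    return 1.0 / (1.0 + math.exp(-score))
```

With this choice of σ, stronger evidence or agreement pushes confidence toward 1, higher uncertainty pushes it toward 0, and a score of zero yields exactly 0.5.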

We also track uncertainty propagation across stages:

$$ U_{t+1} = \max\big(0,\; U_t - \eta\,IG_t + \lambda\,D_t\big) $$

where:

IG_t is the information gain at step t
D_t is the disagreement penalty at step t
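The update rule above can be traced stage by stage with a short sketch. The function names and the default values for `eta` and `lam` (standing in for η and λ) are assumptions for illustration, not settings from the project.

```python
from typing import Iterable, List, Tuple


def propagate_uncertainty(u: float, info_gain: float, disagreement: float,
                          eta: float = 0.5, lam: float = 0.5) -> float:
    """U_{t+1} = max(0, U_t - eta*IG_t + lam*D_t)."""
    return max(0.0, u - eta * info_gain + lam * disagreement)


def uncertainty_trajectory(u0: float, stages: Iterable[Tuple[float, float]],
                           eta: float = 0.5, lam: float = 0.5) -> List[float]:
    """Apply the update once per stage, where each stage is (IG_t, D_t)."""
    history = [u0]
    for info_gain, disagreement in stages:
        history.append(propagate_uncertainty(history[-1], info_gain,
                                             disagreement, eta, lam))
    return history
```

The `max(0, ...)` clamp means a large information gain can drive uncertainty to zero but never below it, while disagreement between agents raises it again at the next stage.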

Challenges we ran into

Building an explainable multi-agent system surfaced engineering and product challenges beyond model inference:

Keeping agents independent while preventing contradictory drift
Managing resumable workflow state across interruptions and feedback loops
Preventing silent fallbacks that mask real runtime/model failures
Balancing latency against evidence depth in retrieval and critique
Representing uncertainty in ways clinicians can actually act on

We also had to harden error handling so that failures are visible in the UI and audit trail, not hidden behind placeholder outputs that look like real results.
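One way to avoid silent fallbacks is to record every agent step as an explicit success or failure instead of substituting a default output. The sketch below shows that pattern under stated assumptions: `StepResult` and `run_step` are hypothetical names, and the audit log is a plain list standing in for whatever trace store the system actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class StepResult:
    """One audit-trail entry; a failed step carries an error, never fake output."""
    agent: str
    ok: bool
    output: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


def run_step(agent_name: str, fn: Callable[[Dict[str, Any]], Dict[str, Any]],
             state: Dict[str, Any], audit_log: List[StepResult]) -> StepResult:
    """Run one agent and log the outcome, whether it succeeded or raised."""
    try:
        result = StepResult(agent_name, True, output=fn(state))
    except Exception as exc:
        # Keep the exception type and message so the UI can show what failed.
        result = StepResult(agent_name, False,
                            error=f"{type(exc).__name__}: {exc}")
    audit_log.append(result)
    return result
```

A caller can then stop the workflow or flag the session whenever `ok` is false, so a model or runtime failure shows up in the trace as a failure.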

Accomplishments that we're proud of

VERIFAI moved from concept to a working, inspectable diagnostic system with real operational controls.

End-to-end multi-agent pipeline with auditable reasoning steps
Human-in-the-loop rejection/approval with workflow resume
Real-time trace streaming from backend to frontend
Chefer heatmap implementation integrated into the visual evidence workflow
Evidence-centric output instead of confidence-only output
Improved reliability through explicit failure surfacing and safer state handling

Most importantly, we built for clinical trust: every important claim can be traced, challenged, and revised.

What we learned

This project reinforced that high-performing AI is not enough for high-stakes deployment.

Transparency is foundational, not optional
Multi-agent design improves robustness only with strong orchestration discipline
Error visibility is a safety feature
Human feedback loops must be first-class, not an afterthought
Structured observability dramatically accelerates iteration and debugging

We also learned that uncertainty needs to be operationalized, not merely reported.

What's next for VERIFAI

Our roadmap focuses on deeper clinical validity, safer deployment, and stronger human-AI collaboration.

Improve calibration across rare and ambiguous pathologies
Expand longitudinal patient-context integration from FHIR timelines
Upgrade retrieval quality scoring and evidence ranking
Add richer clinician controls for guided reruns and rationale comparison
Run broader benchmarking and prospective pilot evaluations

Long term, we want VERIFAI to function as a clinically trusted second reader that is transparent by design and continuously improvable through expert feedback.

Built With

fastapi
langgraph
next.js
python
pytorch

Updates

Aayush_Kumar Kumar started this project — Apr 17, 2026 03:00 AM EDT
