Inspiration

Medical AI systems are getting stronger, but many still behave like black boxes in the moments that matter most. In clinical settings, that is a trust problem, not just a UX problem. VERIFAI was inspired by a simple idea: diagnosis should look more like a multidisciplinary case conference than a single model output.

Instead of asking one model for one answer, we asked: what if independent specialist agents could critique each other, reference evidence, express uncertainty, and invite human correction before finalizing a decision? That question became VERIFAI.

What it does

VERIFAI is a multi-agent diagnostic workflow for chest imaging that combines:

  • Visual reasoning from radiology models
  • Chefer-based explainability heatmaps for visual attribution
  • Structured pathology labeling
  • Clinical context from patient history
  • Literature-backed evidence retrieval
  • Adversarial safety critique and debate

The system produces:

  • A proposed diagnosis with calibrated confidence
  • A transparent execution trace (agent-by-agent)
  • Evidence packets linking image findings, history, and literature
  • Human-in-the-loop review with feedback-driven reruns

Rather than presenting confidence as a static probability, VERIFAI updates confidence as evidence quality and inter-agent agreement evolve.

Agent roles (one-line each)

  • Radiologist: Generates primary findings and impression from chest imaging while producing uncertainty-aware visual analysis.
  • CheXbert: Converts free-text/visual findings into structured pathology labels for downstream consistency checks.
  • Historian: Retrieves and summarizes patient-context signals from longitudinal clinical records and FHIR history.
  • Literature: Pulls and synthesizes relevant medical evidence to support or challenge candidate diagnoses.
  • Critic: Stress-tests the current diagnosis for overconfidence, missing differentials, and safety risks.
  • Debate: Reconciles disagreements between agents and drives consensus-building before finalization.
  • Validator: Applies final safety and consistency checks before the diagnosis is presented for approval.
  • Feedback: Injects clinician corrections and routes the workflow into a feedback-aware rerun path.

How we built it

We built VERIFAI as a modular orchestration stack with explicit interfaces between reasoning components.

  • Workflow engine: graph-based execution with resumable sessions
  • Agent layer: radiologist, chexbert, historian, literature, critic, debate, validator
  • Backend: API routes for start, stream, status, and human-review resume
  • Observability: structured logging, trace persistence, and session-level auditability
  • Frontend: live execution feed, audit trail, safety view, and review controls
  • Explainability: Chefer heatmap implementation to visualize region-level evidence behind visual predictions
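As a rough illustration of what "resumable" means for the workflow engine, here is a minimal sketch that uses a linear chain as the simplest possible graph. The `Session`, `run`, and `approve` names, the node functions, and the pause-after-validator behavior are all assumptions made for this example, not VERIFAI's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical resumable session: state plus a cursor into the node list."""
    state: dict = field(default_factory=dict)
    next_step: int = 0            # index of the next node to execute
    awaiting_review: bool = False


def run(session, nodes, pause_after):
    """Run nodes in order from the cursor; pause after the node named pause_after."""
    if session.awaiting_review:
        raise RuntimeError("session is paused; call approve() before resuming")
    while session.next_step < len(nodes):
        name, fn = nodes[session.next_step]
        session.state[name] = fn(session.state)
        session.next_step += 1
        if name == pause_after:
            # Stop here and wait for a human decision.
            session.awaiting_review = True
            break
    return session


def approve(session):
    """Clear the review gate so the next run() call continues from the cursor."""
    session.awaiting_review = False


# Illustrative node functions only; real agents would call models and retrieval.
nodes = [
    ("radiologist", lambda s: "findings"),
    ("validator", lambda s: "checked"),
    ("finalize", lambda s: "done"),
]
```

Because the cursor and state live on the session object, persisting that object between calls is enough to resume after an interruption or a clinician review in this simplified setting.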

A core design choice was uncertainty-aware reasoning. We model system confidence as a function of uncertainty, evidence strength, and agreement:

$$ C_{final} = \sigma\left(\alpha E + \beta A - \gamma U\right) $$

where:

  • C_final is the calibrated final confidence
  • E is the aggregated evidence quality
  • A is inter-agent alignment (consensus)
  • U is system uncertainty
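  • α, β, and γ are weights on evidence, agreement, and uncertainty respectively
  • σ(·) is a calibration transform that maps the weighted score to a confidence value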
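A minimal sketch of the confidence formula, assuming σ is the logistic sigmoid and that the weights default to 1.0. Both are assumptions for illustration: the write-up only says σ is a calibration transform, and the function and parameter names are hypothetical.

```python
import math


def final_confidence(evidence, agreement, uncertainty,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """C_final = sigma(alpha*E + beta*A - gamma*U), with sigma assumed logistic."""
    score = alpha * evidence + beta * agreement - gamma * uncertainty
    return 1.0 / (1.0 + math.exp(-score))
```

With this choice, evidence and agreement push confidence toward 1, uncertainty pushes it toward 0, and a score of exactly zero gives 0.5.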

We also track uncertainty propagation across stages:

$$ U_{t+1} = \max\big(0,\; U_t - \eta\,IG_t + \lambda\,D_t\big) $$

where:

  • IG_t is the information gain at step t
  • D_t is the disagreement penalty at step t
  • η and λ are weights on information gain and disagreement respectively
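The update rule can be traced step by step with a short sketch. The function name, the default weights of 0.5, and the input values are illustrative assumptions, not measurements from the system.

```python
def propagate_uncertainty(u0, steps, eta=0.5, lam=0.5):
    """Apply U_{t+1} = max(0, U_t - eta*IG_t + lam*D_t) over (IG_t, D_t) pairs.

    Returns the full trace [U_0, U_1, ...] so each stage can be inspected.
    """
    trace = [u0]
    u = u0
    for info_gain, disagreement in steps:
        u = max(0.0, u - eta * info_gain + lam * disagreement)
        trace.append(u)
    return trace


# Three hypothetical stages: modest gain, large gain (floored at 0), then disagreement.
trace = propagate_uncertainty(1.0, [(1.0, 0.0), (4.0, 0.0), (0.0, 1.0)])
# trace == [1.0, 0.5, 0.0, 0.5]
```

The second step shows the floor at zero, and the third shows that disagreement between agents can raise uncertainty again even after it has been driven down.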

Challenges we ran into

Building an explainable multi-agent system surfaced engineering and product challenges beyond model inference:

  • Keeping agents independent while preventing contradictory drift
  • Managing resumable workflow state across interruptions and feedback loops
  • Preventing silent fallbacks that mask real runtime/model failures
  • Balancing latency against evidence depth in retrieval and critique
  • Representing uncertainty in ways clinicians can actually act on

We also had to harden error handling so that failures surface in the UI and audit trail instead of being hidden behind placeholder outputs that look like real results.
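One way to avoid silent fallbacks is to have every agent call return an explicit result record that carries either an output or an error, never a substitute answer. The sketch below assumes a hypothetical `AgentResult` type and `run_agent` wrapper; it is not the project's actual error-handling code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    """Hypothetical record written to the trace for every agent invocation."""
    agent: str
    ok: bool
    output: Optional[dict] = None
    error: Optional[str] = None


def run_agent(name: str, fn: Callable[[dict], dict], payload: dict) -> AgentResult:
    """Call an agent and record the failure explicitly instead of faking an output."""
    try:
        return AgentResult(agent=name, ok=True, output=fn(payload))
    except Exception as exc:
        # No default diagnosis here: the error itself becomes the trace entry.
        return AgentResult(agent=name, ok=False,
                           error=f"{type(exc).__name__}: {exc}")


def broken_agent(payload: dict) -> dict:
    raise RuntimeError("model weights not loaded")


result = run_agent("radiologist", broken_agent, {"image_id": "example"})
```

A UI or audit log can then branch on `result.ok` and display `result.error` verbatim, so a model failure cannot be mistaken for a low-confidence finding.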

Accomplishments that we're proud of

VERIFAI moved from concept to a working, inspectable diagnostic system with real operational controls.

  • End-to-end multi-agent pipeline with auditable reasoning steps
  • Human-in-the-loop rejection/approval with workflow resume
  • Real-time trace streaming from backend to frontend
  • Chefer heatmap implementation integrated into the visual evidence workflow
  • Evidence-centric output instead of confidence-only output
  • Improved reliability through explicit failure surfacing and safer state handling

Most importantly, we built for clinical trust: every important claim can be traced, challenged, and revised.

What we learned

This project reinforced that high-performing AI is not enough for high-stakes deployment.

  • Transparency is foundational, not optional
  • Multi-agent design improves robustness only with strong orchestration discipline
  • Error visibility is a safety feature
  • Human feedback loops must be first-class, not an afterthought
  • Structured observability dramatically accelerates iteration and debugging

We also learned that uncertainty needs to be operationalized, not merely reported.

What's next for VERIFAI

Our roadmap focuses on deeper clinical validity, safer deployment, and stronger human-AI collaboration.

  • Improve calibration across rare and ambiguous pathologies
  • Expand longitudinal patient-context integration from FHIR timelines
  • Upgrade retrieval quality scoring and evidence ranking
  • Add richer clinician controls for guided reruns and rationale comparison
  • Run broader benchmarking and prospective pilot evaluations

Long term, we want VERIFAI to function as a clinically trusted second reader that is transparent by design and continuously improvable through expert feedback.
