Inspiration

Rare diseases affect more than 300 million people worldwide, yet the average patient waits over five years for an accurate diagnosis. In many cases, the issue is not a lack of medical knowledge, but a lack of direction. Even with modern tools that can generate a ranked list of candidate diseases from HPO phenotype terms, clinicians are still left asking: What should I evaluate next?

We were especially struck by how this uncertainty disproportionately affects patients in under-resourced settings. A recommendation to “order whole exome sequencing” or “obtain an MRI” may be clinically sound but practically inaccessible. Meanwhile, subtle but highly informative history questions or physical exam findings may go unasked. CheckMate was inspired by the idea that diagnosis should not just be about predicting the right disease; it should be about identifying the right next move. We wanted to transform differential diagnosis from a static ranking problem into a dynamic decision process.

What it does

CheckMate is an equity-aware engine for rare disease diagnosis. Given a patient's observed phenotype terms (HPO codes), it performs three core functions:

1. Generates a calibrated probabilistic disease ranking. CheckMate builds on DeepRare [7], the current state-of-the-art rare disease diagnostic system published in Nature in February 2026. DeepRare integrates HPO terms, free-text clinical notes, and genomic data to produce a ranked list of candidate rare diseases with traceable reasoning chains. CheckMate takes DeepRare's structured differential as a starting point and feeds it into a Partial Variational Autoencoder (Partial VAE) [2, 6]. The VAE models the patient's case as a latent probability distribution rather than a fixed label, producing a calibrated uncertainty-aware differential that reflects how confident the system truly is given the quality and completeness of available patient data. DeepRare is used only at inference time; the VAE is trained independently on synthetic patient data derived from the Human Phenotype Ontology Annotation database (HPOA) [4].

2. Determines the most informative next phenotype to assess. There are approximately 19,000 HPO phenotypes that might not yet be documented in a patient's record. For each unobserved phenotype, CheckMate's information gain (IG) engine simulates two futures: one in which the clinician checks for it and finds it present, and one in which it is absent. It measures how much each outcome would shift the disease differential using KL divergence, then weights those shifts by the VAE's predicted probability of each outcome:

IG(j) = p_pos × KL(P_now ‖ P_present) + (1 − p_pos) × KL(P_now ‖ P_absent)

The phenotype with the highest expected information gain is recommended as the next acquisition step [2].
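The IG formula above can be sketched in a few lines of NumPy. This is a minimal illustration with a toy three-disease differential; the probabilities are made up for the example, not outputs of the actual model:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete disease distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def information_gain(p_now, p_present, p_absent, p_pos):
    """Expected shift in the differential if phenotype j were assessed.

    p_now     -- current posterior over diseases
    p_present -- posterior if phenotype j is found present
    p_absent  -- posterior if phenotype j is found absent
    p_pos     -- model's predicted probability that j is present
    """
    return (p_pos * kl_divergence(p_now, p_present)
            + (1 - p_pos) * kl_divergence(p_now, p_absent))

# Toy 3-disease differential: checking this phenotype would sharpen it.
p_now     = np.array([0.5, 0.3, 0.2])
p_present = np.array([0.8, 0.15, 0.05])
p_absent  = np.array([0.2, 0.4, 0.4])
ig = information_gain(p_now, p_present, p_absent, p_pos=0.4)
```

In the full system this computation is batched over every unobserved HPO term, and the candidate with the highest `ig` is recommended next.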

3. Adjusts recommendations for equity when records are sparse. For patients with sparse documentation, the IG engine divides each candidate's information gain by its clinical cost tier before ranking:

Score(j) = IG(j) / cost_tier(j)

This means a free clinical observation with slightly lower raw IG can outrank an expensive genetic test, prioritizing accessible assessments for under-documented patients. The system also applies group-conditional conformal prediction [3] to ensure that its diagnostic shortlist carries a formal 90% coverage guarantee for every patient group. Standard conformal prediction gives sparse patients only ~72% coverage against a 90% target (an 18 percentage-point gap). CheckMate's group-conditional calibration closes that gap to 1 percentage point by calibrating separate thresholds for patients stratified by documentation depth.
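The reranking step can be sketched as follows. The HPO terms are real codes used for illustration, but the IG values and cost-tier assignments here are assumptions, not CheckMate's actual clinical cost table:

```python
# Hypothetical cost tiers: 1 = history question / exam finding,
# 2 = routine lab, 3 = imaging, 4 = genetic testing.
candidates = {
    "HP:0001250": {"ig": 0.40, "cost_tier": 4},  # Seizure (workup is costly)
    "HP:0000486": {"ig": 0.35, "cost_tier": 1},  # Strabismus (exam finding)
    "HP:0001263": {"ig": 0.20, "cost_tier": 1},  # Developmental delay (history)
}

def equity_score(c):
    """Score(j) = IG(j) / cost_tier(j): cheap assessments get a boost."""
    return c["ig"] / c["cost_tier"]

ranked = sorted(candidates, key=lambda k: equity_score(candidates[k]), reverse=True)
```

Here the free exam finding (score 0.35) outranks the genetic test (score 0.10) despite the test's higher raw IG, which is exactly the behavior described above.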

How we built it

1. SHEPHERD Knowledge Graph Embeddings

We use frozen pretrained embeddings from SHEPHERD [1], a few-shot rare disease diagnosis model trained over a knowledge graph that integrates HPO phenotypes, OMIM diseases, and gene relationships. These 64-dimensional embeddings encode biological and hierarchical relationships between phenotypes, diseases, and genes, giving the downstream VAE structured priors even when patients present with very few observed features. We ran SHEPHERD weight extraction as a dedicated SLURM job (0.8 min, Quadro RTX 6000) and used the resulting embeddings as fixed inputs to the VAE encoder — preserving the biological knowledge encoded in SHEPHERD's training without risk of catastrophic forgetting through fine-tuning.

2. Partial VAE for Active Acquisition

We trained a Partial Variational Autoencoder [2, 6] on 226,000 synthetic patients derived from HPOA disease–phenotype frequency tables, with 30–70% random masking at training time to simulate incomplete clinical records. The architecture uses a permutation-invariant set encoder to embed variable-length phenotype sets into a latent Gaussian distribution. From this latent state, two decoders produce outputs in parallel: a disease probability distribution over 12,971 rare disease classes (the diagnostic shortlist), and a phenotype probability distribution over unobserved HPO terms (the acquisition signal). The phenotype decoder is the novel contribution that enables IG computation: standard disease prediction systems do not model which phenotypes are likely to be present given the current chart, and therefore cannot compute which assessment would be most informative.
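The forward pass described above can be sketched with toy dimensions and random untrained weights. Sum pooling stands in for the permutation-invariant set encoder, and the sizes are shrunk for illustration (the real model uses 64-dimensional SHEPHERD embeddings, 12,971 disease classes, and learned networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only.
EMB, LATENT, N_DIS, N_PHE = 8, 4, 10, 20

phen_emb = rng.normal(size=(N_PHE, EMB))            # frozen phenotype embeddings
W_mu = rng.normal(size=(EMB, LATENT))               # encoder head: mean
W_lv = rng.normal(size=(EMB, LATENT))               # encoder head: log-variance
W_dis = rng.normal(size=(LATENT, N_DIS))            # disease decoder
W_phe = rng.normal(size=(LATENT, N_PHE))            # phenotype decoder

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(observed_idx):
    """Encode a variable-length phenotype set; decode both heads in parallel."""
    pooled = phen_emb[observed_idx].sum(axis=0)     # permutation-invariant pooling
    mu, logvar = pooled @ W_mu, pooled @ W_lv
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT)  # reparameterization
    p_disease = softmax(z @ W_dis)                  # diagnostic shortlist
    p_phenotype = 1 / (1 + np.exp(-(z @ W_phe)))    # acquisition signal
    return p_disease, p_phenotype

p_dis, p_phe = forward([2, 5, 11])
```

The key structural point is the two parallel heads: `p_disease` feeds the calibrated differential, while `p_phenotype` supplies the per-phenotype presence probabilities (`p_pos`) that the IG engine needs.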

Training ran for 45 epochs (on an NVIDIA RTX A5500). Final validation performance: 64.2% top-1 disease accuracy and 82.5% top-5 recall across 12,971 disease classes. Post-training temperature scaling (scalar: 1.39×) yielded an Expected Calibration Error of 0.021.
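Temperature scaling itself is a one-line transform. This sketch uses the 1.39 scalar reported above but made-up logits; in practice the temperature is fitted on held-out validation data:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def apply_temperature(logits, T):
    """Divide logits by a scalar T > 1 to soften overconfident probabilities."""
    return softmax(logits / T)

logits = np.array([3.0, 1.0, 0.5])      # illustrative raw model outputs
p_raw = softmax(logits)
p_cal = apply_temperature(logits, T=1.39)
```

After scaling, the probabilities still sum to one, but the top-class confidence drops toward the model's true hit rate, which is what lowers the Expected Calibration Error.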

3. Three-Layer Equity Architecture

Rather than building a separate bias-detection model, CheckMate treats its own uncertainty as an equity signal:

  • Layer 1 — Data quality scoring: The VAE's reconstruction entropy over unobserved phenotypes serves as a data completeness score. High entropy means the model lacks the information to make reliable predictions, signalling a sparse record.

  • Layer 2 — Group-conditional conformal calibration: Instead of a single global calibration threshold, we calibrate separate conformal thresholds for three patient groups defined by documentation depth (sparse: 1–3 phenotypes, moderate: 4–6, well-documented: 7+). This guarantees ≥90% coverage within each group, not just on average, ensuring that sparse patients receive honest, larger diagnostic sets rather than falsely confident small ones [3].

  • Layer 3 — Cost-weighted acquisition: When reconstruction entropy is high (sparse record), IG recommendations are divided by a clinical cost tier before ranking, steering the system toward history questions and physical exam findings before expensive specialist investigations.
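Layer 2 can be sketched as split conformal prediction with one threshold per documentation-depth group. The calibration data below is random toy data, and the nonconformity score (one minus the probability assigned to the true disease) is a standard choice assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def group_of(n_phenotypes):
    """Stratify by documentation depth: sparse 1-3, moderate 4-6, well-documented 7+."""
    return "sparse" if n_phenotypes <= 3 else "moderate" if n_phenotypes <= 6 else "rich"

def calibrate(cal_probs, cal_labels, cal_groups, alpha=0.10):
    """Fit one conformal threshold per group from held-out calibration cases."""
    thresholds = {}
    for g in set(cal_groups):
        idx = [i for i, gi in enumerate(cal_groups) if gi == g]
        scores = np.array([1.0 - cal_probs[i][cal_labels[i]] for i in idx])
        n = len(scores)
        q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
        thresholds[g] = np.quantile(scores, q)
    return thresholds

def prediction_set(probs, group, thresholds):
    """Diagnostic shortlist: all diseases whose score clears the group's threshold."""
    return [d for d, p in enumerate(probs) if 1.0 - p <= thresholds[group]]

# Toy calibration split: 200 cases, 5 diseases, random documentation depth.
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = rng.integers(0, 5, size=200)
cal_groups = [group_of(int(k)) for k in rng.integers(1, 10, size=200)]
thr = calibrate(cal_probs, cal_labels, cal_groups)
S = prediction_set(cal_probs[0], cal_groups[0], thr)
```

Because each group gets its own quantile, sparse-record patients receive a larger (more honest) shortlist rather than inheriting a threshold dominated by well-documented cases.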

We optimized for computational feasibility by batching KL computations and restricting the candidate phenotype pool to terms with meaningful probability mass under the decoder. The full pipeline ran across 5 SLURM jobs on Brown University's Oscar HPC cluster.

Challenges we ran into

One major challenge was limited documentation and missing core datasets in the original DeepRare release. The published model did not include several foundational resources required for full reproducibility and optimal accuracy, so we had to engineer workarounds and rebuild parts of the pipeline ourselves. Another challenge was data availability and structure: high-quality, labeled rare disease datasets are scarce, so we had to simulate and carefully structure synthetic patient data derived from real phenotype frequency tables to train our models properly. Finally, training time and convergence were significant hurdles within the hackathon timeframe. Maintaining performance while optimizing model stability and computational efficiency required careful tuning of the model architecture and hyperparameters, along with Brown’s supercomputing resources.

Accomplishments that we're proud of

We are proud that we were able to build directly on the latest rare disease AI and push it one step further to fully align with the mission of this hackathon. Rather than starting from scratch, we integrated state-of-the-art diagnostic technology and extended it into something more actionable: transforming disease ranking into a next-step decision engine.

We are also proud that we trained our own machine learning models on Brown’s supercomputing cluster. This allowed us to go beyond theoretical design and actually implement and evaluate a working system capable of generating next-step recommendations. Training the Partial VAE at scale on frozen SHEPHERD graph embeddings within the hackathon timeframe was a major technical milestone for our team.

Most importantly, we are proud of our novel approach to embedding equity directly into the decision-making process. Instead of treating fairness as an afterthought, we built uncertainty-aware calibration and accessibility-aware recommendation steering into the core architecture. The result is not just a smarter diagnostic tool, but one that actively works to ensure that limited documentation or limited resources do not translate into worse recommendations.

What we learned

Building CheckMate taught us that rare disease AI operates under fundamentally different constraints than standard machine learning. With 12,971 disease classes, many with only a handful of annotated cases, traditional supervised assumptions break down, making synthetic training data derived from annotation frequency tables a practical necessity. We also learned that biological priors matter more than sheer model complexity: a moderately sized VAE built on pretrained SHEPHERD embeddings outperformed our custom GNN despite the latter’s strong link-prediction metrics, underscoring the value of real-patient signal embedded in knowledge graphs. Implementing information-theoretic acquisition via the EDDI framework proved theoretically elegant but highly sensitive to small coding errors: subtle mistakes in acquisition weights could generate plausible yet incorrect recommendations. Finally, we recognized that calibration is essential in clinical contexts. Accuracy alone is insufficient if uncertainty is misrepresented, making temperature scaling and conformal prediction critical for safe deployment.

What's next for CheckMate

The next phase of CheckMate focuses on real-world validation and clinical integration. Our top priority is testing the full step-by-step acquisition loop on real patient data to measure how many steps it takes to reach a diagnosis, the total cost per case, and whether outcomes remain equitable across different patient groups. We also plan to externally calibrate our conformal prediction thresholds using established RareBench datasets to ensure robustness beyond our synthetic training data. To validate the acquisition engine itself, we will measure how well predicted information gain aligns with actual posterior shifts when true phenotype outcomes are observed. On the implementation side, mapping recommended phenotypes to standardized LOINC procedure codes will make outputs directly usable in clinical workflows, and integrating with HL7 FHIR-based EHR systems will allow CheckMate to operate on real patient records instead of manually entered phenotype lists.

Built With
