Inspiration

I audited 98 studies cited in the NCCN Kidney Cancer Guidelines v3.2022 with MIT Critical Data — 44,636 patients total. North America was enrolled at 2.07× its global disease burden, Africa at 0.07×, Black patients at roughly half their burden rate, and 81% of studies lacked usable race data. The bias was not discovered during design. It was documented after publication.

TrialTwin moves that check to the start: simulate who will actually enroll before the protocol is locked.

What it does

TrialTwin is a pre-recruitment simulator, not a predictor. You configure disease, target N, total sites, countries with weights, and inclusion criteria. The engine runs 10,000 Monte Carlo simulations of site activation, screening, enrollment, and dropout.

For each region I compute:

$$ RQ_r = \dfrac{\text{projected enrollment share}_r}{\text{global incidence share}_r} $$

The aggregate Representation Equity Score is a weighted geometric mean that stays finite when a region's RQ approaches zero:

$$ RES = \exp\left(\sum_i w_i \ln(RQ_i + \varepsilon)\right) - \varepsilon \quad\text{where } w_i = \text{incidence share}_i,\ \varepsilon = 10^{-4} $$
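
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the engine's actual code: the function names are assumptions, and the regional shares are made-up numbers chosen only to show that RES stays finite when one region enrolls nobody.

```python
import math

EPSILON = 1e-4  # epsilon from the RES formula above


def representation_quotients(enroll_share, incidence_share):
    """RQ_r = projected enrollment share / global incidence share, per region."""
    return {r: enroll_share[r] / incidence_share[r] for r in incidence_share}


def representation_equity_score(rq, incidence_share, eps=EPSILON):
    """Weighted geometric mean of (RQ + eps), minus eps; weights are incidence shares."""
    log_sum = sum(w * math.log(rq[r] + eps) for r, w in incidence_share.items())
    return math.exp(log_sum) - eps


# Hypothetical shares for illustration only (each dict sums to 1.0).
incidence = {"North America": 0.20, "Europe": 0.30, "Asia": 0.40, "Africa": 0.10}
enrollment = {"North America": 0.45, "Europe": 0.35, "Asia": 0.20, "Africa": 0.00}

rq = representation_quotients(enrollment, incidence)
res = representation_equity_score(rq, incidence)  # finite even though Africa's RQ is 0
```

When enrollment matches incidence exactly, every RQ is 1 and the score is 1; the epsilon shift cancels because the weights sum to 1.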

I report:

  • Jensen-Shannon divergence from the incidence-proportionate baseline
  • Empirical chi-square exceedance rate across runs (descriptive, not inferential — regional shares are compositional)
  • Zero-enrollment rate per region
  • Simulation interval (10–90%), not a confidence interval
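
The reported metrics can be sketched with the standard library alone. The helper names below are hypothetical, the base-2 logarithm is an assumption (it bounds the divergence to [0, 1]), and the chi-square critical value is left to the caller instead of being computed here.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two share vectors (base 2, range [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def chi_square_stat(counts, expected_shares):
    """Pearson statistic of one run's regional counts against the baseline shares."""
    n = sum(counts)
    return sum((o - n * s) ** 2 / (n * s) for o, s in zip(counts, expected_shares))


def exceedance_rate(stats, critical):
    """Fraction of runs whose statistic exceeds a caller-supplied critical value."""
    return sum(1 for s in stats if s > critical) / len(stats)


def zero_enrollment_rate(counts_per_run):
    """Fraction of runs in which a region enrolled no patients."""
    return sum(1 for c in counts_per_run if c == 0) / len(counts_per_run)


def simulation_interval(values, lo=0.10, hi=0.90):
    """Nearest-rank 10th and 90th percentiles of a per-run metric."""
    ordered = sorted(values)
    n = len(ordered)

    def pick(q):
        return ordered[min(n - 1, max(0, math.ceil(q * n) - 1))]

    return pick(lo), pick(hi)
```

The exceedance rate is only a count over simulated runs, which is why it is reported as descriptive and not as a test result.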

Recommendations are scored computationally: each candidate country is tested with 1,000-run mini-simulations and ranked by \(\Delta RES \times \text{feasibility}\).
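
A rough sketch of that ranking step, assuming feasibility is a number in [0, 1] and that the mini-simulation is available as a callable returning the RES obtained with the candidate added. Both the `rank_candidates` name and the `res_with_country` callable are placeholders, and the country codes and values are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple


def rank_candidates(
    baseline_res: float,
    feasibility: Dict[str, float],              # country code -> feasibility (assumed 0..1)
    res_with_country: Callable[[str], float],   # stand-in for a 1,000-run mini-simulation
) -> List[Tuple[str, float]]:
    """Score each candidate by delta-RES x feasibility and sort best first."""
    scored = []
    for code, feas in feasibility.items():
        delta_res = res_with_country(code) - baseline_res
        scored.append((code, delta_res * feas))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented numbers: a stub lookup replaces the real mini-simulation.
stub_res = {"NG": 0.55, "BR": 0.50, "IN": 0.48}
ranking = rank_candidates(
    baseline_res=0.40,
    feasibility={"NG": 0.4, "BR": 0.9, "IN": 0.8},
    res_with_country=stub_res.__getitem__,
)
```

In this toy case the largest raw RES gain (NG) ranks last because its feasibility is low, which is the trade-off the product term is meant to capture.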

Supports 20 diseases (2 audit-calibrated: Kidney Cancer C64, Hypertension I10-I15; 18 generic with GLOBOCAN 2022 incidence) across a 43-country whitelist.

How I built it

Frontend: Next.js 14 App Router, Tailwind, Zustand, Recharts + D3 choropleth, React Hook Form + Zod. Scientific editorial UI.

Backend: FastAPI, Python 3.11, NumPy + SciPy vectorized core, ProcessPoolExecutor for CPU-bound runs, Server-Sent Events for live progress, SQLite for run persistence.

Data layer (local JSON only):

  • disease_priors.json — regional incidence shares normalized to 1.0 for 20 diseases
  • enrollment_priors.json — country accessibility indices derived from the 98-study audit
  • demographic_by_disease.json — sex/age priors (US SEER 2015–2021 gated to US-inclusive trials)
  • audit_98_studies.json — source priors for RCC/Hypertension

Engine per run: largest-remainder site allocation → per-site Bernoulli activation → Poisson patient flow → pre-sampling inclusion filter → demographic sampling (truncated normal for age, binomial for sex) → target-N cap via downsampling → RQ/RES.
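
A compressed sketch of one run, covering allocation, activation, patient flow, and the target-N cap (demographic sampling and the inclusion filter are left out). It assumes a single activation probability and enrollment rate per country, so the per-site Bernoulli draws are summed as one binomial per country; the function names and all numbers are illustrative, not the engine's own.

```python
import numpy as np


def largest_remainder(total_sites, weights):
    """Split total_sites across countries so the integer counts sum exactly to total_sites."""
    w = np.asarray(weights, dtype=float)
    quotas = total_sites * w / w.sum()
    counts = np.floor(quotas).astype(int)
    leftover = int(total_sites - counts.sum())
    # Hand the remaining sites to the largest fractional remainders (ties keep input order).
    order = np.argsort(-(quotas - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts


def simulate_run(rng, total_sites, weights, p_activate, rate_per_site, target_n):
    """One Monte Carlo run; returns enrolled patients per country, capped at target_n."""
    sites = largest_remainder(total_sites, weights)
    active = rng.binomial(sites, p_activate)        # sum of per-site Bernoulli activations
    enrolled = rng.poisson(active * rate_per_site)  # Poisson patient flow per country
    if enrolled.sum() > target_n:
        # Target-N cap: keep target_n patients drawn uniformly without replacement.
        enrolled = rng.multivariate_hypergeometric(enrolled, int(target_n))
    return enrolled


rng = np.random.default_rng(42)
enrolled = simulate_run(rng, 200, [0.5, 0.3, 0.15, 0.05], 0.8, 10.0, 300)
shares = enrolled / max(enrolled.sum(), 1)  # projected enrollment shares feeding RQ/RES
```

Downsampling with a multivariate hypergeometric draw keeps the cap unbiased: each enrolled patient has the same chance of being retained regardless of country.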

LLM is off the critical path: rules-based pycountry parser with alias map (GB/UK, KR/Republic of Korea, CZ/Czechia, TR/Turkiye, AE/UAE, VN/Viet Nam), template-first summary, Ollama/Qwen only polishes the 2-sentence rationale.
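
The deterministic part of that parser can be sketched without any LLM or third-party dependency. The version below skips pycountry entirely and uses a plain dictionary; the alias entries come from the list above, while the whitelist shown is a hypothetical subset (the full 43 codes are not reproduced here) and `normalize_country` is an assumed name.

```python
import re
import unicodedata

# Aliases named in the text, mapped to ISO 3166-1 alpha-2 codes.
ALIASES = {
    "UK": "GB",
    "REPUBLIC OF KOREA": "KR",
    "CZECHIA": "CZ",
    "TURKIYE": "TR",
    "UAE": "AE",
    "VIET NAM": "VN",
}

# Hypothetical subset of the strict whitelist, for illustration only.
WHITELIST = {"GB", "KR", "CZ", "TR", "AE", "VN", "US"}


def normalize_country(token):
    """Map a free-text token to a whitelisted ISO code, or None if it is not recognized."""
    # Strip accents so that e.g. "Türkiye" matches the "TURKIYE" alias.
    decomposed = unicodedata.normalize("NFKD", token)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = re.sub(r"\s+", " ", ascii_like.strip().upper())
    code = ALIASES.get(key, key)
    return code if code in WHITELIST else None
```

Returning `None` for anything outside the whitelist keeps the failure explicit instead of letting a guessed code reach the simulator.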

Challenges I ran into

  • Statistical theater: ANOVA and Kruskal-Wallis on my own draws returned \(p < 0.001\) by construction. Replaced with JSD and empirical chi-square exceedance.
  • RES collapse: Harmonic mean drove RES to ∼0 when Africa RQ ≈ 0.07. Switched to weighted geometric mean with epsilon.
  • Mislabeling: 10th–90th percentiles were called "CI". Relabeled to "Simulation interval (10–90%)".
  • Trial mechanics: Target N was not enforced; site activation was one Bernoulli per country; the age filter ran after sampling (a min = max = 90 window still returned a mean age of 64.3). Fixed with N-cap downsampling, per-site activation, and pre-sampling filters, including degenerate age handling.
  • Parser fragility: Free-text aliases broke ISO validation. Built deterministic alias map + strict 43-code whitelist.
  • Performance: Per-patient Python loops blocked SSE. Vectorized across runs and chunked updates.
  • Priors mismatch: Generic diseases briefly showed the RCC audit banner. Added disease-gated prior loading.
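
The age-filter fix can be illustrated with a small sampler that applies the inclusion window before drawing and handles the degenerate min = max case explicitly. This is a sketch under assumptions: the function name and the standard deviation are invented, and rejection sampling is used only for readability (an inverse-CDF sampler such as `scipy.stats.truncnorm` would suit narrow windows better).

```python
import random


def sample_ages(rng, n, mean, sd, min_age, max_age):
    """Draw n ages from a normal distribution truncated to [min_age, max_age]."""
    if min_age > max_age:
        raise ValueError("min_age must not exceed max_age")
    if min_age == max_age:
        # Degenerate window: every eligible patient has exactly this age.
        return [float(min_age)] * n
    ages = []
    while len(ages) < n:
        draw = rng.gauss(mean, sd)
        if min_age <= draw <= max_age:  # reject draws outside the inclusion window
            ages.append(draw)
    return ages


rng = random.Random(0)
only_90 = sample_ages(rng, 5, mean=64.3, sd=12.0, min_age=90, max_age=90)
adults = sample_ages(rng, 100, mean=64.3, sd=12.0, min_age=18, max_age=75)
```

With the window applied up front, a min = max = 90 criterion yields a mean age of 90 and not the unfiltered prior mean.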

Accomplishments that I'm proud of

  • Finite, interpretable RES across extreme underrepresentation (no NaN, no 0.00 collapse)
  • Honest reporting: zero-enrollment rates surfaced, race panel gated to US-only, generic-priors banner visible
  • Deterministic 43-country allocation: 200 sites = exactly 200 via largest-remainder
  • Disease switch actually swaps priors (RCC ≠ TB ≠ HCC)
  • 10,000-run jobs stream in 5–15 s on laptop hardware, with a second concurrent run supported
  • Shareable run IDs with full config and data version audit trail

What I learned

Priors are political: using published studies as a generative prior bakes historical bias into the model. Compositional outputs violate ANOVA's independence assumption, so divergence metrics are more honest than p-values. A single epsilon choice dominates RES stability. Labeling matters: calling a percentile a "confidence interval" destroys trust instantly. Rules-based parsing beats LLM cleverness for ISO codes.

What's next for TrialTwin

  1. Replace independent draws with hierarchical Dirichlet-multinomial to respect regional covariance
  2. Calibrate accessibility indices against ClinicalTrials.gov completion data, not just publications
  3. Add cost, regulatory timeline, and site capacity to recommendation scoring
  4. Export IRB-ready PDF appendix with methods, formulas, data versions, and limitations
  5. Expand audit-calibrated diseases beyond RCC and Hypertension
