SynthTrial Studio — Constraint-aware synthetic EDC with auto-repair, what-if efficacy, RBQM & CSR

Short description: Spin up realistic clinical trial datasets (no PHI), validate and auto-repair them, test YAML edit checks, explore what-ifs, and auto-draft RBQM/CSR docs, all in one Streamlit app.

Elevator pitch (≤200 chars): Clinical data is slow and error-prone. SynthTrial Studio generates realistic synthetic data, enforces constraints, tests YAML checks, runs what-ifs, and drafts RBQM/CSR.

Inspiration

Clinical data work is fragmented: generate plausible datasets, run edit checks, fix issues, monitor risk, then draft CSR snippets. We wanted a single workbench to do all of that safely with synthetic data only.

What it does

SynthTrial Studio is a Streamlit app that lets you:

Generate synthetic vitals under clinical constraints

Rules-based generator

MVN (multivariate normal) generator that preserves correlations per Visit×Arm

DDPM diffusion (optional, PyTorch) small MLP denoiser for tabular sampling

Guardrails: PHI lint blocks uploads that look like identifiers (synthetic-only).
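
The PHI lint works roughly like this minimal sketch (the exact patterns and column names in the app may differ; `phi_lint` is an illustrative name, not the app's API):

```python
import re

# Hypothetical sketch of the PHI lint: flag column names and sample values
# that look like direct identifiers before a file is accepted.
ID_COLUMN_PATTERN = re.compile(r"(name|ssn|dob|mrn|email|phone|address)", re.I)
SSN_VALUE_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def phi_lint(columns, sample_values):
    """Return a list of findings; an empty list means the upload passes."""
    findings = [f"suspicious column: {c}"
                for c in columns if ID_COLUMN_PATTERN.search(c)]
    findings += [f"SSN-like value: {v}"
                 for v in sample_values if SSN_VALUE_PATTERN.search(str(v))]
    return findings
```

It is best-effort by design: a clean result does not prove a file is de-identified, which is why the app stays synthetic-only.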

Validate & auto-repair: range checks, fever logic (Temp > 38 requires HR ≥ 67), unique keys, arm consistency, and snapping the Week-12 effect (Active − Placebo) to a target value.
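
A minimal sketch of the repair step, assuming columns named `VISIT`, `ARM`, `TEMP`, `HR`, and `SYSBP` (the real app's schema and target metric may differ):

```python
import pandas as pd

def auto_repair(df, target_effect=5.0):
    df = df.copy()
    # Fever logic: Temp > 38 must come with HR >= 67; raise HR to the floor.
    mask = (df["TEMP"] > 38) & (df["HR"] < 67)
    df.loc[mask, "HR"] = 67
    # Snap the Week-12 effect (Active − Placebo mean SYSBP) to the target
    # by shifting the Active arm uniformly.
    wk12 = df["VISIT"] == "Week 12"
    effect = (df.loc[wk12 & (df["ARM"] == "Active"), "SYSBP"].mean()
              - df.loc[wk12 & (df["ARM"] == "Placebo"), "SYSBP"].mean())
    df.loc[wk12 & (df["ARM"] == "Active"), "SYSBP"] += target_effect - effect
    return df
```

Shifting one arm uniformly preserves within-arm variance and correlations while hitting the requested effect exactly.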

Power / stats tiles: means, effect, SE, p-value (Welch/approx), Cohen’s d.
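
The SciPy-free fallback behind those tiles looks roughly like this (a normal approximation to the Welch t distribution; function name is illustrative):

```python
import math

def welch_stats(a, b):
    """Effect, SE, approximate two-sided p-value, and Cohen's d for two samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    t = (ma - mb) / se
    # Normal approximation to the Welch t statistic (used when SciPy is absent).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled_sd
    return {"effect": ma - mb, "se": se, "p": p, "cohens_d": d}
```

When SciPy is installed, the app can use the exact Welch t distribution instead; the normal approximation is conservative only for small samples.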

Oncology helper: simulate RECIST at Week-12 and compute ORR difference.
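
Computing the ORR difference from simulated RECIST categories is a one-liner once responses are coded; this sketch assumes `ARM` and `RECIST` columns with CR/PR counted as responders:

```python
import pandas as pd

def orr_difference(df):
    """ORR = fraction of subjects with best response CR or PR; returns Active − Placebo."""
    responders = df["RECIST"].isin(["CR", "PR"])
    orr = responders.groupby(df["ARM"]).mean()
    return orr.get("Active", 0.0) - orr.get("Placebo", 0.0)
```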

Edit-Check Studio (YAML): define rules (range, regex, allowed values, required visits, uniqueness, constants) → raise EDC-style queries.
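
A rule file might look like the following (the exact keys the engine accepts are illustrative, not the app's verbatim schema):

```yaml
# Hypothetical edit-check rules; each failing row raises an EDC-style query.
- id: VS001
  field: TEMP
  type: range
  min: 34.0
  max: 42.0
  message: "Temperature out of physiologic range"
- id: VS002
  field: USUBJID
  type: unique_key
  with: [VISIT]
- id: VS003
  field: ARM
  type: allowed_values
  values: [Active, Placebo]
```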

RBQM dashboard: KRIs, site QTLs, queries per 100 rows, site roll-ups, AE signals.
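
The "queries per 100 rows" KRI with a QTL breach flag reduces to a pandas roll-up like this (the `SITE` column name and the threshold of 10 are assumptions for the sketch):

```python
import pandas as pd

def site_kri(rows, queries, qtl=10.0):
    """Queries per 100 rows by site; flag sites breaching the QTL threshold."""
    n_rows = rows.groupby("SITE").size()
    n_queries = queries.groupby("SITE").size().reindex(n_rows.index, fill_value=0)
    kri = 100 * n_queries / n_rows
    return pd.DataFrame({"kri": kri, "breach": kri > qtl})
```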

What-If Simulator: drag a slider to set target effect; preview validation & plots.

Docs & exports: SDTM-like VS (TSV), CSR draft (Markdown), RBQM summary (Markdown), full ZIP with reports & docs.

TMF helpers: SIV Log, Investigator DB placeholders to round out demos.

How we built it

Frontend/App: Streamlit + Altair charts

Data/Stats: NumPy, Pandas, SciPy (optional), custom Welch/normal approx fallback

Generators:

Rules-based (simple, fast)

MVN per Visit×Arm: learn mean/cov, stabilize with εI, sample & clamp
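
The per-cell MVN step can be sketched as follows (one call per Visit×Arm cell; `fit_and_sample_mvn` and the bounds are illustrative names, not the app's exact API):

```python
import numpy as np

def fit_and_sample_mvn(X, n, eps=1e-6, lo=None, hi=None, seed=0):
    """Fit mean/cov to one Visit×Arm cell, stabilize with eps*I, sample, clamp."""
    mu = X.mean(axis=0)
    # eps*I keeps the covariance positive definite when the cell is small.
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n)
    if lo is not None:
        samples = np.clip(samples, lo, hi)  # enforce clinical ranges
    return samples
```

Fitting per cell rather than globally is what preserves the Visit×Arm correlation structure mentioned above.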

DDPM (prototype): tiny MLP ε-predictor with sinusoidal time embeddings; trains in minutes on CPU for demo-scale data
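
The sinusoidal time embedding that conditions the MLP denoiser is standard; a NumPy sketch (the embedding width of 16 is an assumption for illustration):

```python
import numpy as np

def sinusoidal_time_embedding(t, dim=16):
    """Map an integer diffusion timestep to a dim-vector of sin/cos features."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

The geometric frequency spread lets the denoiser distinguish nearby timesteps at the low end and distant ones at the high end.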

LLM CSV mode (optional): OpenAI API to produce schema-locked CSV; post-validated and regenerated on failures
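
The validate→feedback→regenerate loop is model-agnostic; here is a sketch with the LLM call stubbed out as any callable, and a hypothetical four-column schema (the app's real schema and validators differ):

```python
import csv
import io

EXPECTED = ["USUBJID", "VISIT", "ARM", "TEMP"]  # assumed schema for illustration

def validate_csv(text):
    """Return an error message suitable as LLM feedback, or None if valid."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED:
        return "header must be exactly " + ",".join(EXPECTED)
    if any(len(r) != len(EXPECTED) for r in rows[1:]):
        return "every row needs the same number of fields as the header"
    return None

def generate_with_retries(ask_llm, max_tries=3):
    """ask_llm(feedback) is any callable returning CSV text; errors are fed back."""
    feedback = None
    for _ in range(max_tries):
        text = ask_llm(feedback)
        feedback = validate_csv(text)
        if feedback is None:
            return text
    raise ValueError("LLM could not satisfy the schema: " + feedback)
```

Feeding the concrete validation error back into the next prompt is what makes the schema-locked mode converge instead of silently accepting malformed output.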

Edit Checks: YAML via PyYAML → engine raises EDC-style queries

Docs: Markdown builders for CSR & RBQM; ZIP packager for submission

Architecture (at a glance)

Data in: Generate (Rules/MVN/DDPM/LLM) or upload synthetic CSV/TSV → PHI lint.

Validate: deterministic checks + report; Auto-repair if desired.

Analyze: stats tiles, ORR helper, distribution checks vs pilot (KS & QQ).
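
The pilot-vs-synthetic distribution check uses the two-sample KS statistic; a SciPy-free sketch (with SciPy installed, `scipy.stats.ks_2samp` gives the statistic plus a p-value):

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()
```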

Monitor: YAML edit checks → RBQM KRIs/QTLs + site roll-ups.

Export: SDTM VS, CSR draft, RBQM summary, and full ZIP.

Why it’s different

Constraint-aware from the start (schema-locked CSV, clinical ranges, fever logic).

Multiple generators so teams can pick “speed vs realism”.

Edit checks + RBQM are first-class, not an afterthought.

What-if efficacy makes assumptions explicit and reproducible.

Synthetic-only by design for safe demos and method prototyping.

Challenges we ran into

Balancing realism vs. speed for hackathon-friendly training times.

Making the LLM CSV path robust: we added a validate→feedback→regenerate loop.

Packaging an end-to-end flow that still feels simple in Streamlit.

Accomplishments we’re proud of

A clean, judge-ready demo: generate → validate/repair → what-if → RBQM → export.

A tiny DDPM that works on CPU for tabular vitals with Visit×Arm conditioning.

Reusable YAML rule engine to mirror EDC queries.

What we learned

Small, well-chosen constraints dramatically improve synthetic data quality.

Site-level QTLs are easy to compute once you standardize queries & visits.

Schema-locked LLM generation is viable with strict validation & feedback.

What’s next

Expand domains: labs, concomitant meds, dosing, more oncology endpoints.

Add semi-synthetic mode (fit from user’s de-identified distributions only).

Enrich RBQM: more KRIs (e.g., lag times, protocol deviations).

One-click Streamlit Cloud deploy script + demo seed data.

Optional GPU notebook for faster DDPM experimentation.

Demo guide

Generate / Load

Try Rules (fast), MVN (correlated), or DDPM (after “Train DDPM”).

Or upload your synthetic CSV/AE TSV (PHI lint will block obvious identifiers).

Validate & Repair

Review checks; click Auto-Repair; see effect, p-value, Cohen’s d tiles.

Open Oncology ORR expander for CR/PR vs SD/PD helper.

What-If Simulator

Slide effect target to see how outcomes & validation change.

Edit Checks (YAML)

Run rules; inspect queries; download CSV of findings.

RBQM Dashboard

Inspect KRIs/QTLs and site roll-ups; download RBQM Markdown.

Export

Grab SDTM VS, CSR draft, RBQM summary, or the full submission ZIP.

Built With

python, streamlit, pandas, numpy, altair, scipy

pyyaml, torch (CPU OK), openai (optional)

Try it out

Live demo: add your Streamlit Cloud link here

GitHub: add your repo link here

Video: coming soon

Setup (local)

Python 3.10+

pip install -r requirements.txt

If you want DDPM: CPU-only Torch is fine

Windows users: install the official CPU wheel for your Python version

streamlit run app/app.py

Optional LLM path:

export OPENAI_API_KEY=... # or set in the UI

Disclaimers

All data is synthetic; do not use for clinical decisions.

PHI lint is best-effort; always ensure uploads are de-identified.
