SynthTrial Studio — Constraint-aware synthetic EDC with auto-repair, what-if efficacy, RBQM & CSR
Short description: Spin up realistic clinical trial datasets (no PHI), validate & auto-repair, test YAML edit checks, explore what-ifs, and auto-draft RBQM/CSR docs, all in one Streamlit app.
Elevator pitch (≤200 chars): Clinical data work is slow and error-prone. SynthTrial Studio generates realistic synthetic data, enforces constraints, tests YAML checks, runs what-ifs, and drafts RBQM/CSR docs.
Inspiration
Clinical data work is fragmented: generate plausible datasets, run edit checks, fix issues, monitor risk, then draft CSR snippets. We wanted a single workbench to do all of that safely with synthetic data only.
What it does
SynthTrial Studio is a Streamlit app that lets you:
Generate synthetic vitals under clinical constraints
Rules-based generator
MVN (multivariate normal) generator that preserves correlations per Visit×Arm
DDPM diffusion (optional, PyTorch): a small MLP denoiser for tabular sampling
Guardrails: a PHI lint blocks uploads that appear to contain identifiers (the app is synthetic-only).
Validate & auto-repair: ranges, fever logic (Temp > 38 requires HR ≥ 67), unique keys, arm consistency, and snapping the Week-12 effect (Active − Placebo) to a target.
Power / stats tiles: means, effect, SE, p-value (Welch/approx), Cohen’s d.
Oncology helper: simulate RECIST at Week-12 and compute ORR difference.
Edit-Check Studio (YAML): define rules (range, regex, allowed values, required visits, uniqueness, constants) → raise EDC-style queries.
RBQM dashboard: KRIs, site QTLs, queries per 100 rows, site roll-ups, AE signals.
What-If Simulator: drag a slider to set target effect; preview validation & plots.
Docs & exports: SDTM-like VS (TSV), CSR draft (Markdown), RBQM summary (Markdown), full ZIP with reports & docs.
TMF helpers: SIV Log, Investigator DB placeholders to round out demos.
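The power/stats tiles above can be sketched with a SciPy-free Welch comparison. This is a minimal illustration of the fallback path, not the app's exact code; the function name and return shape are assumptions:

```python
import numpy as np
from math import erf, sqrt

def welch_summary(active, placebo):
    """Welch two-sample comparison with a normal-approximation p-value,
    plus Cohen's d (pooled-SD flavor). Mirrors the app's stats tiles:
    means, effect, SE, p-value, d."""
    a = np.asarray(active, float)
    p = np.asarray(placebo, float)
    effect = a.mean() - p.mean()                              # Active − Placebo
    se = sqrt(a.var(ddof=1) / len(a) + p.var(ddof=1) / len(p))  # Welch SE
    z = effect / se
    # two-sided p via the normal approximation (no SciPy required)
    pval = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    pooled_sd = sqrt((a.var(ddof=1) + p.var(ddof=1)) / 2)
    d = effect / pooled_sd                                    # Cohen's d
    return effect, se, pval, d
```

When SciPy is available, `scipy.stats.ttest_ind(..., equal_var=False)` gives the exact Welch t-distribution p-value instead of the normal approximation.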
How we built it
Frontend/App: Streamlit + Altair charts
Data/Stats: NumPy, Pandas, SciPy (optional), custom Welch/normal approx fallback
Generators:
Rules-based (simple, fast)
MVN per Visit×Arm: learn mean/cov, stabilize with εI, sample & clamp
DDPM (prototype): tiny MLP ε-predictor with sinusoidal time embeddings; trains in minutes on CPU for demo-scale data
LLM CSV mode (optional): OpenAI API to produce schema-locked CSV; post-validated and regenerated on failures
Edit Checks: YAML via PyYAML → engine raises EDC-style queries
Docs: Markdown builders for CSR & RBQM; ZIP packager for submission
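The MVN generator's fit → stabilize → sample → clamp loop can be sketched as below. Column names (`Temp`, `HR`, `SBP`) and the clamp-to-observed-range strategy are illustrative assumptions, not the app's exact schema:

```python
import numpy as np
import pandas as pd

def fit_sample_mvn(df, cols=("Temp", "HR", "SBP"), eps=1e-6, n=100, seed=0):
    """Per Visit×Arm group: learn mean/covariance of the vitals columns,
    stabilize the covariance with eps*I, then sample and clamp each
    column to the range observed in that group."""
    rng = np.random.default_rng(seed)
    out = []
    for (visit, arm), grp in df.groupby(["Visit", "Arm"]):
        x = grp[list(cols)].to_numpy(float)
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + eps * np.eye(len(cols))  # εI stabilizer
        samples = rng.multivariate_normal(mu, cov, size=n)
        samples = np.clip(samples, x.min(axis=0), x.max(axis=0))  # clamp
        block = pd.DataFrame(samples, columns=list(cols))
        block["Visit"], block["Arm"] = visit, arm
        out.append(block)
    return pd.concat(out, ignore_index=True)
```

Fitting per Visit×Arm cell is what preserves the correlations the rules-based generator flattens.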
Architecture (at a glance)
Data in: Generate (Rules/MVN/DDPM/LLM) or upload synthetic CSV/TSV → PHI lint.
Validate: deterministic checks + report; Auto-repair if desired.
Analyze: stats tiles, ORR helper, distribution checks vs pilot (KS & QQ).
Monitor: YAML edit checks → RBQM KRIs/QTLs + site roll-ups.
Export: SDTM VS, CSR draft, RBQM summary, and full ZIP.
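The "distribution checks vs pilot" step reduces to a two-sample Kolmogorov–Smirnov statistic (the maximum gap between empirical CDFs). A SciPy-free sketch; with SciPy installed, `scipy.stats.ks_2samp` does the same plus a p-value:

```python
import numpy as np

def ks_statistic(sample, pilot):
    """Two-sample KS statistic: evaluate both empirical CDFs on the
    pooled, sorted values and take the largest absolute gap."""
    sample = np.sort(np.asarray(sample, float))
    pilot = np.sort(np.asarray(pilot, float))
    xs = np.sort(np.concatenate([sample, pilot]))
    cdf_s = np.searchsorted(sample, xs, side="right") / len(sample)
    cdf_p = np.searchsorted(pilot, xs, side="right") / len(pilot)
    return float(np.abs(cdf_s - cdf_p).max())
```

A statistic near 0 means the synthetic column tracks the pilot distribution; a large value flags drift worth inspecting on the QQ plot.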
Why it’s different
Constraint-aware from the start (schema-locked CSV, clinical ranges, fever logic).
Multiple generators so teams can pick “speed vs realism”.
Edit checks + RBQM are first-class, not an afterthought.
What-if efficacy makes assumptions explicit and reproducible.
Synthetic-only by design for safe demos and method prototyping.
Challenges we ran into
Balancing realism vs. speed for hackathon-friendly training times.
Making the LLM CSV path robust: we added a validate→feedback→regenerate loop.
Packaging an end-to-end flow that still feels simple in Streamlit.
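The validate→feedback→regenerate loop is easier to see with pluggable callables. This is a generic sketch; in the app, `generate` wraps the schema-locked OpenAI CSV prompt and `validate` is the deterministic checker:

```python
def generate_with_repair(generate, validate, max_attempts=3):
    """Ask `generate(feedback)` for a candidate, run `validate(candidate)`
    (returns a list of error strings; empty means pass), and feed any
    errors back into the next generation attempt."""
    feedback = ""
    errors = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        errors = validate(candidate)
        if not errors:
            return candidate
        # fold the validation findings back into the next prompt
        feedback = "Fix these issues:\n" + "\n".join(errors)
    raise RuntimeError(f"Still invalid after {max_attempts} attempts: {errors}")
```

Keeping the loop agnostic to the model means the same harness also post-validates the rules/MVN/DDPM paths.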
Accomplishments we’re proud of
A clean, judge-ready demo: generate → validate/repair → what-if → RBQM → export.
A tiny DDPM that works on CPU for tabular vitals with Visit×Arm conditioning.
Reusable YAML rule engine to mirror EDC queries.
What we learned
Small, well-chosen constraints dramatically improve synthetic data quality.
Site-level QTLs are easy to compute once you standardize queries & visits.
Schema-locked LLM generation is viable with strict validation & feedback.
What’s next
Expand domains: labs, concomitant meds, dosing, more oncology endpoints.
Add semi-synthetic mode (fit from user’s de-identified distributions only).
Enrich RBQM: more KRIs (e.g., lag times, protocol deviations).
One-click Streamlit Cloud deploy script + demo seed data.
Optional GPU notebook for faster DDPM experimentation.
Demo guide
Generate / Load
Try Rules (fast), MVN (correlated), or DDPM (after “Train DDPM”).
Or upload your own synthetic CSV or AE TSV (the PHI lint will block obvious identifiers).
Validate & Repair
Review checks; click Auto-Repair; see effect, p-value, Cohen’s d tiles.
Open Oncology ORR expander for CR/PR vs SD/PD helper.
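The auto-repair step for the fever rule stated earlier (Temp > 38 needs HR ≥ 67) can be sketched as a vectorized fix-up. Column names and the raise-HR-to-floor repair strategy are illustrative assumptions:

```python
import pandas as pd

def repair_fever_logic(df, temp_col="Temp", hr_col="HR", hr_floor=67):
    """Find rows where Temp > 38 but HR is below the floor, and raise
    HR to the floor. Returns the repaired frame and the repair count."""
    fixed = df.copy()
    mask = (fixed[temp_col] > 38) & (fixed[hr_col] < hr_floor)
    fixed.loc[mask, hr_col] = hr_floor
    return fixed, int(mask.sum())
```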
What-If Simulator
Slide effect target to see how outcomes & validation change.
Edit Checks (YAML)
Run rules; inspect queries; download CSV of findings.
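The YAML rule engine boils down to: parse rules, evaluate each against the frame, emit one query per failing row. A minimal sketch with two rule types; the rule schema shown (what `yaml.safe_load` would return for a small rules file) is illustrative, not the app's exact format:

```python
import pandas as pd

# Illustrative parsed rules, i.e. what PyYAML's safe_load would return
# for a small YAML rules file with a range rule and an allowed-values rule.
RULES = {
    "rules": [
        {"id": "VS001", "column": "Temp", "type": "range", "min": 34.0, "max": 42.0},
        {"id": "VS002", "column": "Arm", "type": "allowed", "values": ["Active", "Placebo"]},
    ]
}

def run_checks(df, rules):
    """Raise one EDC-style query dict per row that violates a rule."""
    queries = []
    for rule in rules["rules"]:
        col = rule["column"]
        if rule["type"] == "range":
            bad = df.index[(df[col] < rule["min"]) | (df[col] > rule["max"])]
        elif rule["type"] == "allowed":
            bad = df.index[~df[col].isin(rule["values"])]
        else:
            continue  # other rule types (regex, uniqueness, ...) omitted here
        queries += [{"rule": rule["id"], "row": int(i), "column": col} for i in bad]
    return queries
```

The query dicts convert straight to the downloadable findings CSV via `pd.DataFrame(queries)`.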
RBQM Dashboard
Inspect KRIs/QTLs and site roll-ups; download RBQM Markdown.
Export
Grab SDTM VS, CSR draft, RBQM summary, or the full submission ZIP.
Built With
python, streamlit, pandas, numpy, altair, scipy
pyyaml, torch (CPU OK), openai (optional)
Try it out
Live demo: add your Streamlit Cloud link here
GitHub: add your repo link here
Video: coming soon
Setup (local)
Python 3.10+
pip install -r requirements.txt
If you want DDPM: CPU-only Torch is fine
Windows users: install the official CPU wheel for your Python version
streamlit run app/app.py
Optional LLM path:
export OPENAI_API_KEY=... # or set in the UI
Disclaimers
All data is synthetic; do not use for clinical decisions.
PHI lint is best-effort; always ensure uploads are de-identified.