Inspiration

Every single day, someone at a pharma company pastes a confidential clinical trial document into ChatGPT.

A medical writer is cleaning up grammar. A pharmacovigilance specialist is drafting a safety report. A regulatory affairs manager is getting help with an FDA letter. A data analyst is uploading interim efficacy results to get a chart description.

58% of front-line health staff use unapproved AI tools for work. 44% admit to including identifiable patient data at least occasionally.

The existing options are all broken. Block AI tools entirely, and staff route around you on personal devices. Sign an enterprise BAA, and the sanctioned tools lag so far behind that people still use shadow AI for the hard stuff. Run regex-based DLP, and it either over-redacts and destroys clinical meaning, or under-redacts and leaks.

But here's the part nobody talks about: compliance training focuses on the 18 HIPAA Safe Harbor identifiers — names, SSNs, MRNs. The most damaging leakage in clinical trials isn't PHI at all. It's Material Non-Public Information. Compound codenames that reveal a company's pipeline strategy. Interim efficacy readouts that could move stock prices by billions. Amendment rationales that signal safety problems before the sponsor announces them.

No regex catches "ORR of 47% in the 200mg arm versus 22% in control." That sentence has no patient identifiers. It is worth billions. And it just left your machine.

That's the gap GhostDraft fills.

What It Does

GhostDraft is a clinical privacy workspace. It sits between your staff and cloud AI — and it makes sure nothing sensitive ever leaves your machine.

It intercepts every document, strips all sensitive information on-device, sends only a safe anonymized version to the cloud, gets the response back, and re-injects the original values locally before showing the answer. The user gets a useful AI response. The cloud never sees the real data.

For Reviewers (pharmacovigilance specialists and medical writers): Paste an SAE narrative. GhostDraft detects every sensitive entity in real time (patient IDs, compound codes, site numbers, efficacy values, amendment references), all flagged before a single character leaves the machine. Click Assemble, and it builds a full multi-track case timeline: event severity over time, dosing markers, concomitant medications, lab values, and a WHO-UMC causality verdict. The right pane plots the case against every other adverse event in the study window, auto-detects density clusters, and generates recommended safety actions.

For Analysts (clinical data scientists and statisticians): A virtualized dataset table with privacy-aware cell highlighting. A chat assistant backed by the full privacy pipeline. And a dashboard generator: type a natural language prompt, get a live chart grid back. The cloud receives only aggregate statistics, never raw rows.
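As a rough illustration of the "aggregate statistics, never raw rows" idea, here is a minimal sketch of summarizing a dataset locally before anything is sent out. The function name, the output shape, and the small-cell suppression threshold are all assumptions made for this example, not GhostDraft's actual code.

```python
import statistics
from typing import Any

# Hypothetical threshold: columns with fewer values than this are suppressed,
# because aggregates over tiny groups can reveal individual rows.
MIN_CELL_SIZE = 5


def summarize_columns(
    rows: list[dict[str, Any]],
    numeric_columns: list[str],
    min_cell_size: int = MIN_CELL_SIZE,
) -> dict[str, Any]:
    """Reduce raw rows to per-column aggregates that could be sent to a cloud model."""
    summary: dict[str, Any] = {"row_count": len(rows), "columns": {}}
    for column in numeric_columns:
        # Skip missing values; raw values never appear in the returned summary.
        values = [float(row[column]) for row in rows if row.get(column) is not None]
        if len(values) < min_cell_size:
            summary["columns"][column] = {"n": len(values), "suppressed": True}
            continue
        summary["columns"][column] = {
            "n": len(values),
            "mean": statistics.fmean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

The sketch deliberately leaves out min and max, since those are raw values from individual rows.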

The forensic dock at the bottom of every screen shows exactly what happened on every call: what was sent to the cloud (placeholders only), what came back, and what the user sees (original values restored). A live differential privacy budget bar tracks ε consumption. When the session budget is exhausted, the system hard-refuses further requests rather than silently degrading.
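The hard-refusal behavior can be sketched as a small budget tracker. This is a simplified illustration: it uses plain additive composition of ε rather than the Rényi accounting described later, and the class and exception names are hypothetical.

```python
class BudgetExhausted(Exception):
    """Raised when a request would push the session past its epsilon budget."""


class EpsilonBudget:
    """Tracks cumulative epsilon for one session (simple additive composition)."""

    def __init__(self, total: float) -> None:
        if total <= 0:
            raise ValueError("total budget must be positive")
        self.total = total
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)

    def charge(self, epsilon: float) -> float:
        """Spend epsilon and return what remains, or refuse without spending anything."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not cause a spurious refusal.
        if self.spent + epsilon > self.total + 1e-12:
            raise BudgetExhausted(
                f"request needs {epsilon:.3f}, only {self.remaining:.3f} left"
            )
        self.spent += epsilon
        return self.remaining
```

Raising an exception instead of clamping or scaling the request is what makes the refusal explicit: the caller has to handle it, so nothing degrades silently.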

How We Built It

The privacy pipeline has three stages. First, a deterministic Safe Harbor stripper using regex plus local NER detects all 18 HIPAA identifiers plus clinical quasi-identifiers (compound codes, site IDs, doses, AE grades) and MNPI categories (efficacy values, interim results, amendment rationales). Every detected entity is replaced with a tagged placeholder. The entity map never leaves the process.
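A minimal sketch of the placeholder substitution step, regex only. The four patterns and the `[KIND_n]` tag format are assumptions chosen for this example; the stage described above also uses local NER and covers far more categories.

```python
import re

# Hypothetical patterns for illustration only; a real stripper needs many more.
PATTERNS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "COMPOUND": r"\b[A-Z]{2,4}-\d{3,5}\b",
    "EFFICACY": r"\b\d{1,3}(?:\.\d+)?%",
    "DOSE": r"\b\d+(?:\.\d+)?\s?mg\b",
}

# One alternation with a named group per category, so the text is scanned once.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def strip_entities(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected entities with tagged placeholders.

    Returns the proxy text and the placeholder -> original value map.
    The map is meant to stay in local memory only.
    """
    entity_map: dict[str, str] = {}
    seen: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}

    def _replace(match: re.Match) -> str:
        kind = match.lastgroup  # name of the category group that matched
        value = match.group(0)
        key = (kind, value)
        if key not in seen:
            # A repeated value reuses its placeholder so references stay consistent.
            counters[kind] = counters.get(kind, 0) + 1
            tag = f"[{kind}_{counters[kind]}]"
            seen[key] = tag
            entity_map[tag] = value
        return seen[key]

    return COMBINED.sub(_replace, text), entity_map
```

Under these patterns, "ORR of 47% in the 200mg arm versus 22% in control." becomes "ORR of [EFFICACY_1] in the [DOSE_1] arm versus [EFFICACY_2] in control."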

Second, a neural router classifies each request into one of three paths: abstract-extractable (the task can be done without the sensitive values — query rewritten from scratch, mathematically uninvertible), DP-tolerant (calibrated Gaussian noise added to embeddings, formal (ε, δ)-differential privacy guarantees via Rényi DP accounting), or local-only (answered entirely on-device, nothing sent to the cloud).
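For the DP-tolerant path, the following sketch shows one standard way to add Gaussian noise to a clipped embedding and track the cost with Rényi DP. It relies on two textbook facts: a Gaussian mechanism with noise multiplier z (noise standard deviation divided by L2 sensitivity) satisfies (α, α / (2z²))-RDP, and (α, ε)-RDP implies (ε + ln(1/δ)/(α − 1), δ)-DP. The function names, the α grid, and the choice of sensitivity 2 × clip norm are assumptions for this example, not a description of GhostDraft's implementation.

```python
import math
import random

# Hypothetical grid of Renyi orders to optimize over.
ALPHAS = (1.5, 2.0, 3.0, 4.0, 8.0, 16.0, 32.0, 64.0)


def gaussian_rdp(alpha: float, noise_multiplier: float) -> float:
    """RDP epsilon at order alpha for one Gaussian release."""
    return alpha / (2.0 * noise_multiplier**2)


def epsilon_after(calls: int, noise_multiplier: float, delta: float) -> float:
    """(epsilon, delta)-DP epsilon after `calls` releases; RDP composes additively."""
    return min(
        calls * gaussian_rdp(alpha, noise_multiplier)
        + math.log(1.0 / delta) / (alpha - 1.0)
        for alpha in ALPHAS
    )


def clip_and_noise(
    vec: list[float], clip_norm: float, noise_multiplier: float, rng: random.Random
) -> list[float]:
    """Clip the vector to L2 norm <= clip_norm, then add isotropic Gaussian noise.

    Assumption: any two clipped embeddings differ by at most 2 * clip_norm,
    so the noise standard deviation is noise_multiplier * 2 * clip_norm.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * 2.0 * clip_norm
    return [x * scale + rng.gauss(0.0, sigma) for x in vec]
```

A larger noise multiplier lowers ε per call, and more calls raise the cumulative ε, which is the quantity a session budget would track.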

Third, the answer applier re-injects the entity map into the cloud response, longest-key-first to avoid partial-match collisions.
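A minimal sketch of longest-key-first re-injection. The function name and the undelimited tag format in the usage example are hypothetical; they are chosen to show the collision that the ordering avoids.

```python
import re


def rehydrate(response: str, entity_map: dict[str, str]) -> str:
    """Replace placeholders in the cloud response with their original values.

    Keys are tried longest first, so a tag such as DOSE_10 is never
    partially matched by the shorter tag DOSE_1.
    """
    if not entity_map:
        return response
    keys = sorted(entity_map, key=len, reverse=True)
    # Regex alternation tries alternatives left to right, so longest-first
    # ordering here means longest match wins at every position.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass also means restored values are never rescanned for tags.
    return pattern.sub(lambda match: entity_map[match.group(0)], response)
```

For example, with `{"DOSE_1": "200mg", "DOSE_10": "50mg"}`, the text "Give DOSE_10 then DOSE_1." is restored to "Give 50mg then 200mg." Trying the shorter key first would instead turn `DOSE_10` into "200mg0".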

The backend is a FastAPI service with five endpoint groups: the core privacy pipeline (/analyze, /proxy, /route, /complete), timeline assembly (/timeline/assemble), signal detection (/signal/cluster), dataset querying (/dataset/query), and dashboard generation (/dashboard/generate). Every endpoint writes a hashed audit record — no raw content ever touches a log.

The frontend is a VS Code-style enterprise workspace built in React, TypeScript, Vite, and Tailwind. Three resizable panes, two personas (Analyst and Reviewer), a forensic dock with live ε accounting, and a three-lane view of proxy sent → cloud response → rehydrated answer. D3 and visx power the timeline and signal visualizations. TanStack Table handles the virtualized dataset view.

The integrations: Gemini 2.5 Flash (DeepMind) as the primary reasoning engine. ClickHouse Cloud for persistent audit log analytics — every pipeline call writes a row, and compliance teams can run SQL over the full history. Datadog LLM Observability traces every model call with latency, token estimates, and status — canary leak events trigger immediate alerts. Senso as a persistent knowledge base — upload clinical documents once, and the assistant references them across sessions. Supabase for authentication and per-user activity history.

The adversarial harness runs five attack classes against every proxy: verbatim scan, cross-encoder similarity, trained span inversion (DistilBERT token classifier), membership inference, and downstream utility regression. We don't just claim privacy — we measure it.
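The simplest of the five attacks, the verbatim scan, can be sketched in a few lines. The function names are hypothetical, and the definition of leak rate used here (the fraction of test cases in which at least one ground-truth sensitive value appears unchanged in the proxy) is an assumption for illustration; the harness may compute it differently.

```python
def verbatim_leaks(proxy_text: str, sensitive_values: list[str]) -> list[str]:
    """Return the ground-truth sensitive values that appear verbatim in the proxy."""
    haystack = proxy_text.casefold()
    # Case-insensitive substring check; empty strings are ignored.
    return [v for v in sensitive_values if v and v.casefold() in haystack]


def verbatim_leak_rate(cases: list[tuple[str, list[str]]]) -> float:
    """Fraction of (proxy_text, sensitive_values) cases with at least one leak."""
    if not cases:
        return 0.0
    leaked = sum(1 for proxy, values in cases if verbatim_leaks(proxy, values))
    return leaked / len(cases)
```

A scan like this only catches exact reuse. Paraphrased or reconstructed values are what the similarity and trained-inversion attacks are meant to cover.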

Challenges We Ran Into

The biggest finding came by accident. We swept the differential privacy parameter ε across a 10× range, expecting a smooth privacy-utility curve. Instead, the utility was identical at every ε to six decimal places while the noise magnitude varied by 10×. The noise was being injected correctly into the hidden state — but the decoder was ignoring it entirely. Under greedy decoding anchored to the placeholder-substituted input, the paraphrase is a deterministic function of the text, not of the noisy embedding. The (ε, δ)-DP guarantee holds mathematically on the embedding. It does not propagate to the text surface.

This is not a bug. It's a structural mismatch between where DP is defined and where the adversary operates. Any system that injects DP noise into a hidden state and decodes through an unmodified language model faces the same gap.

The second finding: routing dominates over noise. Abstract-extractable achieves 0.00 verbatim leak rate on efficacy values. DP-tolerant achieves 0.67 on the same category at the same ε. The DP parameter is identical. Only the routing decision differs. The router is the operative privacy control variable — not the noise level.

Accomplishments We're Proud Of

A working privacy proxy that actually measures its own guarantees rather than just asserting them. A five-attack adversarial harness with ground-truth labels that covers verbatim and trained-inversion threat models that mathematical DP analysis cannot address. A mechanistic negative result — formal DP on hidden-state representations does not propagate to text-surface privacy — that generalizes to any system with the same abstraction boundary. A VS Code-style enterprise workspace that makes the privacy pipeline invisible to the user while giving compliance teams full forensic visibility. And an honest account of where the system fails and why.

What We Learned

Mathematical DP guarantees are not evidence of text-surface privacy. Routing dominates over noise as the operative privacy control variable. The biggest leakage risk in clinical trials isn't PHI — it's IP. And a system that promises privacy without disclosing which task classes are content-coupled will, for those tasks, silently pick one horn of a dilemma the user didn't know existed. A well-aligned system surfaces that tradeoff explicitly.

What's Next

Decoder-aware DP: perturbation along the decoder's output-sensitive directions rather than isotropic noise. A trained privacy decoder: a seq2seq model trained on (noisy embedding → privacy-preserving text) pairs, so noise becomes a hard conditioning signal. Content-coupling detection with explicit UI signaling when a task hits the binary privacy-utility tradeoff. Enterprise deployment with hospital or CRO-hosted inference, centralized audit logging, SSO, and compliance dashboards for CISOs.

Built With

Python · FastAPI · PyTorch · Hugging Face Transformers · sentence-transformers · DistilBERT · Pydantic · React · TypeScript · Vite · Tailwind · D3 · visx · TanStack Table · Gemini 2.5 Flash · ClickHouse Cloud · Datadog · Senso · Supabase · Rényi Differential Privacy · HIPAA Safe Harbor · CTCAE v5.0 · ICH E2A · WHO-UMC
