ProtoCheck

Project Story

Inspiration

Each protocol amendment costs the pharmaceutical industry $535K and adds 3+ months to clinical trial timelines (Tufts CSDD). The root cause: regulatory compliance gaps are discovered late, after protocols are finalized, sites are activated, and patients are enrolled. We asked: what if AI could catch these gaps before a protocol leaves the drafting table?

We're a 2-person team (regulatory affairs + engineering) who watched firsthand how RA teams manually cross-reference 500-page protocols against FDA, EMA, ICH, and WHO guidelines — a process that takes weeks and still misses issues. ProtoCheck was born from the conviction that AI can do this faster, more consistently, and with full traceability.

What it does

ProtoCheck is an AI-powered clinical trial protocol compliance platform that:

Ingests clinical trial protocols (PDF upload with OCR quality validation and version detection)
Structures them using a Protocol Structure Ontology (PSO) — mapping protocol sections to 12 canonical categories aligned with ICH M11 CeSHarP
Analyzes each section against a 92-rule regulatory catalog spanning 5 authorities (FDA, EMA, ICH, WHO, MOHAP) using Claude AI via Amazon Bedrock
Scores compliance with severity-anchored findings (Critical / Major / Minor), false-positive filtering, and determinism verification
Reports findings with evidence deep-linking — each finding traces back to the exact protocol section, the governing regulatory rule, and the authority source

The output: a compliance report with actionable findings that helps regulatory affairs teams fix protocol gaps before they become costly amendments.

Real results from our validation cohort: 9 protocols across 6 therapeutic areas, 5 sponsors — 203 findings identified, 7.4% false-positive rate (target was <20%), and determinism verified at 94.59% Jaccard similarity across repeated runs.

How we built it

ProtoCheck runs on a fully serverless AWS architecture — zero idle cost, per-invocation billing, and multi-tenant data isolation:

Analysis Pipeline (AWS Step Functions, 31 states):

Upload (S3 Presigned) → Extract Protocol → Conformance Preprocessing
→ Vectorize (Titan v2, 1024-dim) → Prepare Assess Shards
→ Assess Shards (Map parallel, Claude via Bedrock, temp=0.0)
→ Aggregate (rule_id + section_id dedup) → Evaluate (9 quality gates)
→ Generate Findings → Compliance Report
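
As an illustration of the Prepare Assess Shards step, here is a minimal Python sketch of how one shard per protocol section might be built, with only the rules mapped to that section's category attached. The `Section` type, the `RULES_BY_CATEGORY` table, and the rule IDs are hypothetical stand-ins for this example, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    section_id: str
    category: str  # canonical category assigned during conformance preprocessing
    text: str


# Hypothetical rule-to-section mapping; IDs are invented for illustration.
RULES_BY_CATEGORY = {
    "informed_consent": ["RULE-002", "RULE-001"],
    "safety_reporting": ["RULE-003"],
}


def prepare_shards(sections):
    """Build one shard per section that has at least one applicable rule."""
    shards = []
    for section in sections:
        rule_ids = RULES_BY_CATEGORY.get(section.category, [])
        if not rule_ids:
            continue  # no mapped rules: nothing to assess for this section
        shards.append(
            {
                "section_id": section.section_id,
                "rule_ids": sorted(rule_ids),  # stable order for repeatable prompts
                "text": section.text,
            }
        )
    # Sort shards so the Map input is identical across runs.
    return sorted(shards, key=lambda shard: shard["section_id"])
```

Each returned dictionary would then be one item in the Map state's input array, so a shard can be assessed without knowledge of the others.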

Key architectural decisions:

Sharded assessment: Each protocol section is assessed independently via Map-state parallelism — eliminates Lambda timeout issues on large protocols and enables deterministic per-section analysis
Protocol Structure Ontology (PSO): 60 section templates + 96 rule-to-section mappings in DynamoDB. The conformance pipeline classifies each protocol chunk before assessment, enabling precision rule routing (only fire relevant rules per section)
Severity anchoring: Each rule has a canonical severity. The LLM can deviate with justification, but unjustified deviations are auto-corrected — solving the "LLM severity drift" problem
Determinism verification: An 8-gate automated suite (including Jaccard similarity, score variance, finding-count delta, critical stability, and authority-score drift) runs after every pipeline change
Per-tenant S3 isolation: Each customer gets a dedicated S3 bucket (ctwise-app-{tenant_id}-{env}) — protocols never co-mingle
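
The severity anchoring rule above can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the `Finding` fields, the `CANONICAL_SEVERITY` table, and the rule IDs are hypothetical, and "justified" is reduced here to "a non-empty justification string is present".

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical canonical severities; in the text these live in the rule registry.
CANONICAL_SEVERITY = {"RULE-001": "Critical", "RULE-002": "Minor"}


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str  # severity proposed by the LLM
    justification: Optional[str] = None
    corrected: bool = False


def anchor_severity(finding: Finding) -> Finding:
    """Keep a justified deviation; reset an unjustified one to the canonical severity."""
    canonical = CANONICAL_SEVERITY[finding.rule_id]
    if finding.severity == canonical:
        return finding
    if finding.justification and finding.justification.strip():
        return finding  # deviation allowed because a justification was given
    return replace(finding, severity=canonical, corrected=True)
```

Recording a `corrected` flag keeps the auto-correction visible in the audit trail instead of silently overwriting the model's output.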

Frontend: React 18 + TypeScript + Vite + Tailwind CSS, with Zustand state management and real-time WebSocket progress updates during analysis.

GxP compliance: 21 CFR Part 11 audit trail (append-only S3 JSONL, 7-year retention), cascade soft-delete with data retention, full traceability from requirements through deployment.

Challenges we ran into

LLM determinism: Claude's probabilistic nature meant repeated analyses of the same protocol produced different findings. We solved this through temperature=0.0, content-addressed deduplication (rule_id, canonical_section_id, evidence_hash), fan-out caps, and the 8-gate determinism verification framework. Final result: 94.59% Jaccard similarity across runs.
False positives from protocol fragments: Amendment histories, abbreviation tables, and truncated sections were triggering compliance findings. We built content-type heuristics (Strategies A through F) that classify and filter non-substantive content before assessment, which reduced the finding count by 24.8% without dropping any genuine finding.
Severity calibration at scale: With 92 rules and 5 regulatory authorities, keeping severity ratings consistent required expert calibration. We partnered with a regulatory affairs expert to calibrate every rule with confirmed severity, scope tiers, and modifier conditions — then enforced these via prompt-level constraints and auto-correction.
Multi-authority regulatory mapping: Different authorities (FDA vs EMA vs ICH) sometimes have overlapping or conflicting requirements. The rule registry design with authority-specific scoping and cross-reference mapping handles this without double-counting.
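
To make the determinism work above concrete, here is a minimal Python sketch of content-addressed deduplication and a Jaccard comparison between two runs. It assumes findings are dictionaries with `rule_id`, `canonical_section_id`, and `evidence` fields, that evidence is normalized by lowercasing and collapsing whitespace before hashing, and that run similarity is computed over the same keys. All of these are assumptions made for this example.

```python
import hashlib


def evidence_hash(evidence: str) -> str:
    """Hash evidence text after normalizing case and whitespace (assumed normalization)."""
    normalized = " ".join(evidence.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def finding_key(finding: dict) -> tuple:
    """Content-addressed key: (rule_id, canonical_section_id, evidence_hash)."""
    return (
        finding["rule_id"],
        finding["canonical_section_id"],
        evidence_hash(finding["evidence"]),
    )


def dedupe(findings: list) -> list:
    """Keep the first finding seen for each key, preserving input order."""
    unique = {}
    for finding in findings:
        unique.setdefault(finding_key(finding), finding)
    return list(unique.values())


def jaccard(run_a: list, run_b: list) -> float:
    """Jaccard similarity of two runs: |A intersect B| / |A union B| over finding keys."""
    keys_a = {finding_key(f) for f in run_a}
    keys_b = {finding_key(f) for f in run_b}
    if not keys_a and not keys_b:
        return 1.0  # two empty runs are treated as identical
    return len(keys_a & keys_b) / len(keys_a | keys_b)
```

A score of 1.0 means both runs produced the same set of keyed findings. Lower values show how much of the combined set appears in only one run.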

Accomplishments that we're proud of

92 regulatory rules across 5 authorities (FDA, EMA, ICH, WHO, MOHAP), each expert-calibrated with severity, scope, and section mappings
9-protocol validation cohort spanning oncology, vaccines, CNS, rare disease, cardiovascular, and infectious disease — with a 7.4% false-positive rate
ICH M11 CeSHarP conformance — our PSO aligns with the emerging international standard for clinical protocol structure
40+ engineering sprints from concept to production-ready, with full GxP traceability (316+ verification checklist items, 6-dimension traceability matrix)
100% Infrastructure as Code — 16 CloudFormation modules, all ARM64 Lambdas, zero console-created resources
$1-2/month vector storage using S3 Vectors with Titan v2 embeddings (replaced Aurora pgvector at $90-100/month — 98% cost reduction)

What we learned

Determinism is a feature, not an afterthought: In regulated industries, "the AI gave a different answer this time" is a non-starter. Building determinism verification into the pipeline from the start — not bolting it on later — was the most important architectural decision we made.
Expert calibration beats prompt engineering: We spent weeks tuning prompts to get consistent severity ratings. What actually worked was encoding expert-calibrated canonical severities into the rule registry and using auto-correction as the enforcement mechanism. The LLM deviates from the canonical severity ~60% of the time regardless of prompt constraints, so the correction layer is what aligns the output.
Serverless + sharding = cost-effective AI pipelines: By sharding assessment across protocol sections (one Bedrock invocation per section), we eliminated timeout issues, enabled parallelism, and kept per-analysis costs proportional to protocol size.
GxP compliance from day one saves time: Retrofitting audit trails and traceability is painful. Building them into the architecture from Sprint 1 meant every subsequent feature got compliance "for free."

What's next for ProtoCheck

Design partner onboarding: Currently validating with pharma RA teams and CROs in the tri-state area. Targeting 3-5 design partners by Q3 2026.
ICH M11 Phase B/C: Full M11 structural conformance checking (Phase A complete with 58-section CeSHarP template)
CDISC alignment expansion: Deeper integration with CDISC standards for protocol-to-data mapping
Multi-document analysis: Cross-referencing protocols against Investigator Brochures, Statistical Analysis Plans, and prior protocol versions
AWS Marketplace launch: Contract-based subscription (Professional / Enterprise / Elite tiers) via AWS Marketplace Solutions