TrustedRisk Tools

Inspiration

Healthcare-AI tool surfaces today publish calculations without the context that makes them safe to act on. An LLM agent calls a "compute readmission risk" endpoint and gets back probability = 0.18. The number is meaningless without the calibration of the estimator on its training population, the conformal prediction set its uncertainty implies, the subgroup audit its fairness was checked against, and the audit hash that lets a regulator replay the decision a year later. Today every downstream caller is silently expected to invent this scaffolding, and they almost never do.

The clinical consequences are documented and large. On external validation, a widely deployed proprietary sepsis early-warning model discriminated poorly (AUC 0.63), missing roughly two-thirds of sepsis cases while alerting on 18% of hospitalised patients (Wong et al., JAMA Internal Med 2021). The CMS Hospital Readmissions Reduction Program penalised hospitals roughly USD 580M per fiscal year for excess readmissions, with the calibration of the underlying risk model directly translating into per-hospital penalty exposure (CMS HRRP final rules 2023). Clinician adoption surveys consistently report distrust of AI outputs whose error envelope is opaque (Asan 2020, NEJM AI 2024).

Rule-based software cannot fix this. Calibrated probabilities require shrinkage from population data; conformal prediction sets require Romano-2020-style score quantiles computed against a calibration split; out-of-distribution abstention requires posterior-width inspection on a Beta-Binomial bin. Generative AI in the form of probabilistic inference, statistical learning, and language-grounded retrieval is what makes it tractable. The pre-condition for using any of those safely is that the tool surface itself carry the calibration substrate; otherwise the LLM at the call site has nothing to verify against. The bet behind TrustedRisk Tools is that this scaffolding belongs in the MCP server, not in the calling agent: a caller that is too lazy or too fast still gets a clinically defensible answer, and a caller that does care can read the entire trust trace.

What it does

TrustedRisk Tools is a FastMCP streamable-HTTP server that publishes 145 compute_* / detect_* / ground_* tools across 47 thematic bundles with the SHARP-on-MCP context-required header contract (X-FHIR-Server-URL, X-FHIR-Access-Token, X-Patient-ID + optional X-FHIR-Refresh-Token / X-FHIR-Refresh-Url for offline-access mode). The 47 bundles cover the verticals a hospital actually runs:

Vertical	Sample bundles
ED / acute care	sepsis, stroke reperfusion, trauma severity, DKA, AKI staging, contrast safety, ACS disposition
Inpatient	NEWS2, glycemic control, falls + delirium, peri-op risk, Caprini VTE, ARISCAT, RCRI
Discharge / continuity	LACE-band readmission, medication reconciliation, polypharmacy + DDI, caregiver hand-off, post-discharge FAQ
Specialty	cardiology depth (HEART / GRACE / TIMI), neurology depth (ICH / mRS / Hunt-Hess), GI hepatology, heme-onc, endocrine, infectious-disease, OB / preeclampsia, pediatric early warning + weight-based dosing, PGx
Documentation / coding	admission H&P, progress note, consult letter, discharge summary, ICD-10 + CPT auto-coding, coding audit
Payer / appeals	PA evidence pack, payer rules match, PA letter, appeal-likelihood, denial-letter parse, appeal letter, appeal escalation path
Quality / population	HEDIS aggregate, CMS Stars forecast (FY 2025 5/4/3-star benchmarks), care-gap priority ranking, syndromic surveillance, vaccine-reminder cohort, DP-noised outbreak heatmap
Cross-cutting	calibrated risk, fairness audit, conformal sets, counterfactual explanations, causal inference, federated learning

Every tool returns a typed Pydantic v2 artefact. Every probabilistic tool returns a calibrated point estimate plus its CI95, conformal prediction set, abstain flag, abstain reason (if any), and the citation list backing the recommendation. Every fairness-aware tool returns the subgroup audit it ran and the per-axis EOO + DP gaps it found.
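The return contract above can be illustrated with a minimal sketch. It uses a stdlib dataclass rather than Pydantic v2 so that it runs without dependencies; the class name RiskArtefact, its field names (other than abstain_recommended and abstain_reason, which appear in this document), and the validation rules are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskArtefact:
    """Hypothetical shape of a probabilistic tool's return value."""
    probability: float                  # calibrated point estimate
    ci95: tuple[float, float]           # 95% interval around the estimate
    conformal_set: tuple[str, ...]      # labels kept by the conformal predictor
    abstain_recommended: bool = False
    abstain_reason: Optional[str] = None
    citations: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        lo, hi = self.ci95
        if not (0.0 <= lo <= self.probability <= hi <= 1.0):
            raise ValueError("expected 0 <= ci_low <= probability <= ci_high <= 1")
        if self.abstain_recommended and not self.abstain_reason:
            raise ValueError("an abstaining artefact must state its reason")

    def presentable_estimate(self) -> Optional[float]:
        """Return the number only when it may be shown as a final estimate."""
        return None if self.abstain_recommended else self.probability


# Toy values, not real tool output.
ok = RiskArtefact(0.18, (0.12, 0.25), ("low", "moderate"))
ood = RiskArtefact(0.41, (0.05, 0.90), ("low", "moderate", "high"),
                   True, "lace_calibration_plateau_OOD")
print(ok.presentable_estimate(), ood.presentable_estimate())
```

Putting the abstain check behind an accessor is one way a presenter could be kept from reading the raw mean by accident; the real artefacts may enforce this differently.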

Three concrete capabilities the surface delivers fall outside what rule-based software can do.

Calibrated probabilistic estimation with population shrinkage. Every probabilistic tool ships a 5-bin Beta-Binomial posterior whose CI95 widens automatically when bin sample size is small, and a Romano-2020 LAC conformal score that turns the point estimate into a marginal-coverage prediction set. Both are statistical-learning constructs, not lookup tables, and both require a calibration cohort the rule-based path cannot represent.

Out-of-distribution detection with structured abstention. When an input falls into the LACE calibration plateau, the posterior loses confidence and the tool returns abstain_recommended = True with abstain_reason = "lace_calibration_plateau_OOD". The numeric mean is still computed for transparency, but the abstain banner forces any downstream LLM presenter not to cite it as a final estimate (a rule-based system answers; an OOD-aware system refuses).

Grounded language generation with deterministic floor. The scribe family (compute_progress_note_draft, compute_admission_hnp_draft, compute_consult_letter_draft, compute_discharge_summary_draft) builds a Joint-Commission-shaped artefact from the FHIR bundle plus narrative blocks read verbatim from chart-attached DocumentReference SOAP sections, then optionally pipes the output through an LLM polish pass via LiteLLM (Ollama / Gemini / Claude / OpenAI). The polish is additive, and cite-back IDs and structured fields are immutable. A pure rule-based template cannot read free-text SOAP sections, while a pure LLM cannot guarantee cite-back integrity.

A fourth capability, Wachter-2018 minimum-distance counterfactuals (compute_counterfactual_explanation), finds the smallest modification that flips a recommendation tier, gated by the fairness-aware variant on protected attributes per Kusner-2017 SCM.
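The conformal prediction set described above can be sketched with split conformal prediction and the LAC score s = 1 - p_hat(true label). This is a minimal illustration under assumptions: the function names, the binary "readmit" / "no_readmit" labels, and the nine calibration points are invented for the example and are not taken from the project's calibration cohort.

```python
import math


def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile of the LAC score s = 1 - p_hat(true label)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every label is kept
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """Keep every label whose score 1 - p_hat(label) is at most q_hat."""
    return {label for label, p in probs.items() if 1.0 - p <= q_hat}


# Toy calibration split: model probability assigned to the label that occurred.
true_label_probs = [0.9, 0.8, 0.7, 0.85, 0.6, 0.75, 0.95, 0.5, 0.65]
cal_probs = [{"readmit": p, "no_readmit": 1.0 - p} for p in true_label_probs]
cal_labels = ["readmit"] * len(cal_probs)

q_hat = lac_threshold(cal_probs, cal_labels, alpha=0.1)
confident = prediction_set({"readmit": 0.9, "no_readmit": 0.1}, q_hat)
ambiguous = prediction_set({"readmit": 0.5, "no_readmit": 0.5}, q_hat)
print(q_hat, confident, ambiguous)
```

A confident input yields a single-label set while an ambiguous one keeps both labels, which is the behaviour a downstream presenter can use to decide whether a point estimate is safe to show. The coverage guarantee is marginal over exchangeable data, not per patient.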

How it's structured

The MCP server has six layers, each addressable independently.

Tool surface (src/mcp_server/tools/*). The 145 callables, each registered with FastMCP via a register(mcp) shim so a marketplace consumer that wants only one bundle can subscribe to it without pulling in the rest.

Bundle registry (src/mcp_server/tools/__init__.py). The 47-bundle map publishing the SMART-on-FHIR scopes each bundle needs.

Calibration substrate. A 5-bin Beta-Binomial posterior with automatic CI95 widening when bin sample size is small, reproducible from data/coefficients.json; recalibration scripts for Synthea-100k and MIMIC-IV demo are checked into data/.

Trustworthy-ML substrate (src/a2a_agent/*). 96 modules covering Romano 2020 LAC conformal, Romano 2019 CQR, Pleiss 2017 fairness/calibration tension, El-Yaniv-Wiener 2010 selective classification, Wachter 2018 counterfactuals, Athey-Wager 2019 honest causal forest CATE, doubly-robust ATE (IPW + g-formula + AIPW), Fine-Gray competing-risks, McMahan 2017 FedAvg, Li 2020 FedProx, and DP-FedAvg with cumulative-ε tracking.

SHARP-on-MCP middleware (src/mcp_server/sharp/*). Extracts the FHIR-context headers on every non-framework JSON-RPC method, binds a FHIRContext ContextVar that every tool reads, and returns 403 with a spec_reference body when required headers are missing.

OAuth substrate (src/mcp_server/oauth/*). RFC 6749 §4.4 client_credentials grant with HS256 JWT and tenant-allowlist validation.
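The SHARP-on-MCP middleware step can be pictured with a framework-free sketch. The header names come from this document; everything else is an assumption made for illustration: the function name bind_context, the set of methods treated as "framework" methods, the field names on FHIRContext, and the shape of the 403 body beyond its spec_reference key.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_HEADERS = ("X-FHIR-Server-URL", "X-FHIR-Access-Token", "X-Patient-ID")
# Assumption: which JSON-RPC methods are exempt from the context requirement.
FRAMEWORK_METHODS = frozenset({"initialize", "ping", "tools/list"})


@dataclass(frozen=True)
class FHIRContext:
    server_url: str
    access_token: str
    patient_id: str
    refresh_token: Optional[str] = None
    refresh_url: Optional[str] = None


fhir_context: ContextVar[Optional[FHIRContext]] = ContextVar("fhir_context", default=None)


def bind_context(method: str, headers: Mapping[str, str]) -> tuple[int, dict]:
    """Return (status, body); on success the context is bound for tools to read."""
    if method in FRAMEWORK_METHODS:
        return 200, {}
    lowered = {name.lower(): value for name, value in headers.items()}  # HTTP headers are case-insensitive
    missing = [h for h in REQUIRED_HEADERS if not lowered.get(h.lower())]
    if missing:
        return 403, {
            "error": "missing_fhir_context",      # hypothetical error code
            "missing_headers": missing,
            "spec_reference": "sharponmcp.com",
        }
    fhir_context.set(FHIRContext(
        server_url=lowered["x-fhir-server-url"],
        access_token=lowered["x-fhir-access-token"],
        patient_id=lowered["x-patient-id"],
        refresh_token=lowered.get("x-fhir-refresh-token"),
        refresh_url=lowered.get("x-fhir-refresh-url"),
    ))
    return 200, {}


status, body = bind_context("tools/call", {"X-Patient-ID": "example-patient"})
print(status, body["missing_headers"])
```

Binding through a ContextVar means a tool never takes the token as an argument, so the values stay out of tool signatures and out of anything an LLM caller can see in the schema.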

Calibrated abstention as system posture

The 26 ABSTAIN cases out of 150 in the e2e harness are not deficiencies; they are the system's clinical commitment. Abstentions take three shapes.

OOD calibration. The input falls on the LACE plateau in the Beta-Binomial posterior; the numeric mean is still computed for transparency but flagged "NOT calibrated for this patient".

Missing clinician input. Tools that require values not derivable from a FHIR bundle (NIHSS item scores, RECIST lesion table, PGx genotypes, FAQ question text) abstain rather than fabricate. A top-of-message banner, "ABSTAINED on N step(s) - DO NOT present numeric outputs as final estimates", keeps the chat-LLM from silently rendering a number as if it were calibrated.

Optional-step skip. A workflow's accessory step that abstains is reported separately ("skipped optional step for transparency") and does NOT flip the workflow-level abstain flag, so the rest of the bundle still completes and the gap is visible.
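The OOD-calibration shape can be sketched as a width check on one bin's Beta posterior. This is an illustration under stated assumptions: the prior rate, prior strength, and the 0.20 width threshold are hypothetical values, the interval is estimated by seeded Monte Carlo draws because the standard library has no Beta quantile function, and the function name bin_posterior is invented here.

```python
import random


def bin_posterior(events, n, prior_rate=0.15, prior_strength=10.0,
                  max_width=0.20, draws=20000, seed=0):
    """Beta posterior for one calibration bin, with width-based abstention."""
    # Shrinkage prior centred on a population rate (hypothetical values).
    a = prior_rate * prior_strength + events
    b = (1.0 - prior_rate) * prior_strength + (n - events)
    rng = random.Random(seed)  # seeded so the interval is reproducible
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int(0.025 * draws)]
    hi = samples[int(0.975 * draws) - 1]
    abstain = (hi - lo) > max_width
    return {
        "mean": a / (a + b),  # still reported for transparency
        "ci95": (lo, hi),
        "abstain_recommended": abstain,
        "abstain_reason": "lace_calibration_plateau_OOD" if abstain else None,
    }


sparse = bin_posterior(events=2, n=5)       # few patients in the bin
dense = bin_posterior(events=200, n=1000)   # well-populated bin
for name, post in (("sparse", sparse), ("dense", dense)):
    lo, hi = post["ci95"]
    print(name, round(post["mean"], 3), round(hi - lo, 3), post["abstain_recommended"])
```

With five patients the interval is several times wider than with a thousand, so the sparse bin abstains while still reporting its mean.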

Why it matters - literature-anchored impact hypotheses

The pain points TrustedRisk Tools targets are documented dollar / outcome / time problems with literature evidence the hypothesis can be tested against.

Sepsis 1-hour bundle compliance and mortality. Each hour of delay in sepsis-bundle initiation increases mortality by approximately 4% (Seymour et al., NEJM 2017; Kumar et al., Crit Care Med 2006). The qSOFA / SOFA / lactate / empiric-antibiotic chain TrustedRisk publishes lets an agent precompute the bundle decision in under a second of wall-clock, with susceptibility-aware antibiotic selection and IV-to-PO de-escalation as an explicit workflow. Hypothesis: bundle initiation latency drops from human-timeline (median 180 minutes per CMS SEP-1 abstraction) to agent-timeline.

Readmission risk calibration vs HRRP penalty exposure. HRRP payment reductions run from 0 to 3% of a hospital's Medicare reimbursement per performance year (CMS HRRP 2023), so a miscalibrated readmission band translates directly into penalty exposure. TrustedRisk's calibrated LACE band ships ECE 0.0078 against a 0.05 preferred-gate threshold, with the Beta-Binomial posterior published on disk and the calibration script reproducible from scripts/build_all_artefacts.py. Hypothesis: a hospital substituting an opaque vendor model with a calibration-published model can directly audit its own HRRP exposure rather than waiting for a CMS reconciliation report.
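For readers unfamiliar with the metric, expected calibration error (ECE) can be computed as below. This is a generic sketch of the standard equal-width-bin definition; the choice of five bins mirrors the 5-bin posterior mentioned in this document but is an assumption about how the reported figure was binned, and the data are toy values.

```python
def expected_calibration_error(probs, outcomes, n_bins=5):
    """Weighted mean gap between predicted probability and observed event rate."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and the same length")
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum of probs, sum of outcomes, count]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)     # p == 1.0 falls in the last bin
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    total = len(probs)
    return sum(
        (count / total) * abs(prob_sum / count - event_sum / count)
        for prob_sum, event_sum, count in bins
        if count
    )


# Toy data: predictions of 0.1 and 0.9 that are each off by 0.1 on average.
ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
gate = 0.05  # the preferred-gate threshold quoted above
print(round(ece, 4), ece <= gate)
```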

Clinical documentation time burden. Clinicians spend roughly half their working day on EHR documentation (Sinsky et al., Annals Internal Med 2016). The scribe family produces a Joint-Commission-shaped admission H&P / progress note / consult letter / discharge summary with cite-back IDs in roughly 200 ms of deterministic computation, leaving the LLM polish optional. Hypothesis: a clinician using a TrustedRisk-backed scribe agent reduces per-encounter documentation to seconds of review rather than minutes of dictation, while the cite-back IDs preserve the audit chain a Joint Commission surveyor would expect.

Antimicrobial stewardship. IDSA / SHEA stewardship guidelines call for de-escalation review at therapy day 2-3 (IDSA / SHEA 2016). The compute_antibiotic_de_escalation tool consumes culture susceptibility, current regimen, days-on-therapy, and IDSA "5 Cs" oral-switch criteria, returning the narrowest acceptable agent and an IV-to-PO eligibility decision. Hypothesis: stewardship decisions move from a daily round to an event-driven loop, with the audit chain logging every de-escalation rationale.

Clinical AI adoption gap. Recent surveys put hospital adoption of FDA-cleared AI clinical tools below 30% on average (NEJM AI 2024) with calibration opacity and audit absence cited among the top reasons. The MCP server's posture (calibration-on-disk, conformal-on-disk, audit-chain-on-output) directly targets the audit-trail dimension of that adoption gap.

Compliance, safety, validation

Standards adherence. SHARP-on-MCP context propagation per the vendor-neutral specification (sharponmcp.com) plus the Prompt-Opinion-namespaced ai.promptopinion/fhir-context capability, so any client of either spec sees the same contract. HL7 FHIR R4 native consumer with SMART-on-FHIR launch flow (PKCE + state + nonce). Bulk FHIR $export NDJSON streaming. HL7 v2.x pipe-delimited parser. HL7 C-CDA R2.1 XML chart parser. DICOM SR Part-10 explicit-VR LE binary parser. OMOP CDM v5.4 exporter producing PERSON / VISIT_OCCURRENCE / CONDITION_OCCURRENCE / MEASUREMENT / DRUG_EXPOSURE / NOTE rows. CDS Hooks v1.1 service. OpenAPI 3.1 spec with dual security schemes (OAuth client_credentials + API key).

Multi-tenant isolation. RFC 6749 §4.4 client_credentials grant with HS256-signed JWT bearer tokens. The tenant claim is matched against an allowlist of FHIR server URLs at request time, so a token issued for tenant A cannot be used to query tenant B even if the bearer token is otherwise valid. Single-tenant deployments operate on SHARP context alone.
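The tenant check can be sketched with the standard library alone. This is a minimal illustration of HS256 signing plus an allowlist lookup, not the project's OAuth module: the claim name "tenant", the function names, the allowlist shape (tenant mapped to a set of FHIR server URLs), and the example URLs are all assumptions, and a production service would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret, tenant, ttl_s=300, now=None):
    """Build a compact HS256 JWT carrying a hypothetical 'tenant' claim."""
    issued = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({"tenant": tenant, "exp": issued + ttl_s}).encode())
    signing_input = f"{header}.{claims}"
    return f"{signing_input}.{_b64url(_sign(secret, signing_input))}"


def authorize(token, secret, fhir_server_url, allowlist, now=None):
    """True only if the token verifies, is unexpired, and its tenant may use this URL."""
    current = int(time.time()) if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
        expected = _sign(secret, f"{header_b64}.{claims_b64}")
        signature_ok = hmac.compare_digest(expected, _b64url_decode(sig_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return False
    if not (isinstance(header, dict) and isinstance(claims, dict)):
        return False
    if header.get("alg") != "HS256" or not signature_ok:
        return False
    if not isinstance(claims.get("exp"), int) or claims["exp"] <= current:
        return False
    tenant = claims.get("tenant")
    return isinstance(tenant, str) and fhir_server_url in allowlist.get(tenant, frozenset())


SECRET = b"example-secret"  # placeholder, never hard-code a real key
ALLOWLIST = {
    "tenant-a": {"https://fhir.a.example.org/r4"},
    "tenant-b": {"https://fhir.b.example.org/r4"},
}
token_a = issue_token(SECRET, "tenant-a", now=1000)
print(authorize(token_a, SECRET, "https://fhir.a.example.org/r4", ALLOWLIST, now=1001))
print(authorize(token_a, SECRET, "https://fhir.b.example.org/r4", ALLOWLIST, now=1001))
```

The second call is the cross-tenant case: the signature is valid, but the FHIR server URL is not on tenant A's list, so the request is refused.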

Privacy + audit. Presidio analyzer + anonymizer on every chart text input. Append-only RFC 6962 Merkle audit chain with SHA-256 hashing, PHI-redacted before hashing, and reproducible byte-identical replay of any historical decision. Differential-privacy publication of subgroup statistics (Laplace mechanism, configurable ε, sensitivity 1) suitable for safe public release.
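The replay property of the audit chain rests on the RFC 6962 Merkle tree hash, which can be sketched as follows. The hashing rules (0x00 prefix for leaves, 0x01 prefix for interior nodes, split at the largest power of two below the entry count) follow RFC 6962; the function names are illustrative, and the entries are assumed to be already PHI-redacted byte strings.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over a 0x00 prefix plus the entry."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior hash: SHA-256 over a 0x01 prefix plus both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Merkle tree hash of an ordered, append-only list of entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Toy, already-redacted audit entries (hypothetical content).
log = [b'{"tool":"compute_readmission_risk","band":"moderate"}',
       b'{"tool":"ground_claim","status":"grounded"}']
root_before = merkle_root(log)
log.append(b'{"tool":"compute_antibiotic_de_escalation","decision":"iv_to_po"}')
print(root_before.hex()[:16], merkle_root(log).hex()[:16])
```

Because the distinct leaf and node prefixes prevent an interior node from being passed off as a leaf, replaying the same redacted entries in the same order reproduces the same root, and altering any historical entry changes it.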

Regulatory artefact coverage. 14-section EU AI Act Annex IV regulatory pack at 100% artefact coverage (every section traces to a checked-in artefact). FDA SaMD analysis classifying the tool surface as Class II for most decision-support tools and Class III for the acute-care boundary subset, with a 510(k) + De Novo + PCCP submission strategy and a documented Pillar-3 (clinical validation) open item. ISO 13485 §4 / §7 / §8 design-controls checklist with the gap to a Class II audit explicit. Mitchell 2019 Model Card + Gebru 2021 Datasheet for the calibration cohort. NIST AI RMF 1.0 + OECD AI Principles 2019 crosswalks (19 rows total). HIPAA §164 crosswalk (11 controls mapped). GDPR Article 35 DPIA.

Safety + adversarial robustness. Red-team v3 multi-target campaign (110 prompts) + v4 indirect-prompt-injection corpus (14 cases embedded in Observation.note / MedicationRequest.dosageInstruction.text / Patient.alias); every v4 case lands in safe posture. ADWIN + DDM concept-drift detectors for streaming production predictions. Liu-2020 energy-based OOD detector. FGSM-style adversarial perturbation analysis with finite-difference gradients.

Validation evidence. 4224 unit + integration + golden + adversarial tests passing. ECE 0.0078 internal calibration on n=7,880 Synthea cohort; ECE 0.0001 on Synthea-100k recalibration; ECE 0.0187 on MIMIC-IV demo (n=275, external validation). Federated global ECE 0.0010 vs centralized 0.0020. Cost simulator: USD 1.8M saved + 6.25 QALYs gained on a 10,000-cohort simulation versus the LACE-only baseline.

Honest take on what we resolved + what is still open

✅ Resolved during the v1.0 push

27 specialist routes that previously crashed with TypeError: missing N required positional argument when the dispatcher overlay didn't carry a kwarg the tool required. Root cause: Route.inputs={} on every entry, expecting the FHIR overlay to populate kwargs by name; for ~25 tools the required parameters were specific labs / scores not derivable generically. Fix in three parts: extended the overlay with LOINC-mapped lab values (creatinine baseline + current, pH, HCO3, glucose, AST/ALT, platelets, bilirubin, hemoglobin, WBC, BUN, INR, gestational age weeks); wrapped the dispatcher with TypeError -> structured abstain_reason="missing_clinician_supplied_inputs"; removed 5 population-level routes that were misconceived for single-patient bundles. Today: 0 CRASH on 150 e2e cases.
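The crash-to-abstain change can be sketched as a dispatcher that checks argument binding before calling the tool. This is a variant of the fix described above rather than the project's code: it uses inspect.signature(...).bind instead of catching TypeError around the call, so that a TypeError raised inside a tool is not mistaken for a missing input. The function names and the toy tool are hypothetical.

```python
import inspect
from typing import Any, Callable, Mapping


def dispatch(tool: Callable[..., Any], overlay: Mapping[str, Any]) -> Any:
    """Call tool with the overlay values it asks for, or abstain if some are missing."""
    signature = inspect.signature(tool)
    kwargs = {
        name: overlay[name]
        for name in signature.parameters
        if overlay.get(name) is not None
    }
    try:
        signature.bind(**kwargs)  # raises TypeError when a required argument is absent
    except TypeError as exc:
        return {
            "abstain_recommended": True,
            "abstain_reason": "missing_clinician_supplied_inputs",
            "detail": str(exc),
        }
    return tool(**kwargs)


def toy_creatinine_ratio(creatinine_baseline: float, creatinine_current: float) -> dict:
    """Hypothetical stand-in for a tool that needs two lab values."""
    return {"ratio": creatinine_current / creatinine_baseline}


complete = dispatch(toy_creatinine_ratio,
                    {"creatinine_baseline": 1.0, "creatinine_current": 2.0, "ph": 7.3})
partial = dispatch(toy_creatinine_ratio, {"creatinine_current": 2.0})
print(complete, partial["abstain_reason"])
```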

LOINC overlay coverage for AKI staging, DKA severity, contrast safety, preeclampsia, UGIB Glasgow-Blatchford, hepatitis Maddrey, myeloma ISS, MDS IPSS-R, sepsis SOFA / APACHE II, and others. Roughly 25 tool families now read their structured kwargs from the bundle.

Smart scribe extraction: when a chart-attached DocumentReference carries SOAP sections (HPI / PMH / PSH / FH / SH / ROS / PE / labs / imaging / hospital course / discharge disposition / patient instructions / follow-up plan / consultation reason), the four scribe tools read them verbatim before falling back to the abstain path. Strict header regexes ensure that, for example, "PE\n" matches the Physical Exam header but not "Pe" inside "Penicillin".
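The strict-header idea can be sketched with a case-sensitive, line-anchored pattern. The regex below is an illustrative guess at the approach, not the project's actual pattern; it covers only four of the listed headers, and the function name and sample note are invented.

```python
import re

# A header must be upper-case, alone on its line, optionally followed by a colon.
HEADER_RE = re.compile(r"^(HPI|PMH|ROS|PE):?[ \t]*$", re.MULTILINE)


def split_sections(note: str) -> dict[str, str]:
    """Map each recognised header to the verbatim text that follows it."""
    matches = list(HEADER_RE.finditer(note))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[match.group(1)] = note[match.end():end].strip()
    return sections


note = "HPI\nRash after Penicillin.\nPE\nDiffuse urticaria.\n"
print(split_sections(note))
```

Anchoring to a whole line is what stops "Penicillin" or "PERRLA" from being read as the Physical Exam header, and a note with no recognised headers returns an empty mapping so the caller can fall back to the abstain path.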

Calibrated abstention as first-class system posture: optional workflow steps that abstain no longer flip the workflow-level abstain flag. They remain enumerated in abstained_steps with an optional: true marker so the chat-LLM can present "workflow completed; N optional steps skipped" rather than "abstained" when the gap is non-blocking.
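The workflow-level rule can be sketched as a small aggregation over step results. The abstained_steps key, the optional marker, and the two message texts come from this document; the function name, the input shape, and the step names are assumptions for illustration.

```python
def summarise_workflow(steps: list[dict]) -> dict:
    """Aggregate step results; only required steps flip the workflow-level flag."""
    abstained_steps = [
        {"step": s["name"], "reason": s["abstain_reason"], "optional": s.get("optional", False)}
        for s in steps
        if s.get("abstain_reason")
    ]
    blocking = [s for s in abstained_steps if not s["optional"]]
    if blocking:
        banner = (f"ABSTAINED on {len(blocking)} step(s) - "
                  "DO NOT present numeric outputs as final estimates")
    elif abstained_steps:
        banner = f"workflow completed; {len(abstained_steps)} optional steps skipped"
    else:
        banner = None
    return {"abstained": bool(blocking), "abstained_steps": abstained_steps, "banner": banner}


# Hypothetical step names; the abstain reason is the one quoted for the PGx tool.
result = summarise_workflow([
    {"name": "risk_band", "abstain_reason": None},
    {"name": "pgx_check", "abstain_reason": "no_genotypes_supplied", "optional": True},
])
print(result["abstained"], result["banner"])
```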

Encounter id resolution from fullUrl for transaction-bundle ingest. FHIR transaction Bundles use POST + urn:uuid:<uuid> fullUrls; pre-server-commit the resource has no canonical id. We fall back to the trailing UUID, which is a stable handle within the bundle even though a server is free to assign a different id on commit.
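The fallback can be sketched as a small resolver over one Bundle entry. The function name is hypothetical, and normalising the UUID to its canonical lower-case form is a choice made for this sketch rather than something the document specifies.

```python
import uuid
from typing import Optional

URN_PREFIX = "urn:uuid:"


def resolve_resource_id(entry: dict) -> Optional[str]:
    """Prefer resource.id; otherwise fall back to the UUID in a urn:uuid fullUrl."""
    resource_id = entry.get("resource", {}).get("id")
    if resource_id:
        return resource_id
    full_url = entry.get("fullUrl", "")
    if not full_url.startswith(URN_PREFIX):
        return None
    try:
        # Validates the trailing value and returns its canonical string form.
        return str(uuid.UUID(full_url[len(URN_PREFIX):]))
    except ValueError:
        return None


entry = {
    "fullUrl": "urn:uuid:61ebe359-bfdc-4613-8bf2-c5e300945f0a",
    "resource": {"resourceType": "Encounter"},
    "request": {"method": "POST", "url": "Encounter"},
}
print(resolve_resource_id(entry))
```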

🟡 Open / out of scope for v1.0 (documented as next-priority)

Tool-list scaling on context-constrained clients. The full 145-tool catalog ships as a single tools/list response (~210 KB of JSON schema). Claude 4.x clients handle this without quality degradation; Gemini 2.5 Pro starts losing tool-selection precision around 130 tools; smaller / older models (GPT-3.5, Llama-3-8B) may exceed their tool-count cap. The architecture already supports per-bundle subscription via the register(mcp) shim per bundle - exposing a ?bundle= query param on tools/list is the documented escape hatch, mechanical to add but not shipped in v1.0.

compute_treatment_selection handler library: 4 conditions shipped (atrial-fib anticoagulation, pre-op bridging, HF reduced EF, DM2 second-line). The harness flagged rheumatoid-arthritis as not-in-database; expanding to ~20 conditions is mechanical work but real clinical content review per condition.

PGx genotypes: real genotype data requires a SMART-on-FHIR consent flow plus an Observation resource with the GENO category. Today the tool abstains with no_genotypes_supplied; the workflow declaring it required is a legitimate clinical abstain, not a bug.

RECIST lesion table: same shape - real-time clinician input, not derivable from a bundle.

Multi-bundle SOAP coverage: Marcus is the only demo patient with a complete SOAP DocumentReference. Eleanor is intentionally the negative-example case (scribe tools abstain with missing_clinician_inputs); extending to Nadia + Sofia is one generator script away.

Live federated learning: today's federated path is calibration on a 5-site simulation; live cross-institution training requires a DUA + IRB at each site. Documented in docs/PROSPECTIVE_STUDY_PROTOCOL.md.

4 NETDEP tools in the offline test harness: compute_resolve_active_meds, compute_fetch_patient_documents, ground_claim, compute_readmission_risk instantiate get_fhir_client() directly rather than through the monkey-patched fetch_patient_bundle, so the in-memory test harness sees a ClientConnectorError to the fixture URL. Production unaffected (real SHARP context). Out-of-scope clean-up.

Clinical validation against real-EHR data: the binding open item per the FDA SaMD analysis. A prospective cohort study at ≥1 institution with n≥1000 paired clinician judgements is required and out of scope for a research prototype.

What's next (alpha)

Expand compute_treatment_selection handler library to 20 conditions with literature-grounded recommendations, closing the largest tool-database gap surfaced by the harness.
SMART-on-FHIR consent flow for PGx genotype + RECIST lesion ingestion, closing the two largest "real-time clinician input" abstain families and turning two ABSTAIN-by-design workflows into PASS workflows.
Continuous calibration loop: nightly recalibration of the LACE coefficients against streaming FHIR bundles, with drift detection (ADWIN + DDM) gating the swap.
Cross-engine arbitration: a second TrustedRisk Tools instance running a different bundle catalog should be invokable as a peer from the same caller. Architecture supports it; we have not exercised it.