TrustedRisk Care Engine

Inspiration

A clinician's question almost never lives inside one specialty. A patient with worsening shortness of breath needs ED triage, deterioration scoring, a differential diagnosis, a polypharmacy check, and an admission H&P drafted before the team rounds again. Five specialist contexts share the same chart, and a wrong call in any of them contaminates the others.

The clinical consequences of fragmented coordination are documented and large. Communication failures during transitions of care are implicated in roughly 80% of serious medical errors that reach the patient (Joint Commission Sentinel Event Alert 58, 2017). Each cross-specialty handoff loses an estimated 12-15% of clinically relevant context (Starmer et al., NEJM 2014, I-PASS trial). Coordination overhead consumes roughly 30% of the inpatient physician's day (Tipping et al., J Hosp Med 2010). Diagnostic errors affect roughly 12 million US adults per year, with a meaningful share traceable to incomplete cross-specialty consultation (Singh et al., BMJ Qual Saf 2014).

Most healthcare-agent demos answer this with a flat tool surface plus whatever chat LLM is in front of it. The LLM hallucinates one wrong call and contaminates the whole answer; there is no narrated trace of which specialist was consulted, no abstain primitive that propagates from one tool to another, no deterministic floor when the model is slow, unavailable, or simply confused by a clinically ambiguous prompt. A pure rule-based router can keyword-match a synonym list, but cannot understand "she's been thinking about ending things and her INR is 5" as a parallel fan-out across crisis triage and anticoagulation review. Generative AI in the form of intent recognition, parameter resolution, and narrated dispatch is what makes structured multi-specialty coordination tractable; a deterministic floor below it is what makes the system safe to ship.

The bet behind TrustedRisk Care Engine is that orchestration intelligence belongs in the federation, not in the chat caller. The clinician asks one question; the orchestrator picks a workflow from a 257-entry catalog, dispatches it across 16 specialist sub-agents over real A2A v1 HTTP, narrates which specialist it consulted at each hop, and returns one structured answer assembled from each specialist's typed artefact. The deterministic keyword router below the LLM dispatcher means the engine never silently fails; the per-step audit-chain entry means a regulator can replay every consultation step a year later.

What it does

A free-form clinical prompt plus a PO FHIR-context extension hits the orchestrator's A2A v1 endpoint at POST /a2a/trustedrisk-orchestrator/. The dispatcher reads the prompt and picks one of three execution shapes. A prebuilt workflow when the prompt names a clinical scenario the catalog covers - "use TrustedRisk to get a complete sepsis pipeline on this patient" resolves to complete_sepsis_pipeline, a macro that runs sepsis_workup, then antibiotic_stewardship, then transitional_care_management against the same patient context, producing a per-step trace across four specialists. A single specialist consultation when the prompt maps cleanly to one skill - "insurance denied her infliximab; push back" routes directly to trustedrisk-appeals for an appeal-letter draft. A parallel fan-out when the prompt is genuinely multi-faceted - "she's been thinking about ending things and her INR is 5" produces a fan-out across the mental-health crisis specialist (C-SSRS scoring + admission decision) and the discharge specialist (warfarin DDI scan).

The dispatcher is two-layered. Layer 1 - LLM ranker (Anthropic Claude Haiku 4.5 by default through LiteLLM, swappable via ADK_MODEL) reads the catalog and returns 1-3 picks with one-line rationale, in a strict KIND|TARGET|RATIONALE line format the parser validates before execution. A 2.5-second wall-clock cap keeps the dispatcher under the chat-client soft timeout. Layer 2 - deterministic keyword router scans the prompt against ~60 workflow-id keyword tuples and the per-specialist route table; the longest match wins. When the LLM is unavailable, slow, or returns garbage, the engine falls through cleanly to the keyword path.
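The two-layer fall-through can be sketched as follows. This is a minimal illustration, not the engine's code: the kind labels in VALID_KINDS, the Pick type, the rank_with_llm callable, and the shape of the keyword table are all assumptions, and "longest match wins" is interpreted here as the longest matching keyword string.

```python
import asyncio
from typing import NamedTuple, Optional


class Pick(NamedTuple):
    kind: str
    target: str
    rationale: str


# Hypothetical labels for the three execution shapes; the real labels may differ.
VALID_KINDS = {"WORKFLOW", "SPECIALIST", "FANOUT"}


def parse_picks(raw: str, known_targets: set) -> list:
    """Parse KIND|TARGET|RATIONALE lines; any malformed line rejects the whole reply."""
    picks = []
    for line in raw.strip().splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            return []
        kind, target, rationale = parts
        if kind not in VALID_KINDS or target not in known_targets or not rationale:
            return []
        picks.append(Pick(kind, target, rationale))
    return picks if 1 <= len(picks) <= 3 else []


def keyword_route(prompt: str, table: dict) -> Optional[str]:
    """Deterministic floor: scan every keyword tuple, keep the longest keyword that matches."""
    text = prompt.lower()
    best_len, best_target = 0, None
    for target, keywords in table.items():
        for kw in keywords:
            if kw in text and len(kw) > best_len:
                best_len, best_target = len(kw), target
    return best_target


async def dispatch(prompt: str, rank_with_llm, table: dict, timeout_s: float = 2.5) -> list:
    """Try the LLM ranker under a wall-clock cap, then fall through to keywords."""
    try:
        raw = await asyncio.wait_for(rank_with_llm(prompt), timeout=timeout_s)
        picks = parse_picks(raw, set(table))
    except Exception:  # timeout, provider outage, key exhaustion
        picks = []
    if picks:
        return picks
    target = keyword_route(prompt, table)
    return [Pick("WORKFLOW", target, "keyword fallback")] if target else []
```

Rejecting the whole reply on a single malformed line is a deliberately strict choice in this sketch: a partially valid LLM answer is treated the same as garbage, so the keyword path decides.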

Three concrete capabilities the federation delivers fall outside what rule-based software can do. Free-form prompt to multi-specialist dispatch - parsing "use TrustedRisk to get a complete sepsis pipeline on this patient" into a macro id + the FHIR context binding, or "she's been thinking about ending things and her INR is 5" into a parallel fan-out across the crisis specialist (C-SSRS) and the discharge specialist (warfarin DDI scan), requires LLM intent recognition; a keyword router catches synonyms but never the second prompt's dual-axis intent. Narrated dispatch as audit primitive - every routing decision returns a one-line clinician-readable rationale ("recognised as complete_sepsis_pipeline macro because prompt names the protocol explicitly") that lands in the audit chain entry per dispatch; rule-based routers don't explain themselves, LLMs do naturally, and the orchestrator captures both the decision and its rationale alongside the per-step trace. Two-layer LLM-plus-deterministic dispatcher - pure LLM dispatchers fail open (silent nothing on key exhaustion), pure keyword routers fail closed (only known scenarios match); the composition gives robustness to LLM provider outages and to adversarial prompts that bypass the LLM ranker, because the keyword router still matches the structured part of the prompt and routes deterministically.

The three layers of the workflow registry

The 257-entry catalog is composed in three layers.

Base workflows (58)

Hand-crafted clinical recipes. Each declares 2-5 step tools and either runs them in parallel (when the protocol is step-independent, like a sepsis bundle) or chains them on prior-step outputs (when the protocol gates downstream tools on upstream decisions). Real chains where the engine depends on upstream output: AKI staging -> dialysis-initiation decision (${steps.staging.output.aki_stage} flows into the dialysis evaluator); HEART -> GRACE -> TIMI -> ACS disposition (4-step chain where the disposition tool reads all three risk scores and returns a STEMI cath / NSTE-ACS observation / discharge verdict that cannot contradict the inputs); empiric antibiotic -> ID consult letter (consultation_reason chains on ${steps.antibiotic.output.rationale}); PA evidence -> payer rules -> letter draft + approval likelihood (4-step DAG).
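To make the chain references concrete, here is a minimal sketch of how a ${steps.<id>.output.<field>} placeholder could be resolved against prior-step outputs. The function names and the exact grammar accepted by the regular expression are assumptions; only the placeholder shape is taken from the examples above.

```python
import re

# Matches ${steps.<step_id>.output.<dotted.path>}
_REF = re.compile(r"\$\{steps\.([A-Za-z0-9_]+)\.output\.([A-Za-z0-9_.]+)\}")


def _lookup(outputs: dict, step_id: str, path: str):
    """Walk a dotted path inside one step's output dict."""
    value = outputs[step_id]
    for key in path.split("."):
        value = value[key]
    return value


def resolve(value, outputs: dict):
    """Recursively substitute chain references inside a step's input spec."""
    if isinstance(value, dict):
        return {k: resolve(v, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, outputs) for v in value]
    if not isinstance(value, str):
        return value
    whole = _REF.fullmatch(value)
    if whole:
        # A lone reference keeps the upstream type (int, dict, ...).
        return _lookup(outputs, whole.group(1), whole.group(2))
    # A reference embedded in text is interpolated as a string.
    return _REF.sub(lambda m: str(_lookup(outputs, m.group(1), m.group(2))), value)
```

A reference to a step that has not run raises KeyError in this sketch, which a caller could translate into an abstain rather than a crash.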

Macro care-arcs (50)

A macro is a workflow whose every step is a sub_workflow_id reference. The execution engine recurses into the named sub-workflow and returns a flat {inner_step_id: output_dump} dict so downstream chain references like ${steps.<hop>.output.<inner_step>.<field>} keep navigating cleanly. Examples: complete_chf_admission, complete_sepsis_pipeline, surgical_full, pa_with_preemptive_appeal, mh_arc, ob_full, peds_full, pop_health_full, anticoag_lifecycle, oncology_arc. Every macro is chained by construction: hop_2's patient is the same patient hop_1 just operated on, hop_3 sees the chart hop_2 documented. The recursive backend forwards the FHIR-context and the resolved input dict on each hop.
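The recursion described above might look like the sketch below. The registry layout, the tool callables, and the synchronous execution are simplifying assumptions made for illustration; the tool and step names in the usage are hypothetical.

```python
def run_workflow(workflow_id: str, registry: dict, tools: dict, context: dict) -> dict:
    """Run one workflow and return {step_id: output}.

    A step carrying sub_workflow_id is a macro hop: the engine recurses and
    stores the inner workflow's {inner_step_id: output} dict as the hop output.
    """
    results = {}
    for step in registry[workflow_id]["steps"]:
        if "sub_workflow_id" in step:
            # The same patient context is forwarded unchanged on every hop.
            results[step["id"]] = run_workflow(
                step["sub_workflow_id"], registry, tools, context
            )
        else:
            results[step["id"]] = tools[step["tool"]](context)
    return results


# Hypothetical two-level registry: one macro wrapping one base workflow.
REGISTRY = {
    "sepsis_workup": {"steps": [{"id": "lactate", "tool": "read_lactate"}]},
    "complete_sepsis_pipeline": {
        "steps": [{"id": "hop_1", "sub_workflow_id": "sepsis_workup"}]
    },
}
TOOLS = {"read_lactate": lambda ctx: {"patient": ctx["patient_id"], "value": 2.5}}
```

With this layout a downstream reference such as ${steps.hop_1.output.lactate.value} navigates hop, then inner step, then field, which is the property the flat return dict is meant to preserve.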

Parametric variants (149)

A factory clones a base workflow with a new id and an input overlay that targets the axes a clinical decision actually depends on: payer, age band, severity, specialty, discharge disposition, outreach channel, chronic condition, days post-discharge, anticoagulant, vaccine. The parametric layer exists because the same "ED chest pain workup" produces a different rationale, consult target, and follow-up depending on whether the HEART score is low, moderate, or high. Variants make the difference visible in the catalog instead of hiding it inside a runtime conditional.
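A variant factory of this kind can be sketched in a few lines. The Workflow dataclass, the id-suffix convention, and the example workflow id are assumptions for illustration, not the catalog's actual schema.

```python
import copy
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Workflow:
    id: str
    steps: tuple
    inputs: dict = field(default_factory=dict)


def make_variant(base: Workflow, suffix: str, overlay: dict) -> Workflow:
    """Clone a base workflow under a new id; overlay keys win over base inputs."""
    merged = {**copy.deepcopy(base.inputs), **overlay}
    return replace(base, id=f"{base.id}__{suffix}", inputs=merged)


def expand(base: Workflow, axis: str, values: list) -> list:
    """Produce one catalog entry per value along a single clinical axis."""
    return [make_variant(base, f"{axis}_{v}", {axis: v}) for v in values]
```

Because each variant is a separate catalog entry with its own id, the dispatcher can pick it by name and the audit trail records which axis value was in force, rather than leaving that to a conditional inside the tool.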

Total: 257 workflows (58 + 50 + 149), 146 of them data-dependent chains (roughly 57%).

Real A2A between sub-agents

The "agent composition" claim has to hold up at runtime, not just in the agent cards. TRUSTEDRISK_COMPOSER_BACKEND controls the execution mode. With inprocess (the default), tool callables are imported from mcp_server.tools.* and awaited directly. With a2a, each step becomes a tools/call JSON-RPC over HTTP via FastMCP's StreamableHttpTransport against the specialist's MCP endpoint, with SHARP context propagating from the orchestrator's bound ContextVar through the _headers() indirection in the A2A backend. The orchestrator and the specialists are then genuinely separate agents communicating through the same A2A v1 surface a marketplace consumer would see. The HTTP path is exercised by an integration test (tests/integration/test_a2a_backend.py) that boots the discharge specialist on a random local port via uvicorn, points the A2A backend at it, runs tools/call for detect_phi end-to-end, and asserts the redaction-map round-tripped through real sockets.
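For readers unfamiliar with the wire format, the sketch below shows the backend switch and the JSON-RPC 2.0 envelope that an MCP tools/call request uses. It deliberately stops short of the transport: select_backend and build_tools_call are hypothetical helper names, and the "text" argument passed to detect_phi is an assumed parameter, not the tool's documented schema.

```python
import itertools
import json
import os

_request_ids = itertools.count(1)


def select_backend(env=None) -> str:
    """Read the execution mode; inprocess is the default per the text above."""
    env = os.environ if env is None else env
    mode = env.get("TRUSTEDRISK_COMPOSER_BACKEND", "inprocess").strip().lower()
    if mode not in {"inprocess", "a2a"}:
        raise ValueError(f"unknown composer backend: {mode!r}")
    return mode


def build_tools_call(tool_name: str, arguments: dict) -> dict:
    """Build the JSON-RPC 2.0 body of an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumed argument name, for illustration only.
body = json.dumps(build_tools_call("detect_phi", {"text": "chart note"}))
```

In the a2a mode a client library would send this body for you; building it by hand here only makes visible what crosses the socket between orchestrator and specialist.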

Calibrated abstention as system posture

The 26 ABSTAIN cases out of 150 in the harness are not deficiencies - they are the system's clinical commitment. Abstention takes three shapes. OOD calibration: LACE plateau in the Beta-Binomial posterior; the numeric mean is still computed for transparency but flagged "NOT calibrated for this patient". Missing clinician input: tools that require values not derivable from a FHIR bundle (NIHSS item scores, RECIST lesion table, PGx genotypes, FAQ question text) abstain rather than fabricate. A top-of-message banner ("ABSTAINED on N step(s) - DO NOT present numeric outputs as final estimates") keeps the chat-LLM from silently rendering a number as if it were calibrated. Optional-step skip: a workflow's accessory step that abstains is reported separately ("skipped optional step for transparency") and does NOT flip the workflow-level abstain flag, so the rest of the bundle still completes and the gap is visible.
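The aggregation rule for these shapes can be sketched as below. StepResult and summarize are hypothetical names, and counting only non-optional steps in the banner's N is an assumption; the text does not say whether optional skips are included in that count.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step_id: str
    abstained: bool = False
    optional: bool = False
    reason: str = ""


def summarize(results: list) -> dict:
    """Roll step-level abstentions up into a workflow-level posture."""
    abstained = [r for r in results if r.abstained]
    blocking = [r for r in abstained if not r.optional]
    banner = None
    if blocking:
        banner = (
            f"ABSTAINED on {len(blocking)} step(s) - "
            "DO NOT present numeric outputs as final estimates"
        )
    return {
        # Only a non-optional abstention flips the workflow-level flag.
        "workflow_abstained": bool(blocking),
        # Every abstention stays enumerated so the gap remains visible.
        "abstained_steps": [
            {"step_id": r.step_id, "reason": r.reason, "optional": r.optional}
            for r in abstained
        ],
        "banner": banner,
    }
```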

The 6 workflow-level ABSTAIN are all clinically legitimate: caregiver_handoff_prep (handoff IS the workflow's purpose; without upstream DecisionCard it can't fabricate one); pgx_prescribing_check + transplant_immunosuppression_review (no PGx genotype data in the bundle); oncology_cycle_review (no RECIST lesion table); rheumatology_followup (condition not in the compute_treatment_selection handler library); pre_op_optimization for Eleanor (deliberate negative example - Eleanor's bundle has no DocumentReference; scribe tools abstain on missing_clinician_inputs).

Why it matters - literature-anchored impact hypotheses

The pain points TrustedRisk Care Engine targets are documented coordination, time, and safety problems. Each one below pairs the published evidence with a hypothesis that can be tested against it.

Sepsis pipeline coordination time. Each hour of delay in sepsis-bundle initiation increases mortality by approximately 4% (Seymour et al., NEJM 2017). The complete_sepsis_pipeline macro composes sepsis_workup + antibiotic_stewardship + transitional_care_management across four specialists in a single dispatch. Hypothesis: the multi-specialty handoffs that today take a paged consultant + a chart review + a callback collapse into seconds of wall-clock with cite-back IDs preserving every step's rationale, addressing the bundle-initiation delay that the rule-based "page-and-wait" workflow structurally cannot.

Handoff communication errors. Communication failures during transitions are implicated in roughly 80% of serious medical errors that reach the patient (Joint Commission Sentinel Event Alert 58, 2017). The narrated dispatch trail - every specialist hop returns its rationale + the FHIR resources it consumed + the typed artefact it produced - replaces the verbal handoff with a structured one. Hypothesis: the I-PASS-style structured handoff (Starmer 2014 NEJM) becomes machine-generated from the federation trace, removing the most error-prone coordination step rather than relying on a tired clinician at 3 AM to get it right verbally.

Prior-authorisation burden. 94% of physicians report PA-driven care delays; 33% report serious adverse events from PA delays (AMA Prior Authorization Physician Survey 2022). The pa_with_preemptive_appeal macro composes evidence pack + payer-rules match + letter draft + appeal-likelihood + escalation path in one orchestrator dispatch. Hypothesis: PA cycle time drops from days-to-weeks to minutes, with the appeal letter pre-drafted before the denial arrives, and the audit-chain entry per dispatch documents medical necessity in a form a payer reviewer cannot dismiss as "unstructured".

Multi-axis crisis triage. Suicide is the 11th leading cause of death in the US (CDC WISQARS 2022); SAMHSA 988 routinely saturates within 30 seconds of an inbound call. The mh_arc macro fans out across the mental-health crisis specialist (C-SSRS scoring + admission decision) and the discharge specialist (medication-interaction scan including SSRI / TCA / lithium toxicity vectors). Hypothesis: parallel fan-out catches dual-axis risks (suicidality + drug interactions) that single-agent systems miss because the prompt was routed to one queue and the second axis never reached a specialist who could see it.

Adoption barriers in agentic clinical AI. Recent surveys put hospital adoption of FDA-cleared AI clinical tools below 30% on average (NEJM AI 2024) with audit absence and integration friction cited among the top reasons. The federation's posture - A2A v1 surface with .well-known/agent-card.json per specialist, marketplace manifest with explicit per-bundle ownership, narrated dispatch + per-step audit chain - directly targets the audit + integration dimensions of that adoption gap.

Compliance, safety, validation

Standards adherence. A2A v1 specification (Google ADK 2.x) with agent cards published at /.well-known/agent-card.json per specialist and the orchestrator. JSON-RPC over HTTP tools/call against each specialist's MCP endpoint via FastMCP StreamableHttpTransport. SHARP-on-MCP context propagation across the federation (every specialist hop receives the originating FHIR context unchanged). HL7 FHIR R4 native consumer with SMART-on-FHIR launch flow (PKCE + state + nonce). CDS Hooks v1.1 service. OpenAPI 3.1 spec for the orchestrator endpoint with dual security schemes (OAuth client_credentials + API key).

Multi-tenant + multi-specialist isolation. 16 specialists run in independent processes (ports 8770-8784, composer 8765); one specialist's compromise does not propagate to peers, and the orchestrator validates each tools/call response against the typed Pydantic v2 schema before composing it into the trace. RFC 6749 §4.4 client_credentials grant with HS256 JWT and tenant-allowlist validation applied at the orchestrator layer; per-specialist endpoints inherit the SHARP context through the same propagation contract.
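The token check at the orchestrator layer can be illustrated with the standard library alone. This is a sketch under assumptions: the claim name "tenant", the PermissionError convention, and the helper names are hypothetical, and a production service would normally rely on a maintained JWT library rather than hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Issue a compact HS256 token (used here only to drive the example)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, allowed_tenants: set, now=None) -> dict:
    """Check signature, algorithm, expiry, and tenant allowlist; return the claims."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        sig = _b64url_decode(sig_b64)
    except ValueError as exc:  # bad segment count, base64, or JSON
        raise PermissionError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise PermissionError("malformed token")
    if header.get("alg") != "HS256":  # never trust a caller-chosen algorithm
        raise PermissionError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise PermissionError("token expired")
    if claims.get("tenant") not in allowed_tenants:  # assumed claim name
        raise PermissionError("tenant not allowed")
    return claims
```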

Privacy + audit. Presidio analyzer + anonymizer on every chart text input across all specialists. Append-only RFC 6962 Merkle audit chain entry per dispatch, per specialist hop, and per tool call (3-layer trace) with SHA-256 hashing and PHI-redaction before hashing, reproducibly replayable byte-identical from the snapshot. Differential-privacy publication of cross-specialist usage statistics (Laplace mechanism, configurable ε) suitable for marketplace consumption metrics without leaking patient routing patterns.
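As a reference point for the audit chain, the RFC 6962 Merkle tree hash over a list of entries can be written as below. The hashing rules (0x00 leaf prefix, 0x01 node prefix, split at the largest power of two smaller than n) follow the RFC; treating each entry as already PHI-redacted bytes is an assumption of this sketch, and the engine's storage layout is not shown.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over a 0x00 prefix plus the entry."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior hash: SHA-256 over a 0x01 prefix plus both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list) -> bytes:
    """Merkle tree hash of an ordered list of (already redacted) entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # k is the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))
```

The distinct leaf and node prefixes are what prevent an interior node from being passed off as a leaf, and replaying the same redacted entries in the same order always reproduces the same root.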

Regulatory artefact coverage. 14-section EU AI Act Annex IV regulatory pack at 100% artefact coverage applies to the federation as a composed device. FDA SaMD analysis covers both individual specialists (most Class II, acute-boundary subset Class III) and the orchestrator's composition layer. ISO 13485 §4 / §7 / §8 design-controls checklist with explicit gap to a Class II audit. Mitchell 2019 Model Card per specialist + Gebru 2021 Datasheet for the calibration cohort. NIST AI RMF 1.0 + OECD AI Principles + HIPAA §164 + GDPR Article 35 DPIA crosswalks. Federation-specific: per-specialist agent cards published to docs/federation/marketplace_manifest.json at 100% bundle-ownership coverage, verified by tests/unit/test_federation_registry.py.

Safety + adversarial robustness. Red-team v3 multi-target campaign exercises orchestrator + specialists; v4 indirect-prompt-injection corpus embedded in chart text (Observation.note / MedicationRequest.dosageInstruction.text / Patient.alias) propagates through the dispatch chain - every v4 case lands in safe posture. The two-layer dispatcher means an adversarial prompt that bypasses the LLM ranker still hits the deterministic keyword router and is contained or refused. ADWIN + DDM concept-drift detectors monitor per-specialist artefact distributions for shift.
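Of the two detectors named above, DDM is compact enough to sketch. The version below follows the usual formulation (track the running error rate p and its standard deviation s, remember the minimum of p + s, warn at two and signal drift at three minimum standard deviations above it). Feeding it a binary per-step signal, such as whether a step abstained, is an assumption about how the engine uses it, and this sketch omits the reset that normally follows a drift signal.

```python
import math


class DDM:
    """Minimal Drift Detection Method over a stream of binary error flags."""

    def __init__(self, min_samples: int = 30):
        self.min_samples = min_samples
        self.n = 0
        self.p = 0.0  # running error rate
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error: bool) -> str:
        self.n += 1
        self.p += (float(error) - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return "stable"  # not enough evidence yet
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s  # new best operating point
        level = self.p + s
        if level > self.p_min + 3 * self.s_min:
            return "drift"
        if level > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```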

Validation evidence. 4224 unit + integration + golden + adversarial tests passing, including the e2e A2A backend test (tests/integration/test_a2a_backend.py) that boots a real specialist on a random local port via uvicorn, points the A2A backend at it, runs tools/call for detect_phi end-to-end through real sockets, and asserts the redaction-map round-tripped. 124/150 e2e PASS / 26 ABSTAIN / 0 CRASH on the orchestrator harness (scripts/smoke_demo_full.py); all 50 macro care-arcs PASS; 100% federation registry coverage; 100% specialist-bundle ownership.

Honest take on what we resolved + what is still open

✅ Resolved during the v1.0 push

27 specialist routes that previously crashed with TypeError: missing N required positional argument when the dispatcher overlay didn't carry a kwarg the tool required. Root cause: Route.inputs={} on every entry, expecting the FHIR overlay to populate kwargs by name; for ~25 tools the required parameters were specific labs / scores not derivable generically. Fix in three parts: extended the overlay with LOINC-mapped lab values (creatinine baseline + current, pH, HCO3, glucose, AST/ALT, platelets, bilirubin, hemoglobin, WBC, BUN, INR, gestational age weeks); wrapped the dispatcher with TypeError -> structured abstain_reason="missing_clinician_supplied_inputs"; removed 5 population-level routes that were misconceived for single-patient bundles. Today: 0 CRASH on 150 e2e cases.
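The crash-to-abstain conversion can be sketched as follows. Note that this variant checks the tool's signature before calling it instead of catching TypeError afterwards, so that a TypeError raised inside a tool's own logic is not mistaken for a missing input; call_or_abstain and the example tool are hypothetical names.

```python
import inspect


def call_or_abstain(tool, overlay: dict) -> dict:
    """Call a tool with overlay kwargs, or abstain if required inputs are absent."""
    params = inspect.signature(tool).parameters
    # Pass only the overlay keys the tool actually declares.
    kwargs = {k: v for k, v in overlay.items() if k in params}
    missing = [
        name
        for name, p in params.items()
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
        and name not in kwargs
    ]
    if missing:
        return {
            "abstained": True,
            "abstain_reason": "missing_clinician_supplied_inputs",
            "missing": missing,
        }
    return {"abstained": False, "output": tool(**kwargs)}


def creatinine_ratio(creatinine_baseline: float, creatinine_current: float) -> dict:
    """Hypothetical tool: ratio of current to baseline creatinine."""
    return {"ratio": creatinine_current / creatinine_baseline}
```

Listing the missing parameter names in the abstain payload gives the clinician something actionable: the gap is named rather than hidden behind a stack trace.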

LOINC overlay coverage for AKI staging, DKA severity, contrast safety, preeclampsia, UGIB Glasgow-Blatchford, hepatitis Maddrey, myeloma ISS, MDS IPSS-R, sepsis SOFA / APACHE II, and others. Roughly 25 tool families now read their structured kwargs from the bundle.

Calibrated abstention as first-class system posture: optional workflow steps that abstain no longer flip the workflow-level abstain flag. They remain enumerated in abstained_steps with an optional: true marker so the chat-LLM can present "workflow completed; N optional steps skipped" rather than "abstained" when the gap is non-blocking.

Encounter id resolution from fullUrl for transaction-bundle ingest. FHIR transaction Bundles use POST + urn:uuid:<uuid> fullUrls; before the server commits the transaction, the resource has no canonical id. We fall back to the trailing UUID of the fullUrl, which is the only stable identifier the entry carries at that point (a server may assign a different id on commit).
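The fallback can be sketched as below. resolve_resource_id is a hypothetical helper name, and accepting a trailing UUID on non-urn fullUrls as well is an assumption of this sketch.

```python
import uuid
from typing import Optional


def resolve_resource_id(entry: dict) -> Optional[str]:
    """Prefer resource.id; otherwise fall back to the trailing UUID of fullUrl."""
    resource = entry.get("resource", {})
    if resource.get("id"):
        return resource["id"]
    full_url = entry.get("fullUrl", "")
    if full_url.startswith("urn:uuid:"):
        candidate = full_url[len("urn:uuid:"):]
    else:
        # e.g. https://host/fhir/Encounter/<uuid>
        candidate = full_url.rstrip("/").rsplit("/", 1)[-1]
    try:
        return str(uuid.UUID(candidate))  # validates and normalises the UUID
    except ValueError:
        return None
```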

Chart-authoritative clinical inputs across 25 Tier-1 tools. A class of tools (ED triage, ICU severity scores, peri-op risk, transplant dosing, pediatric / obstetric early-warning, scribe family, PA evidence pack, polypharmacy detector) previously accepted lab values, vital signs, demographics, and medication lists as caller-supplied parameters. A chat-side LLM could fabricate a CYP2C19=poor_metabolizer or a lactate=4.1 to satisfy the schema and the tool would compute on the fabricated input. The v1.0 hardening (src/mcp_server/tools/_chart_inputs.py) intercepts these inputs at the top of each tool: when SHARP-on-MCP context is bound, demographics come from Patient.birthDate / Patient.gender, vitals come from the most-recent LOINC-coded Observations, labs come from LOINC-coded Observations (creatinine 2160-0, lactate 32693-4, INR 6301-6, bilirubin 1975-2, ...), gestational age comes from pregnancy-flagged Conditions / Observations, and medications come from MedicationRequest / MedicationAdministration. Caller-supplied values for these parameters are discarded when SHARP is bound; offline / unit-test invocations preserve legacy caller-supplied behaviour. PGx is the strict variant of the same posture (chart-empty -> abstain with no_pgx_observations_in_chart); the broader Tier-1 family is non-strict (chart-empty -> caller value falls through, the tool's existing missing-input check fires). The strict variant is on the v1.1 roadmap for the remaining Tier-1 tools.
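The chart-over-caller rule can be sketched for a single lab value. The LOINC codes are the ones listed above; the function names, the context_bound flag standing in for a bound SHARP context, and the string comparison of effectiveDateTime (which assumes uniformly formatted timestamps) are simplifications for illustration.

```python
from typing import Optional

LOINC = {"creatinine": "2160-0", "lactate": "32693-4", "inr": "6301-6", "bilirubin": "1975-2"}


def latest_observation_value(bundle: dict, loinc_code: str) -> Optional[float]:
    """Return the most recent valueQuantity for a LOINC-coded Observation."""
    best = None
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Observation":
            continue
        codes = {
            c.get("code")
            for c in res.get("code", {}).get("coding", [])
            if c.get("system") == "http://loinc.org"
        }
        value = res.get("valueQuantity", {}).get("value")
        when = res.get("effectiveDateTime", "")
        if loinc_code in codes and value is not None:
            if best is None or when > best[0]:
                best = (when, value)
    return None if best is None else best[1]


def resolve_input(name, caller_value, bundle, context_bound, strict=False):
    """Chart value wins when context is bound; caller value is used only as described."""
    if not context_bound:
        return caller_value  # offline / unit-test behaviour is preserved
    chart_value = latest_observation_value(bundle, LOINC[name])
    if chart_value is not None:
        return chart_value  # caller-supplied value is discarded
    if strict:
        raise LookupError(f"no_{name}_observations_in_chart")
    return caller_value  # non-strict: falls through to the tool's own checks
```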

🟡 Open / out of scope for v1.0

Cross-institution federation: the A2A v1 surface is symmetric; an external specialist hosted by another institution can be added with one entry in _SPECIALIST_PORTS (or under TRUSTEDRISK_FEDERATION_BASE_URL/<slug>/mcp). What is missing is a federated discovery protocol that lets one institution's marketplace surface another's specialists without manual registration.

Runtime-billed observability: per-step duration + output sizes are emitted; wiring them into a billing meter (cost-per-decision rather than cost-per-tool) is the path to a SaaS pricing model. The deterministic-floor / LLM-augmented split is the natural unit.

Cross-engine arbitration: a second Care Engine running a different workflow catalog could publish its agent card and be consulted as a peer; today the orchestrator doesn't enumerate peer engines.

What's next (beta)

Cross-institution federated discovery (one institution's marketplace surfaces another's specialists without manual registration).
Runtime-billed observability (cost-per-decision pricing model).
Multi-engine arbitration (second Care Engine as peer in the orchestrator's catalog).
Live A2A clinical validation against a partner institution's FHIR bundle, with paired clinician judgement.
Treatment-selection handler library expansion (closes 2 of the 6 legitimate ABSTAIN).