Inspiration

The OECD published a 98-page working paper in March 2025 estimating that diagnostic errors cost the United States $870 billion per year — 17.5% of total healthcare expenditure. One of the most documented and preventable causes isn't misreading a scan. It's the follow-up that never happened.

  • 35.5% of pulmonary nodule patients receive no follow-up imaging (AHRQ / Brigham & Women's Hospital, 2025)
  • 26% of BI-RADS 3 breast findings — 386 out of 1,511 patients — have no documented closure (AHRQ Lacson Report, 2025)
  • 1 in 7 radiology recommendations never achieve documented follow-up (JAMA Network Open, 2022)
  • 44% of diagnostic errors involve a failure to follow up on a lab or imaging result (Schiff et al., cited in BMJ Open Quality, 2023)
  • The follow-up and coordination phase is a cause or contributor in 46% of severe and fatal diagnostic adverse events (PMC, March 2025)

The problem isn't that clinicians don't care. It's that no system is watching the loop.

Consider this scenario: a BI-RADS 4 mammogram is filed. The radiologist moves on. The referring clinician never gets a callback. The patient never hears back. Sixty-three days pass. A suspicious breast finding remains unresolved, and the window for earlier diagnosis narrows. Nobody lied, nobody was negligent — the loop simply was never watched.

Commercial platforms like Rad AI Continuity ($40M+ funded), Inflo Health, and Vital Guard are racing to solve this for large health systems — but none of them work inside an AI agent platform via MCP. ResultLoop is an MCP-native implementation of closed-loop abnormal result follow-up auditing, built to run inside an AI agent platform rather than as a standalone enterprise workflow system.

This is distinct from HEDIS quality measure gap programs, which track whether preventive screenings were ordered. ResultLoop audits the step after a critical finding has already been returned — whether the abnormal result has documented follow-up or is silently sitting open in the chart.

What it does

ResultLoop is a SHARP-on-MCP clinical safety server that audits patient FHIR records for abnormal lab and imaging results lacking documented follow-up closure. It works inside Prompt Opinion: any agent can invoke it against a patient's live FHIR record to discover open loops, rank them by risk, and draft closure actions for clinician review.

14 tools across 4 workflow stages:

DISCOVER

  • find_recent_abnormal_results — scans Observations and DiagnosticReports for abnormal/critical/panic-value flags
  • get_result_followup_context — searches ServiceRequest, Task, Communication, CarePlan, Encounter for closure evidence; returns OPEN_LOOP or CLOSED
  • get_patient_loop_history — detects longitudinal closure-failure patterns across 24 months (repeat misses)
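
A minimal sketch of the kind of flag check the discovery step describes, assuming abnormal results carry HL7 v3 ObservationInterpretation codes (H, L, A, HH, LL, AA) in Observation.interpretation. The type and function names here are hypothetical and are not taken from the ResultLoop codebase.

```typescript
// Minimal slice of a FHIR R4 Observation: only the fields this sketch reads.
interface CodingLike { system?: string; code?: string }
interface CodeableConceptLike { coding?: CodingLike[] }
interface ObservationLike {
  resourceType: "Observation";
  id: string;
  interpretation?: CodeableConceptLike[];
}

type AbnormalFlag = "critical" | "abnormal" | "none";

// Codes from the HL7 v3 ObservationInterpretation code system.
const CRITICAL_CODES = ["HH", "LL", "AA"];
const ABNORMAL_CODES = ["H", "L", "A"];

// Returns the strongest flag found across all interpretation codings.
function classifyInterpretation(obs: ObservationLike): AbnormalFlag {
  let flag: AbnormalFlag = "none";
  for (const concept of obs.interpretation ?? []) {
    for (const coding of concept.coding ?? []) {
      const code = coding.code ?? "";
      if (CRITICAL_CODES.indexOf(code) !== -1) return "critical";
      if (ABNORMAL_CODES.indexOf(code) !== -1) flag = "abnormal";
    }
  }
  return flag;
}

// Keeps only observations that carry an abnormal or critical flag.
function findAbnormalObservations(all: ObservationLike[]): ObservationLike[] {
  return all.filter((obs) => classifyInterpretation(obs) !== "none");
}
```

Observations with no interpretation element fall through as "none" in this sketch; a real scan would also need reference-range and DiagnosticReport handling, which is omitted here.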

EVIDENCE

  • get_clinical_guideline — searches PubMed E-utilities (esearch → esummary) for real PMID-cited clinical guidelines by finding type (BI-RADS, Lung-RADS, CA-125, etc.); returns journal metadata and direct PubMed URLs; no patient data sent to NCBI
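
The esearch → esummary pipeline can be sketched as two URL builders plus a parser for the esearch response. This is an illustration under assumptions: the endpoints and the db, term, id, retmax, and retmode parameters are the public NCBI E-utilities ones, but the helper names and the choice of search term are hypothetical. Only the finding type goes into the query, so no patient data is involved.

```typescript
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Step 1: esearch returns a list of PMIDs matching a free-text term.
function buildEsearchUrl(term: string, retmax: number = 5): string {
  const params = new URLSearchParams({
    db: "pubmed",
    term,
    retmax: String(retmax),
    retmode: "json",
  });
  return `${EUTILS_BASE}/esearch.fcgi?${params.toString()}`;
}

// Step 2: esummary returns journal metadata for the PMIDs found in step 1.
function buildEsummaryUrl(pmids: string[]): string {
  const params = new URLSearchParams({
    db: "pubmed",
    id: pmids.join(","),
    retmode: "json",
  });
  return `${EUTILS_BASE}/esummary.fcgi?${params.toString()}`;
}

// Pulls the PMID list out of a parsed esearch JSON body; empty on no results.
function extractPmids(esearchJson: unknown): string[] {
  const body = esearchJson as { esearchresult?: { idlist?: unknown } } | null;
  const idlist = body?.esearchresult?.idlist;
  if (!Array.isArray(idlist)) return [];
  return idlist.filter((id): id is string => typeof id === "string");
}

// Direct PubMed link for a PMID.
function pubmedLink(pmid: string): string {
  return `https://pubmed.ncbi.nlm.nih.gov/${pmid}/`;
}
```

Returning an empty list when idlist is missing is one way to get the graceful no-results behavior; the HTTP calls themselves are left to the caller.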

ANALYZE

  • rank_open_loops_by_severity — deterministic Closure Risk Index (CRI) using BI-RADS, Lung-RADS, HbA1c, eGFR, TSH, potassium, FIT, hemoglobin thresholds — no LLM touches the score
  • generate_result_closure_summary — structured clinical safety artifact with evidence matrix and per-result risk assessment
  • verify_closure_claim — hallucination guardrail; returns PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE
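
A minimal sketch of the tri-state idea behind verify_closure_claim, assuming a claim is checked against per-result status records produced by the discovery step. The record shapes and the function name are hypothetical. The point is that a claim with no matching evidence returns INSUFFICIENT_EVIDENCE rather than defaulting to PASS.

```typescript
type Verdict = "PASS" | "CONTRADICTION" | "INSUFFICIENT_EVIDENCE";
type LoopStatus = "OPEN_LOOP" | "CLOSED";

// What an agent asserts about a result.
interface ClosureClaim { resultId: string; claimedStatus: LoopStatus }
// What the chart evidence actually shows for a result.
interface EvidenceRecord { resultId: string; observedStatus: LoopStatus }

function verifyClosureClaim(claim: ClosureClaim, evidence: EvidenceRecord[]): Verdict {
  const match = evidence.find((e) => e.resultId === claim.resultId);
  // No evidence for this result: refuse to confirm or deny.
  if (!match) return "INSUFFICIENT_EVIDENCE";
  // Evidence exists: the claim either agrees with it or contradicts it.
  return match.observedStatus === claim.claimedStatus ? "PASS" : "CONTRADICTION";
}
```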

ACT

  • draft_followup_task_resource — draft-only FHIR R4 Task + AuditEvent with SHA-256 hash; never writes to the chart
  • write_followup_task — writes FHIR Task + Provenance to the chart; requires explicit clinicianApprovalConfirmed: true in the same conversation turn — blocked by parameter-level validation, not just prompt
  • draft_outreach_message — multilingual patient outreach (English / Spanish / Hindi) with 21st Century Cures Act framing
  • generate_audit_report — self-contained HTML clinical safety report — donut chart, severity cards, evidence matrix
  • write_audit_report_document_reference — writes the HTML audit report as a permanent FHIR DocumentReference (LOINC 11503-0, base64 embedded, linked Provenance) — survives session expiry, readable by any downstream EHR
  • write_detected_issue — writes a FHIR DetectedIssue (CAREGAP / care gap) for each open-loop result; severity mapped from CRI score; gated by explicit clinician approval
  • send_sms_outreach — sends clinician-approved patient outreach SMS via Twilio Programmable Messaging; gated by explicit clinicianApprovalConfirmed: true in the same conversation turn; E.164 phone validation; English (en) only on current Twilio trial account; never sends autonomously
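
The send-time checks described for send_sms_outreach could look roughly like the sketch below. The request shape, reason strings, and check order are assumptions made for illustration; the E.164 pattern, the 155-character cap, and the English-only restriction come from the description in this write-up.

```typescript
// E.164: a plus sign, a non-zero leading digit, at most 15 digits in total.
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;
const MAX_SMS_CHARS = 155;
const ENABLED_LANGUAGES = ["en"];

interface SmsRequest {
  to: string;
  body: string;
  language: string;
  clinicianApprovalConfirmed?: unknown;
}

interface SmsCheck { ok: boolean; reason?: string }

// Runs every gate before any network call would be made.
function validateSmsRequest(req: SmsRequest): SmsCheck {
  if (req.clinicianApprovalConfirmed !== true) {
    return { ok: false, reason: "APPROVAL_REQUIRED" };
  }
  if (ENABLED_LANGUAGES.indexOf(req.language) === -1) {
    return { ok: false, reason: "LANGUAGE_NOT_ENABLED" };
  }
  if (!E164_PATTERN.test(req.to)) {
    return { ok: false, reason: "INVALID_PHONE" };
  }
  // Length check only; this sketch does not verify the GSM-7 character set.
  if (req.body.length > MAX_SMS_CHARS) {
    return { ok: false, reason: "MESSAGE_TOO_LONG" };
  }
  return { ok: true };
}
```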

Safety by design: Action-oriented writes such as Task, DetectedIssue, and SMS outreach require explicit clinician approval in the same conversation turn. Audit report persistence requires a valid generated reportId and writes a DocumentReference + Provenance record. ResultLoop identifies and audits — it never diagnoses, never treats, never overrides clinical judgment.
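
One way to enforce the approval requirement at the parameter level is a wrapper that refuses to run the write unless the flag is the boolean true. The sketch below is illustrative only: withApprovalGate and its result shape are hypothetical names, and it does not model how "same conversation turn" is established.

```typescript
interface GatedArgs { clinicianApprovalConfirmed?: unknown }

interface GateResult<T> {
  status: "WRITTEN" | "BLOCKED";
  result?: T;
  reason?: string;
}

// Runs `write` only when approval is the literal boolean true.
// Strings such as "true", numbers, and missing values are all rejected.
function withApprovalGate<T>(args: GatedArgs, write: () => T): GateResult<T> {
  if (args.clinicianApprovalConfirmed !== true) {
    return {
      status: "BLOCKED",
      reason: "clinicianApprovalConfirmed must be the boolean true",
    };
  }
  return { status: "WRITTEN", result: write() };
}
```

Because the check sits in code rather than in the prompt, an agent that is talked into calling the tool without approval still gets a BLOCKED result and the write callback never runs.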

How we built it

  • Runtime: TypeScript + Node.js on Express
  • MCP transport: Streamable HTTP (SHARP-on-MCP, ai.promptopinion/fhir-context extension declared in capabilities)
  • FHIR R4: reads Observation, DiagnosticReport, ServiceRequest, Task, Communication, CarePlan, Encounter; writes Task + Provenance + DocumentReference (LOINC 11503-0, base64 HTML audit report) + DetectedIssue (CAREGAP, severity=high/moderate/low). Task and DetectedIssue writes require explicit clinician approval; DocumentReference persistence requires a valid generated reportId and linked Provenance.
  • PubMed: NCBI E-utilities two-step pipeline (esearch → esummary); real PMIDs verified against live PubMed; no patient data transmitted to NCBI
  • Deterministic scoring: Closure Risk Index (CRI) — totalPriorityScore = severityScore + timelinessScore; severity classified from BI-RADS / Lung-RADS / lab thresholds (CRITICAL→90, HIGH→60, MODERATE→30, LOW→10); timeliness step-scored by days since result (≥45d→30, ≥30d→20, ≥14d→10); no LLM touches the score
  • Integrity: SHA-256 closure artifact hash on every audit; Provenance chain on every FHIR write
  • Patients: 5 synthetic FHIR transaction bundles (Maria Lopez, James Chen, Sarah Williams, Robert Johnson, Priya Patel) across oncology, nephrology, endocrinology, cardiology, pulmonology
  • Deployment: Render — live at https://resultloop-mcp.onrender.com/mcp
  • Twilio SMS: native fetch-based Twilio Programmable Messaging integration (no SDK); HTTP Basic auth; E.164 phone validation; 155-char GSM-7 hard cap (single segment on Twilio trial); same-turn clinician approval gate enforced at parameter level; English-only SMS delivery (ES/HI drafts generated but blocked at send-time on trial account)
  • Validation: 17 labeled test scenarios across 5 synthetic patients — 50 assertions, zero false closures, all safety gates verified
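
The Closure Risk Index arithmetic from the scoring bullet above can be sketched as follows. The severity and timeliness values are the ones listed in this write-up; the score of 0 for results under 14 days old, the input shape, and the function names are assumptions made for illustration.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MODERATE" | "LOW";

const SEVERITY_SCORE: Record<Severity, number> = {
  CRITICAL: 90,
  HIGH: 60,
  MODERATE: 30,
  LOW: 10,
};

// Step function on days since the result was returned.
function timelinessScore(daysSinceResult: number): number {
  if (daysSinceResult >= 45) return 30;
  if (daysSinceResult >= 30) return 20;
  if (daysSinceResult >= 14) return 10;
  return 0; // assumption: the write-up lists no score below 14 days
}

function totalPriorityScore(severity: Severity, daysSinceResult: number): number {
  return SEVERITY_SCORE[severity] + timelinessScore(daysSinceResult);
}

interface OpenLoop { resultId: string; severity: Severity; daysSinceResult: number }
interface RankedLoop extends OpenLoop { totalPriorityScore: number }

// Highest score first; ties keep their input order (Array.prototype.sort is stable).
function rankOpenLoops(loops: OpenLoop[]): RankedLoop[] {
  return loops
    .map((loop) => ({
      ...loop,
      totalPriorityScore: totalPriorityScore(loop.severity, loop.daysSinceResult),
    }))
    .sort((a, b) => b.totalPriorityScore - a.totalPriorityScore);
}
```

For example, a HIGH-severity result that has been open for 63 days scores 60 + 30 = 90, and the same inputs give the same score on every call because no model is involved.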

Challenges we ran into

  • Per-result closure isolation: A follow-up ServiceRequest for one condition must not count as closure evidence for a different abnormal result. Built a custom result-linkage resolver (result-linkage.ts) that deep-walks all reference values in a FHIR resource JSON and matches by result ID, combined with code-term matching on LOINC display names — so a Task linked to a mammogram DiagnosticReport does not count as closure for an unrelated CA-125 Observation. This is one of the hardest engineering problems in the domain — and a major reason funded commercial platforms invest heavily in closed-loop follow-up infrastructure.
  • Deterministic vs. LLM ranking: Initial severity ranking used LLM judgment, which varied run-to-run. Replaced with the deterministic CRI formula so ranking is reproducible, citable, and audit-safe.
  • Approval gate integrity: write_followup_task must never fire without explicit approval — enforced at both the system-prompt level and parameter level (clinicianApprovalConfirmed: boolean). Tested with an adversarial "write it anyway" prompt — correctly blocked.
  • Visual audit report: Generated a self-contained HTML report with live donut chart and severity cards from pure MCP tool output, served directly from the Express server — no frontend build step.
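
The reference deep-walk described under per-result closure isolation can be sketched as a recursive scan for FHIR reference strings. This is not the contents of result-linkage.ts: the names are hypothetical, and the code-term matching on LOINC display names mentioned above is left out.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Collects every string stored under a "reference" key, at any depth.
function collectReferences(node: Json, found: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectReferences(item, found);
  } else if (node !== null && typeof node === "object") {
    for (const key of Object.keys(node)) {
      const value = node[key];
      if (key === "reference" && typeof value === "string") {
        found.push(value);
      } else {
        collectReferences(value, found);
      }
    }
  }
  return found;
}

// True only when the resource points at this exact result, either as a
// relative reference ("Type/id") or as an absolute URL ending in "/Type/id".
function isLinkedToResult(resource: Json, resultType: string, resultId: string): boolean {
  const target = `${resultType}/${resultId}`;
  return collectReferences(resource).some(
    (ref) => ref === target || ref.endsWith(`/${target}`)
  );
}
```

With this check, a Task whose focus points at a mammogram DiagnosticReport matches that report but not an unrelated CA-125 Observation, even if both belong to the same patient.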

Accomplishments that we're proud of

  • 50 assertions pass across 17 labeled test scenarios — open-loop detection, false-close resistance, FHIR write, multilingual outreach, safety gates, PubMed evidence retrieval, Twilio SMS delivery
  • Deterministic Closure Risk Index: a citable, auditable severity formula grounded in AHRQ published findings, analogous to how Naranjo scoring works in pharmacovigilance
  • 4 FHIR resource types written: Task, Provenance, DocumentReference (LOINC 11503-0, base64 HTML), and DetectedIssue (CAREGAP). Action-oriented writes require explicit clinician approval, and persistence writes include linked Provenance
  • Permanent EHR-native audit record — the HTML audit report is embedded as base64 in a DocumentReference, survives server restarts, readable by any FHIR R4-compliant system
  • Tri-state verification: verify_closure_claim returns PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE, a hallucination guardrail unique to our pipeline; a hallucinated closure is more dangerous than no answer
  • Multilingual outreach — drafts in English / Spanish / Hindi with 21st Century Cures Act framing; current Twilio trial delivery is English-only, with same-turn clinician approval gate enforced at the parameter level
  • Longitudinal failure detection — 24-month repeat-miss pattern surfaces patients who have fallen through the cracks before, not just the current open result
  • Concrete output example: for Maria Lopez, rank_open_loops_by_severity returns totalPriorityScore: 90 for the BI-RADS 4 mammogram (HIGH severity score 60 + ≥45 days timeliness score 30), ranked above the CA-125 elevation — deterministic, identical across every run
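
The permanent audit record mentioned in the list above can be illustrated with a sketch of how an HTML report might be wrapped in a FHIR R4 DocumentReference. The LOINC code 11503-0, the base64 embedding, and the resource type come from this write-up; the function name, the chosen fields, and the use of Node's Buffer are assumptions, and the linked Provenance resource is omitted.

```typescript
// Builds a DocumentReference that embeds the HTML report as base64 data.
function buildAuditDocumentReference(patientId: string, reportHtml: string, isoDate: string) {
  return {
    resourceType: "DocumentReference",
    status: "current",
    type: {
      coding: [{ system: "http://loinc.org", code: "11503-0" }],
    },
    subject: { reference: `Patient/${patientId}` },
    date: isoDate,
    content: [
      {
        attachment: {
          contentType: "text/html",
          // FHIR attachments carry inline content as base64.
          data: Buffer.from(reportHtml, "utf8").toString("base64"),
        },
      },
    ],
  };
}
```

Because the report travels inside the resource, any system that can read the DocumentReference can decode the attachment without reaching back to the server that generated it.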

Benchmark

Eval corpus: 5 synthetic patients, 17 labeled scenarios, 50 assertions. The pipeline is fully deterministic, so results are identical across runs.

| Benchmark | Score |
| --- | --- |
| Open-loop detection (True Positive Rate) | 100% (11/11) |
| False closure rate | 0% (0/11) |
| Hallucination guardrail (verify_closure_claim) | 100% (3/3 verdicts correct) |
| CRI severity ranking correctness | 100% (3/3 rank orders correct) |
| FHIR write gate integrity | 100% (blocked without approval, writes with approval) |
| Multilingual outreach (EN / ES / HI) | 100% (3/3 languages correct) |
| PubMed evidence retrieval | 100% (real PMIDs verified; graceful no-results handling) |
| Twilio SMS delivery + approval gate | 100% (English SMS delivered to verified number (SMce30...); blocked without same-turn approval) |
| Total | 50/50 assertions pass |

What we learned

  • The most dangerous moment in clinical AI is a confident wrong answer. The tri-state verification (PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE) exists specifically because a hallucinated closure is more dangerous than no answer at all.
  • FHIR result linkage is the hardest part. Commercial follow-up platforms invest heavily in it because it is genuinely hard: determining which follow-up belongs to which result requires reasonReference, basedOn, focus, and temporal proximity, not just date matching.
  • Deterministic scoring beats LLM scoring for clinical triage. The CRI formula produces the same ranking every run. Judges, clinicians, and auditors can verify it. That reproducibility is what makes it trustworthy.
  • The $870B problem is real. OECD, AHRQ, and five funded startups all independently converged on the same problem. MCP is the missing interoperability layer that lets this capability work inside any agent workflow.

What's next for ResultLoop

  • rank_patients_by_loop_risk — multi-patient ward view; a charge nurse selects a department and sees every patient ranked by CRI in one call, not patient-by-patient
  • A2A agent layer — ResultLoop Auditor as a standalone A2A agent exposing verify_closure_claim as a public safety primitive; any agent on the Prompt Opinion platform can call it without owning the full audit pipeline — a cross-agent hallucination guardrail
  • False-positive analysis: expanded eval corpus with precision/recall breakdown per result type (BI-RADS, Lung-RADS, lab panels); the current corpus shows a 100% true-positive rate and 0% false closures but needs wider coverage before production deployment
  • EHR read-back — query the written DetectedIssue and Task resources post-session to confirm FHIR persistence and close the audit loop end-to-end

Judge verification

The uploaded judge setup package includes synthetic FHIR patient bundles, Prompt Opinion agent setup instructions, and testing output for the 17 validation scenarios.
