ResultLoop

Inspiration

The OECD published a 98-page working paper in March 2025 estimating that diagnostic errors cost the United States $870 billion per year — 17.5% of total healthcare expenditure. One of the most documented and preventable causes isn't misreading a scan. It's the follow-up that never happened.

35.5% of pulmonary nodule patients receive no follow-up imaging (AHRQ / Brigham & Women's Hospital, 2025)
26% of BI-RADS 3 breast findings — 386 out of 1,511 patients — have no documented closure (AHRQ Lacson Report, 2025)
1 in 7 radiology recommendations never achieve documented follow-up (JAMA Network Open, 2022)
44% of diagnostic errors involve a failure to follow up on a lab or imaging result (Schiff et al., cited BMJ Open Quality 2023)
The follow-up and coordination phase is a cause or contributor in 46% of severe and fatal diagnostic adverse events (PMC, March 2025)

The problem isn't that clinicians don't care. It's that no system is watching the loop.

Consider this scenario: a BI-RADS 4 mammogram is filed. The radiologist moves on. The referring clinician never gets a callback. The patient never hears back. Sixty-three days pass. A suspicious breast finding remains unresolved, and the window for earlier diagnosis narrows. Nobody lied, nobody was negligent — the loop simply was never watched.

Commercial platforms like Rad AI Continuity ($40M+ funded), Inflo Health, and Vital Guard are racing to solve this for large health systems — but none of them work inside an AI agent platform via MCP. ResultLoop is an MCP-native implementation of closed-loop abnormal result follow-up auditing, built to run inside an AI agent platform rather than as a standalone enterprise workflow system.

This is distinct from HEDIS quality measure gap programs, which track whether preventive screenings were ordered. ResultLoop audits the step after a critical finding has already been returned — whether the abnormal result has documented follow-up or is silently sitting open in the chart.

What it does

ResultLoop is a SHARP-on-MCP clinical safety server that audits patient FHIR records for abnormal lab and imaging results lacking documented follow-up closure. It works inside Prompt Opinion: any agent can invoke it against a patient's live FHIR record to discover open loops, rank them by risk, and draft closure actions for clinician review.

14 tools across 4 workflow stages:

DISCOVER

find_recent_abnormal_results — scans Observations and DiagnosticReports for abnormal/critical/panic-value flags
get_result_followup_context — searches ServiceRequest, Task, Communication, CarePlan, Encounter for closure evidence; returns OPEN_LOOP or CLOSED
get_patient_loop_history — detects longitudinal closure-failure patterns across 24 months (repeat misses)
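As a rough illustration of the first DISCOVER step, the sketch below filters Observations by their interpretation flag. It is a minimal TypeScript sketch, not ResultLoop's actual code: the trimmed Observation shape, the helper names isAbnormal and findAbnormalObservations, and the exact flag set (HL7 v3 ObservationInterpretation codes A, AA, H, HH, L, LL) are assumptions made for the example.

```typescript
// Trimmed FHIR R4 shapes: only the fields this sketch reads.
interface Coding { system?: string; code?: string }
interface CodeableConcept { coding?: Coding[]; text?: string }
interface Observation {
  resourceType: "Observation";
  id: string;
  interpretation?: CodeableConcept[];
}

// Assumed flag set: abnormal (A, AA), high (H, HH) and low (L, LL).
const ABNORMAL_CODES = new Set(["A", "AA", "H", "HH", "L", "LL"]);

// True when any interpretation coding carries one of the assumed flags.
function isAbnormal(obs: Observation): boolean {
  return (obs.interpretation ?? []).some((concept) =>
    (concept.coding ?? []).some(
      (coding) => coding.code !== undefined && ABNORMAL_CODES.has(coding.code),
    ),
  );
}

// Hypothetical helper: keeps only the flagged Observations.
function findAbnormalObservations(observations: Observation[]): Observation[] {
  return observations.filter(isAbnormal);
}
```

Observations with no interpretation element are treated as not flagged here; a real scanner would probably also compare values against reference ranges.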

EVIDENCE

get_clinical_guideline — searches PubMed E-utilities (esearch → esummary) for real PMID-cited clinical guidelines by finding type (BI-RADS, Lung-RADS, CA-125, etc.); returns journal metadata and direct PubMed URLs; no patient data sent to NCBI
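The esearch → esummary pipeline can be sketched as two URL builders plus a parser for the esearch JSON response. This is an illustrative sketch only: the helper names, the retmax default, and the choice to drop non-numeric IDs are assumptions, and the search term itself would be built from the finding type alone so that no patient data leaves the server.

```typescript
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Step 1: esearch returns the PMIDs that match a query term.
function buildEsearchUrl(term: string, retmax: number = 5): string {
  return `${EUTILS_BASE}/esearch.fcgi?db=pubmed&term=${encodeURIComponent(term)}&retmode=json&retmax=${retmax}`;
}

// Step 2: esummary returns journal metadata for those PMIDs.
// Non-numeric IDs are dropped so the id list needs no further escaping.
function buildEsummaryUrl(pmids: string[]): string {
  const ids = pmids.filter((pmid) => /^\d+$/.test(pmid)).join(",");
  return `${EUTILS_BASE}/esummary.fcgi?db=pubmed&id=${ids}&retmode=json`;
}

// Shape of the part of the esearch JSON response this sketch reads.
interface EsearchResponse { esearchresult?: { idlist?: string[] } }

// Returns an empty list when the search has no results (graceful handling).
function extractPmids(body: EsearchResponse): string[] {
  return body.esearchresult?.idlist ?? [];
}

// Direct PubMed link for a PMID.
function pubmedUrl(pmid: string): string {
  return `https://pubmed.ncbi.nlm.nih.gov/${pmid}/`;
}
```

A caller would fetch the esearch URL, pass the parsed JSON to extractPmids, and fetch the esummary URL only when the list is non-empty.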

ANALYZE

rank_open_loops_by_severity — deterministic Closure Risk Index (CRI) using BI-RADS, Lung-RADS, HbA1c, eGFR, TSH, potassium, FIT, hemoglobin thresholds — no LLM touches the score
generate_result_closure_summary — structured clinical safety artifact with evidence matrix and per-result risk assessment
verify_closure_claim — hallucination guardrail; returns PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE
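The tri-state verdict can be illustrated with a small comparison between what an agent claims and what the chart evidence shows. The three verdict names and the OPEN_LOOP / CLOSED statuses come from the tool descriptions above; the ClosureClaim and LoopEvidence shapes and the function body are hypothetical and only show the idea.

```typescript
type Verdict = "PASS" | "CONTRADICTION" | "INSUFFICIENT_EVIDENCE";

// Hypothetical shapes for the claim being checked and the chart evidence.
interface ClosureClaim { resultId: string; claimedClosed: boolean }
interface LoopEvidence { resultId: string; status: "OPEN_LOOP" | "CLOSED" }

function verifyClosureClaim(claim: ClosureClaim, evidence: LoopEvidence[]): Verdict {
  const match = evidence.find((e) => e.resultId === claim.resultId);
  // No chart evidence for this result: refuse to confirm either way.
  if (match === undefined) return "INSUFFICIENT_EVIDENCE";
  const actuallyClosed = match.status === "CLOSED";
  // The claim passes only when it agrees with the documented status.
  return actuallyClosed === claim.claimedClosed ? "PASS" : "CONTRADICTION";
}
```

The third state is the point of the design: when evidence is missing, the function declines to answer rather than defaulting to "closed".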

ACT

draft_followup_task_resource — draft-only FHIR R4 Task + AuditEvent with SHA-256 hash; never writes to the chart
write_followup_task — writes FHIR Task + Provenance to the chart; requires explicit clinicianApprovalConfirmed: true in the same conversation turn — blocked by parameter-level validation, not just prompt
draft_outreach_message — multilingual patient outreach (English / Spanish / Hindi) with 21st Century Cures Act framing
generate_audit_report — self-contained HTML clinical safety report — donut chart, severity cards, evidence matrix
write_audit_report_document_reference — writes the HTML audit report as a permanent FHIR DocumentReference (LOINC 11503-0, base64 embedded, linked Provenance) — survives session expiry, readable by any downstream EHR
write_detected_issue — writes a FHIR DetectedIssue (CAREGAP / care gap) for each open-loop result; severity mapped from CRI score; gated by explicit clinician approval
send_sms_outreach — sends clinician-approved patient outreach SMS via Twilio Programmable Messaging; gated by explicit clinicianApprovalConfirmed: true in the same conversation turn; E.164 phone validation; English (en) only on current Twilio trial account; never sends autonomously
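A parameter-level approval gate of the kind described for the write tools might look like the sketch below. Only the clinicianApprovalConfirmed parameter name comes from the description above; WriteTaskArgs, GateResult, and checkApprovalGate are hypothetical names, and same-turn tracking is out of scope for this example.

```typescript
// Hypothetical argument shape; the flag is typed as unknown because tool
// arguments arrive as untrusted JSON.
interface WriteTaskArgs {
  patientId: string;
  resultId: string;
  clinicianApprovalConfirmed?: unknown;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

function checkApprovalGate(args: WriteTaskArgs): GateResult {
  // Strict identity check: "true", 1, null and undefined are all rejected.
  if (args.clinicianApprovalConfirmed !== true) {
    return {
      allowed: false,
      reason: "clinicianApprovalConfirmed must be the boolean true",
    };
  }
  return { allowed: true };
}
```

Because the check runs in server code before any FHIR write, a prompt such as "write it anyway" cannot bypass it unless the caller actually supplies the boolean flag.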

Safety by design: Action-oriented writes such as Task, DetectedIssue, and SMS outreach require explicit clinician approval in the same conversation turn. Audit report persistence requires a valid generated reportId and writes a DocumentReference + Provenance record. ResultLoop identifies and audits — it never diagnoses, never treats, never overrides clinical judgment.

How we built it

Runtime: TypeScript + Node.js on Express
MCP transport: Streamable HTTP (SHARP-on-MCP, ai.promptopinion/fhir-context extension declared in capabilities)
FHIR R4: reads Observation, DiagnosticReport, ServiceRequest, Task, Communication, CarePlan, Encounter; writes Task + Provenance + DocumentReference (LOINC 11503-0, base64 HTML audit report) + DetectedIssue (CAREGAP, severity=high/moderate/low). Task and DetectedIssue writes require explicit clinician approval; DocumentReference persistence requires a valid generated reportId and linked Provenance.
PubMed: NCBI E-utilities two-step pipeline (esearch → esummary); real PMIDs verified against live PubMed; no patient data transmitted to NCBI
Deterministic scoring: Closure Risk Index (CRI) — totalPriorityScore = severityScore + timelinessScore; severity classified from BI-RADS / Lung-RADS / lab thresholds (CRITICAL→90, HIGH→60, MODERATE→30, LOW→10); timeliness step-scored by days since result (≥45d→30, ≥30d→20, ≥14d→10); no LLM touches the score
Integrity: SHA-256 closure artifact hash on every audit; Provenance chain on every FHIR write
Patients: 5 synthetic FHIR transaction bundles (Maria Lopez, James Chen, Sarah Williams, Robert Johnson, Priya Patel) across oncology, nephrology, endocrinology, cardiology, pulmonology
Deployment: Render — live at https://resultloop-mcp.onrender.com/mcp
Twilio SMS: native fetch-based Twilio Programmable Messaging integration (no SDK); HTTP Basic auth; E.164 phone validation; 155-char GSM-7 hard cap (single segment on Twilio trial); same-turn clinician approval gate enforced at parameter level; English-only SMS delivery (ES/HI drafts generated but blocked at send-time on trial account)
Validation: 17 labeled test scenarios across 5 synthetic patients — 50 assertions, zero false closures, all safety gates verified
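The Closure Risk Index arithmetic listed above can be written out as a short sketch. The severity and timeliness values are the ones stated in the scoring line; the function names, the OpenLoop shape, the score of 0 for results under 14 days old, and the tie-break by result ID are assumptions added for the example.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MODERATE" | "LOW";

// Severity scores as listed in the scoring description.
const SEVERITY_SCORE: Record<Severity, number> = {
  CRITICAL: 90,
  HIGH: 60,
  MODERATE: 30,
  LOW: 10,
};

// Step-scored timeliness by days since the result was filed.
function timelinessScore(daysSinceResult: number): number {
  if (daysSinceResult >= 45) return 30;
  if (daysSinceResult >= 30) return 20;
  if (daysSinceResult >= 14) return 10;
  return 0; // assumption: the source gives no score below 14 days
}

function computePriorityScore(severity: Severity, daysSinceResult: number): number {
  return SEVERITY_SCORE[severity] + timelinessScore(daysSinceResult);
}

// Hypothetical shape for one open loop.
interface OpenLoop { resultId: string; severity: Severity; daysSinceResult: number }

// Highest score first; ties broken by resultId so the order is stable (assumption).
function rankOpenLoops(loops: OpenLoop[]): (OpenLoop & { totalPriorityScore: number })[] {
  return loops
    .map((loop) => ({
      ...loop,
      totalPriorityScore: computePriorityScore(loop.severity, loop.daysSinceResult),
    }))
    .sort(
      (a, b) =>
        b.totalPriorityScore - a.totalPriorityScore ||
        a.resultId.localeCompare(b.resultId),
    );
}

// HIGH severity at 63 days: 60 + 30
console.log(computePriorityScore("HIGH", 63)); // 90
```

Since every input is a threshold lookup, the same chart always produces the same ranking, which is what makes the score auditable.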

Challenges we ran into

Per-result closure isolation: A follow-up ServiceRequest for one condition must not count as closure evidence for a different abnormal result. We built a custom result-linkage resolver (result-linkage.ts) that deep-walks all reference values in a FHIR resource's JSON and matches by result ID, combined with code-term matching on LOINC display names, so a Task linked to a mammogram DiagnosticReport does not count as closure for an unrelated CA-125 Observation. This is one of the hardest engineering problems in the domain, and a major reason funded commercial platforms invest heavily in closed-loop follow-up infrastructure.
Deterministic vs. LLM ranking: Initial severity ranking used LLM judgment, which varied run-to-run. Replaced with the deterministic CRI formula so ranking is reproducible, citable, and audit-safe.
Approval gate integrity: write_followup_task must never fire without explicit approval — enforced at both the system-prompt level and parameter level (clinicianApprovalConfirmed: boolean). Tested with an adversarial "write it anyway" prompt — correctly blocked.
Visual audit report: Generated a self-contained HTML report with live donut chart and severity cards from pure MCP tool output, served directly from the Express server — no frontend build step.
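The deep-walk idea behind per-result closure isolation can be sketched as a recursive search for reference strings. This is not the contents of result-linkage.ts: collectReferences and linksToResult are hypothetical names, and the sketch covers only the reference-matching half, not the code-term matching on LOINC display names.

```typescript
// Collect every string stored under a `reference` key, at any depth.
function collectReferences(node: unknown, out: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectReferences(item, out);
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      if (key === "reference" && typeof value === "string") {
        out.push(value);
      } else {
        collectReferences(value, out);
      }
    }
  }
  return out;
}

// True when the resource points at exactly this result, either as a relative
// reference ("DiagnosticReport/mammo-1") or as an absolute URL ending in it.
function linksToResult(resource: unknown, resultType: string, resultId: string): boolean {
  const target = `${resultType}/${resultId}`;
  return collectReferences(resource).some(
    (ref) => ref === target || ref.endsWith(`/${target}`),
  );
}
```

Walking the whole JSON tree means the check does not depend on whether the link sits in basedOn, focus, reasonReference, or an extension.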

Accomplishments that we're proud of

50 assertions pass across 17 labeled test scenarios — open-loop detection, false-close resistance, FHIR write, multilingual outreach, safety gates, PubMed evidence retrieval, Twilio SMS delivery
Deterministic Closure Risk Index — a citable, auditable severity formula grounded in AHRQ published findings, analogous to how Naranjo scoring works in pharmacovigilance
4 FHIR resource types written — Task, Provenance, DocumentReference (LOINC 11503-0, base64 HTML), DetectedIssue (CAREGAP) — action-oriented writes require explicit clinician approval, and persistence writes include linked Provenance
Permanent EHR-native audit record — the HTML audit report is embedded as base64 in a DocumentReference, survives server restarts, readable by any FHIR R4-compliant system
Tri-state verification — verify_closure_claim returns PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE — a hallucination guardrail unique to our pipeline; a hallucinated closure is more dangerous than no answer
Multilingual outreach — drafts in English / Spanish / Hindi with 21st Century Cures Act framing; current Twilio trial delivery is English-only, with same-turn clinician approval gate enforced at the parameter level
Longitudinal failure detection — 24-month repeat-miss pattern surfaces patients who have fallen through the cracks before, not just the current open result
Concrete output example: for Maria Lopez, rank_open_loops_by_severity returns totalPriorityScore: 90 for the BI-RADS 4 mammogram (HIGH severity score 60 + ≥45 days timeliness score 30), ranked above the CA-125 elevation — deterministic, identical across every run
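Embedding the HTML report in a DocumentReference and hashing it might look like the following sketch, which assumes a Node.js runtime. The LOINC code 11503-0 and the base64 embedding come from the descriptions above; the function name, the trimmed resource shape, and returning the SHA-256 digest beside the resource are assumptions. The digest is deliberately not placed in Attachment.hash, which FHIR R4 defines as a SHA-1 value, and the linked Provenance resource is omitted.

```typescript
import { Buffer } from "node:buffer";
import { createHash } from "node:crypto";

// Trimmed FHIR R4 DocumentReference: only the fields this sketch sets.
interface DocumentReferenceDraft {
  resourceType: "DocumentReference";
  status: "current";
  type: { coding: { system: string; code: string }[] };
  subject: { reference: string };
  content: { attachment: { contentType: string; data: string } }[];
}

// Hypothetical helper: wraps an HTML report and returns an integrity digest.
function buildAuditDocumentReference(
  patientId: string,
  html: string,
): { resource: DocumentReferenceDraft; sha256: string } {
  const data = Buffer.from(html, "utf8").toString("base64");
  const sha256 = createHash("sha256").update(html, "utf8").digest("hex");
  const resource: DocumentReferenceDraft = {
    resourceType: "DocumentReference",
    status: "current",
    type: { coding: [{ system: "http://loinc.org", code: "11503-0" }] },
    subject: { reference: `Patient/${patientId}` },
    content: [{ attachment: { contentType: "text/html", data } }],
  };
  return { resource, sha256 };
}
```

Because the report travels inside the resource rather than as a link back to the MCP server, any reader of the chart can decode it after the session has ended.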

Benchmark

Eval corpus: 17 labeled scenarios across 5 synthetic patients, 50 assertions in total. The pipeline is fully deterministic, so results are identical across runs.

Benchmark	Score
Open-loop detection (True Positive Rate)	100% — 11/11
False closure rate	0% — 0/11
Hallucination guardrail (`verify_closure_claim`)	100% — 3/3 verdicts correct
CRI severity ranking correctness	100% — 3/3 rank orders correct
FHIR write gate integrity	100% — blocked without approval, writes with approval
Multilingual outreach (EN / ES / HI)	100% — 3/3 languages correct
PubMed evidence retrieval	100% — real PMIDs verified; graceful no-results handling
Twilio SMS delivery + approval gate	100% — English SMS delivered to verified number (SMce30...); blocked without same-turn approval
Total	50/50 assertions pass

What we learned

The most dangerous moment in clinical AI is a confident wrong answer. The tri-state verification (PASS / CONTRADICTION / INSUFFICIENT_EVIDENCE) exists specifically because a hallucinated closure is more dangerous than no answer at all.
FHIR result linkage is the hardest part. Determining which follow-up belongs to which result requires reasonReference, basedOn, focus, and temporal proximity, not just date matching. That difficulty is a large part of why commercial follow-up platforms need so much investment.
Deterministic scoring beats LLM scoring for clinical triage. The CRI formula produces the same ranking every run. Judges, clinicians, and auditors can verify it. That reproducibility is what makes it trustworthy.
The $870B problem is real. OECD, AHRQ, and multiple funded startups all independently converged on the same problem. MCP is the missing interoperability layer that lets this capability work inside any agent workflow.

What's next for ResultLoop

rank_patients_by_loop_risk — multi-patient ward view; a charge nurse selects a department and sees every patient ranked by CRI in one call, not patient-by-patient
A2A agent layer — ResultLoop Auditor as a standalone A2A agent exposing verify_closure_claim as a public safety primitive; any agent on the Prompt Opinion platform can call it without owning the full audit pipeline — a cross-agent hallucination guardrail
False-positive analysis — expanded eval corpus with precision/recall breakdown per result type (BI-RADS, Lung-RADS, lab panels); the current corpus shows 100% open-loop detection and a 0% false closure rate but needs wider coverage before production deployment
EHR read-back — query the written DetectedIssue and Task resources post-session to confirm FHIR persistence and close the audit loop end-to-end

Judge verification

The uploaded judge setup package includes synthetic FHIR patient bundles, Prompt Opinion agent setup instructions, and testing output for the 17 validation scenarios.