Clinical Promise Keeper

Inspiration

50–60% of clinical follow-ups documented in physician notes (labs, referrals, imaging) are never completed. These aren't administrative oversights; they're patient safety gaps. A missed potassium recheck after a medication change can lead to a hospitalization. A screening mammogram delayed by two years can mean a cancer recurrence caught too late. I built Clinical Promise Keeper because I believe every commitment a physician makes in a clinical note should be tracked, verified, and acted on automatically.

What it does

Clinical Promise Keeper is an MCP-powered healthcare AI agent that:

Extracts clinical promises from physician notes — follow-up labs, referrals, imaging orders, medication rechecks — using a 5-pass AI pipeline with few-shot chain-of-thought prompting
Verifies each promise against FHIR R4 records — multi-hop verification across ServiceRequest, Observation, Appointment, and DiagnosticReport resources to determine if each promise was kept, is pending, or was missed
Flags clinical significance — AI-powered insight scoring (high/medium/low) that considers the patient's full clinical context (e.g., a potassium recheck is HIGH significance when the patient recently had a medication change affecting kidney function)
Generates actionable FHIR Task resources — draft tasks ready for physician review and signature, turning identified gaps into concrete next steps
Collaborates with other agents via A2A, consulting the Clinical Order Assistant to draft validated lab orders with LOINC codes, CPT codes, clinical justification, and specimen requirements

Both agents are published on the Prompt Opinion Marketplace, discoverable and invokable by any healthcare organization.
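
To make the "actionable FHIR Task resources" step concrete, here is a minimal sketch of turning one missed promise into a draft Task. The MissedPromise shape, the toDraftTask helper, and the significance-to-priority mapping are assumptions for illustration, not the project's actual code; the Task fields used (status, intent, priority, description, for, authoredOn, note) are standard FHIR R4 elements.

```typescript
// Hypothetical input shape; the project's internal promise model is not shown in the write-up.
type Significance = "high" | "medium" | "low";

interface MissedPromise {
  description: string; // e.g. "Recheck serum potassium"
  patientId: string;   // logical id of the FHIR Patient
  dueDate: string;     // ISO date on which the follow-up was due
  significance: Significance;
}

// Minimal subset of the FHIR R4 Task resource used by this sketch.
interface DraftTask {
  resourceType: "Task";
  status: "draft";
  intent: "proposal";
  priority: "routine" | "urgent";
  description: string;
  for: { reference: string };
  authoredOn: string;
  note: { text: string }[];
}

function toDraftTask(promise: MissedPromise, now: Date): DraftTask {
  return {
    resourceType: "Task",
    status: "draft",     // stays a draft until a physician reviews and signs it
    intent: "proposal",
    // Assumed mapping: only high-significance gaps are escalated.
    priority: promise.significance === "high" ? "urgent" : "routine",
    description: `Follow up on missed commitment: ${promise.description}`,
    for: { reference: `Patient/${promise.patientId}` },
    authoredOn: now.toISOString(),
    note: [{ text: `Originally due ${promise.dueDate}` }],
  };
}

const exampleTask = toDraftTask(
  { description: "Recheck serum potassium", patientId: "123", dueDate: "2026-01-15", significance: "high" },
  new Date("2026-03-22T16:00:00Z"),
);
console.log(JSON.stringify(exampleTask, null, 2));
```

Keeping status at "draft" and intent at "proposal" reflects the review-before-action design described above: nothing is ordered until a clinician accepts the task.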

How I built it

Architecture:

MCP Server exposing 4 tools (extract_promises, check_promises, generate_tasks, get_promise_summary) via direct JSON-RPC handler
5-Pass AI Pipeline: Extraction → Calibration → FHIR Verification → Clinical Insight → Narrative Summary
FHIR R4 Client with multi-hop verification across 5 resource types
Temporal Normalization Engine parsing expressions like "in 3 weeks," "by March," "quarterly," and "as needed"
A2A Multi-Agent Collaboration between Clinical Promise Keeper and Clinical Order Assistant
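
As an illustration of the temporal normalization component, the sketch below handles the four example expressions listed above. It is a simplified, rule-based stand-in under stated assumptions: the Normalized result type, the 30-day month, the 91-day quarter, and the reading of "by March" as the end of the next March are all choices made for this example, not details taken from the project.

```typescript
// Hypothetical result shape; the real engine's output format is not described in the write-up.
type Normalized =
  | { kind: "due"; date: Date }
  | { kind: "recurring"; intervalDays: number }
  | { kind: "conditional" };

const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];
const UNIT_DAYS: Record<string, number> = { day: 1, week: 7, month: 30, year: 365 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function normalizeTemporal(expr: string, noteDate: Date): Normalized | null {
  const text = expr.trim().toLowerCase();

  // No deadline can be derived from "as needed"; flag it instead of guessing.
  if (text === "as needed" || text === "prn") return { kind: "conditional" };
  if (text === "quarterly") return { kind: "recurring", intervalDays: 91 };

  // "in 3 weeks" -> note date plus 21 days
  const relative = /^in (\d+) (day|week|month|year)s?$/.exec(text);
  if (relative) {
    const amount = Number(relative[1]);
    const unit = relative[2] ?? "day";
    const days = amount * (UNIT_DAYS[unit] ?? 1);
    return { kind: "due", date: new Date(noteDate.getTime() + days * MS_PER_DAY) };
  }

  // "by March" -> last day of the next March on or after the note's month
  const byMonth = /^by ([a-z]+)$/.exec(text);
  if (byMonth) {
    const month = MONTHS.indexOf(byMonth[1] ?? "");
    if (month === -1) return null;
    const year = noteDate.getUTCFullYear() + (month < noteDate.getUTCMonth() ? 1 : 0);
    return { kind: "due", date: new Date(Date.UTC(year, month + 1, 0)) };
  }

  return null; // unrecognized expression: leave it for a later pass
}

const noteWritten = new Date("2026-01-01T00:00:00Z");
console.log(normalizeTemporal("in 3 weeks", noteWritten));
console.log(normalizeTemporal("by March", noteWritten));
```

Returning distinct kinds rather than always forcing a date keeps "quarterly" and "as needed" from being checked against a fabricated deadline.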

Tech stack:

TypeScript/Node.js runtime
Google Gemini AI (gemini-3.1-flash-lite-preview) via @google/genai SDK
FHIR R4 standard for healthcare interoperability
MCP (Model Context Protocol) for tool exposure
A2A (Agent-to-Agent) protocol for multi-agent collaboration
Google Cloud Run (us-central1) with min-instances=1
Prompt Opinion platform for agent hosting and marketplace distribution
SHARP Extension Specs for healthcare context headers

AI techniques:

Few-shot chain-of-thought prompting with 3 clinical note examples
Two-pass extraction with calibration (extract → validate against source text)
AI-powered clinical significance scoring with rule-based fallback
Confidence filtering (threshold: 0.3) to reduce false positives
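
The calibration and confidence-filtering steps can be sketched roughly as follows. This assumes the model returns, for each candidate promise, a verbatim evidence span and a confidence score; the ExtractedPromise shape and the calibrate helper are hypothetical, and only the 0.3 threshold comes from the list above.

```typescript
// Hypothetical shape for an extracted promise; field names are illustrative.
interface ExtractedPromise {
  text: string;       // normalized description produced by the model
  evidence: string;   // verbatim span the model claims to have read in the note
  confidence: number; // model-reported score in [0, 1]
}

const CONFIDENCE_THRESHOLD = 0.3; // threshold stated in the write-up

// Collapse whitespace and case so line wraps in the note do not defeat the check.
const squash = (s: string): string => s.toLowerCase().replace(/\s+/g, " ").trim();

// Calibration pass (assumed form): keep only promises whose quoted evidence
// really occurs in the source note, then drop low-confidence items.
function calibrate(note: string, candidates: ExtractedPromise[]): ExtractedPromise[] {
  const haystack = squash(note);
  return candidates.filter((candidate) => {
    const needle = squash(candidate.evidence);
    return (
      candidate.confidence >= CONFIDENCE_THRESHOLD &&
      needle.length > 0 &&
      haystack.includes(needle)
    );
  });
}

const sampleNote = "Plan: recheck potassium\nin 2 weeks. Refer to cardiology.";
const kept = calibrate(sampleNote, [
  { text: "Potassium recheck", evidence: "recheck potassium in 2 weeks", confidence: 0.9 },
  { text: "Chest CT", evidence: "order chest CT", confidence: 0.8 },            // not in the note
  { text: "Cardiology referral", evidence: "Refer to cardiology", confidence: 0.2 }, // below threshold
]);
console.log(kept.map((p) => p.text));
```

Checking the evidence span against the source text is a cheap guard against hallucinated promises; the low threshold then trims only the weakest candidates, which favors recall.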

Challenges I ran into

MCP SDK session management: Prompt Opinion doesn't send mcp-session-id headers on subsequent requests, causing "Server not initialized" errors. I bypassed the MCP SDK transport entirely and built a direct JSON-RPC handler.
Platform header mapping: Prompt Opinion sends x-inc-sd instead of the expected x-patient-id header. Required reverse-engineering the platform's header conventions and building a mapping layer.
Gemini model availability: The gemini-3.1-flash-lite-preview model returned 404 errors with the @google-cloud/vertexai SDK in us-central1. Resolved by switching to the @google/genai unified SDK with location: "global".
Extraction accuracy: Initial extraction F1 was 57.1% — many false positive/negative pairs were the same promise described differently. Improved fuzzy matching with clinical abbreviation normalization and class-flexible mode, reaching 81.2% F1.
Agent-to-tool communication: MCP SDK strips HTTP headers during JSON-RPC processing, preventing FHIR context from reaching tool handlers. Solved by passing headers directly through the tool call chain.
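
The session, header-mapping, and header-passing workarounds above fit together roughly as in this sketch: a stateless JSON-RPC dispatcher that never requires a session id and hands the request headers straight to the tool handler. All type and function names here are illustrative; only the MCP method names, the JSON-RPC error codes, and the x-inc-sd / x-patient-id header names are taken from the write-up or the relevant specs. Header keys are assumed to be lowercased already, as Node's HTTP layer does.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string | null;
  method: string;
  params?: { name?: string; arguments?: Record<string, unknown> };
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
}

type Headers = Record<string, string | undefined>;
interface ToolContext { patientId?: string }
type ToolHandler = (args: Record<string, unknown>, ctx: ToolContext) => unknown;

// Mapping layer: accept the expected header, fall back to the platform's own name.
function buildContext(headers: Headers): ToolContext {
  return { patientId: headers["x-patient-id"] ?? headers["x-inc-sd"] };
}

function handleJsonRpc(req: JsonRpcRequest, headers: Headers, tools: Record<string, ToolHandler>): JsonRpcResponse {
  const ok = (result: unknown): JsonRpcResponse => ({ jsonrpc: "2.0", id: req.id, result });
  const fail = (code: number, message: string): JsonRpcResponse =>
    ({ jsonrpc: "2.0", id: req.id, error: { code, message } });

  switch (req.method) {
    case "initialize":
      // Stateless: answer without creating or later requiring a session id.
      return ok({
        protocolVersion: "2025-03-26", // a published MCP revision; use whatever the client negotiates
        capabilities: { tools: {} },
        serverInfo: { name: "promise-keeper-sketch", version: "0.0.0" },
      });
    case "tools/list":
      // Simplified: real tool entries also carry a description and an inputSchema.
      return ok({ tools: Object.keys(tools).map((name) => ({ name })) });
    case "tools/call": {
      const tool = tools[req.params?.name ?? ""];
      if (!tool) return fail(-32602, "Unknown tool");
      // Headers travel with the call, so FHIR context reaches the handler.
      const output = tool(req.params?.arguments ?? {}, buildContext(headers));
      return ok({ content: [{ type: "text", text: JSON.stringify(output ?? null) }] });
    }
    default:
      return fail(-32601, "Method not found");
  }
}

const sketchTools: Record<string, ToolHandler> = {
  get_promise_summary: (_args, ctx) => ({ patientId: ctx.patientId, open: 0 }),
};
console.log(handleJsonRpc(
  { jsonrpc: "2.0", id: 1, method: "tools/call", params: { name: "get_promise_summary" } },
  { "x-inc-sd": "patient-42" },
  sketchTools,
));
```

Because every request is handled on its own, a client that omits mcp-session-id on later calls still gets a valid response.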

Accomplishments that I'm proud of

81.2% F1 score validated against 21 clinical notes across 8 medical specialties (primary care, cardiology, oncology, endocrinology, surgery, psychiatry, emergency medicine, nephrology)
84.4% recall: the extractor finds roughly 5 of every 6 clinical commitments present in the validation notes
100% accuracy in oncology — where missed follow-ups carry the highest patient safety risk
Two published agents on the Prompt Opinion Marketplace with working A2A collaboration
Production deployment on Google Cloud Run with health checks, metrics dashboard, and clinical disclaimers
Building a complete FHIR-native MCP pipeline that goes from unstructured physician notes to structured, actionable FHIR Task resources
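
For readers who want to relate the F1 and recall figures above, the sketch below back-calculates the precision they imply. This is derived arithmetic, not a reported result, and it assumes both numbers come from the same pooled evaluation; impliedPrecision is a helper written for this example.

```typescript
// F1 is the harmonic mean of precision P and recall R: F1 = 2PR / (P + R).
// Solving for P gives P = F1 * R / (2R - F1).
function impliedPrecision(f1: number, recall: number): number {
  const denominator = 2 * recall - f1;
  if (denominator <= 0) {
    throw new RangeError("no valid precision exists for these F1 and recall values");
  }
  return (f1 * recall) / denominator;
}

const precisionEstimate = impliedPrecision(0.812, 0.844);
console.log(precisionEstimate.toFixed(3)); // prints "0.782"
```

Under that assumption, roughly 78% of extracted promises are correct, so about one in five flagged items would be a false positive for a clinician to dismiss.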

What I learned

The MCP ecosystem is still maturing: platform implementations vary significantly, and building production-grade MCP servers requires workarounds that aren't documented yet
The A2A (Agent-to-Agent) protocol enables genuinely useful multi-agent workflows in healthcare: having a specialized order-drafting agent collaborate with a gap-detection agent produces better output than either alone
Clinical NLP is hard: the same promise can be expressed dozens of ways across specialties, and confidence calibration is essential to avoid alert fatigue
FHIR R4's resource model maps surprisingly well to the concept of "clinical promises": ServiceRequests, Observations, and Appointments naturally represent commitments and their fulfillment
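
One way to picture that mapping is a small classifier that turns the FHIR evidence found for a promise into a kept / pending / missed status. This is a sketch under assumptions: the ClinicalPromise and Evidence shapes are hypothetical, and the rule that a result means "kept" while an order or appointment alone means "pending" is a simplification chosen for this example, not the project's actual verification logic.

```typescript
type PromiseStatus = "kept" | "pending" | "missed";

// Hypothetical shapes; field names are illustrative, not the project's schema.
interface ClinicalPromise {
  description: string; // e.g. "recheck potassium"
  dueDate: Date;       // normalized deadline for the follow-up
}

interface Evidence {
  resourceType: "ServiceRequest" | "Observation" | "Appointment" | "DiagnosticReport";
  date: Date;          // when the matching record was created or resulted
}

// Assumed rules:
//  - a result (Observation or DiagnosticReport) means the promise was kept
//  - an order or appointment without a result means it is still pending
//  - no evidence at all is pending until the deadline passes, then missed
function classifyPromise(promise: ClinicalPromise, evidence: Evidence[], now: Date): PromiseStatus {
  const hasResult = evidence.some(
    (e) => e.resourceType === "Observation" || e.resourceType === "DiagnosticReport",
  );
  if (hasResult) return "kept";
  if (evidence.length > 0) return "pending";
  return now.getTime() > promise.dueDate.getTime() ? "missed" : "pending";
}

const potassiumRecheck: ClinicalPromise = {
  description: "recheck potassium",
  dueDate: new Date("2026-01-15T00:00:00Z"),
};
console.log(classifyPromise(potassiumRecheck, [], new Date("2026-02-01T00:00:00Z"))); // prints "missed"
```

A production version would likely treat an overdue order with no result differently from a fresh one, and would count a completed Appointment as fulfillment for referrals.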

What's next for Clinical Promise Keeper

EHR integration: Direct integration with Epic and Cerner via SMART on FHIR launch context
Longitudinal tracking: Track promises across multiple visits, not just single notes
Alert prioritization: Machine learning model trained on actual patient outcomes to rank promise urgency
Batch processing: Analyze entire patient panels to identify population-level care gaps
Specialty-specific models: Fine-tuned extraction for oncology, cardiology, and other high-risk specialties where my validation showed the highest impact
HIPAA compliance: This tool is designed as a clinical decision support system. All outputs are AI-generated recommendations that require clinician review before action. No protected health information (PHI) is stored or transmitted beyond the immediate processing context.

Built With

a2a
cloudrun
docker
fhir
gemini
google
mcp
node.js
promptopinion
sharpextensionspecs
typescript

Updates

Private user started this project — Mar 22, 2026 12:00 PM EDT
