Inspiration

50–60% of clinical follow-ups documented in physician notes (labs, referrals, imaging) are never completed. These aren't administrative oversights. They're patient safety gaps. A missed potassium recheck after a medication change can lead to a hospitalization. A screening mammogram delayed by two years can mean a cancer recurrence caught too late. I built Clinical Promise Keeper because I believe every commitment a physician makes in a clinical note should be tracked, verified, and acted on automatically.

What it does

Clinical Promise Keeper is an MCP-powered healthcare AI agent that:

  1. Extracts clinical promises from physician notes — follow-up labs, referrals, imaging orders, medication rechecks — using a 5-pass AI pipeline with few-shot chain-of-thought prompting
  2. Verifies each promise against FHIR R4 records — multi-hop verification across ServiceRequest, Observation, Appointment, and DiagnosticReport resources to determine if each promise was kept, is pending, or was missed
  3. Flags clinical significance — AI-powered insight scoring (high/medium/low) that considers the patient's full clinical context (e.g., a potassium recheck is HIGH significance when the patient recently had a medication change affecting kidney function)
  4. Generates actionable FHIR Task resources — draft tasks ready for physician review and signature, turning identified gaps into concrete next steps
  5. Collaborates with other agents via A2A, consulting the Clinical Order Assistant to draft validated lab orders with LOINC codes, CPT codes, clinical justification, and specimen requirements

Both agents are published on the Prompt Opinion Marketplace, discoverable and invokable by any healthcare organization.
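
To make the kept / pending / missed distinction from step 2 concrete, here is a minimal sketch of the decision. The type names and the `fulfilledOn` field are hypothetical, chosen for illustration rather than taken from the project, and the sketch assumes the FHIR lookups have already been reduced to a single fulfillment date.

```typescript
// Hypothetical shapes for illustration; not the project's actual types.
type PromiseStatus = "kept" | "pending" | "missed";

interface ClinicalPromise {
  description: string; // e.g. "recheck potassium"
  dueDate: Date;       // deadline derived from the note
}

interface FulfillmentEvidence {
  // Date of a matching FHIR resource (Observation, DiagnosticReport, ...),
  // or undefined if no match was found.
  fulfilledOn?: Date;
}

// Simplification: any matching record counts as "kept", even a late one.
function classifyPromise(
  promise: ClinicalPromise,
  evidence: FulfillmentEvidence,
  today: Date
): PromiseStatus {
  if (evidence.fulfilledOn !== undefined) return "kept";
  return today.getTime() <= promise.dueDate.getTime() ? "pending" : "missed";
}
```

A real verifier would likely also need to decide whether a late fulfillment counts as kept, which this sketch deliberately ignores.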

How I built it

Architecture:

  • MCP Server exposing 4 tools (extract_promises, check_promises, generate_tasks, get_promise_summary) via a direct JSON-RPC handler
  • 5-Pass AI Pipeline: Extraction → Calibration → FHIR Verification → Clinical Insight → Narrative Summary
  • FHIR R4 Client with multi-hop verification across 5 resource types
  • Temporal Normalization Engine parsing expressions like "in 3 weeks," "by March," "quarterly," and "as needed"
  • A2A Multi-Agent Collaboration between Clinical Promise Keeper and Clinical Order Assistant
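
As an illustration of what the temporal normalization step has to handle, the sketch below maps the four example expressions onto a small result type. Everything here is an assumption made for illustration: the `Normalized` type, the `normalizeTemporal` name, and in particular the reading of "by March" as the last day of the next March on or after the note date.

```typescript
// Hypothetical result type: a concrete deadline, a recurrence, or no deadline.
type Normalized =
  | { kind: "due"; date: Date }
  | { kind: "recurring"; everyMonths: number }
  | { kind: "open" };

const MONTHS = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

function normalizeTemporal(expr: string, noteDate: Date): Normalized | null {
  const text = expr.trim().toLowerCase();

  // Relative offsets such as "in 3 weeks" or "in 10 days".
  const rel = /^in (\d+) (day|week|month)s?$/.exec(text);
  if (rel) {
    const n = Number(rel[1]);
    const d = new Date(noteDate.getTime());
    if (rel[2] === "day") d.setUTCDate(d.getUTCDate() + n);
    else if (rel[2] === "week") d.setUTCDate(d.getUTCDate() + 7 * n);
    else d.setUTCMonth(d.getUTCMonth() + n); // may roll over at month ends
    return { kind: "due", date: d };
  }

  // "by March": assumed to mean the end of the next such month.
  const by = /^by ([a-z]+)$/.exec(text);
  if (by) {
    const month = MONTHS.indexOf(by[1] ?? "");
    if (month >= 0) {
      let year = noteDate.getUTCFullYear();
      if (month < noteDate.getUTCMonth()) year += 1;
      // Day 0 of the following month is the last day of `month`.
      return { kind: "due", date: new Date(Date.UTC(year, month + 1, 0)) };
    }
  }

  if (text === "quarterly") return { kind: "recurring", everyMonths: 3 };
  if (text === "as needed") return { kind: "open" };
  return null; // unrecognized: leave for another pass to resolve
}
```

Returning `null` for unrecognized expressions keeps the parser honest about what it could not interpret, instead of guessing a deadline.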

Tech stack:

  • TypeScript/Node.js runtime
  • Google Gemini AI (gemini-3.1-flash-lite-preview) via @google/genai SDK
  • FHIR R4 standard for healthcare interoperability
  • MCP (Model Context Protocol) for tool exposure
  • A2A (Agent-to-Agent) protocol for multi-agent collaboration
  • Google Cloud Run (us-central1) with min-instances=1
  • Prompt Opinion platform for agent hosting and marketplace distribution
  • SHARP Extension Specs for healthcare context headers

AI techniques:

  • Few-shot chain-of-thought prompting with 3 clinical note examples
  • Two-pass extraction with calibration (extract → validate against source text)
  • AI-powered clinical significance scoring with rule-based fallback
  • Confidence filtering (threshold: 0.3) to reduce false positives
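
The last two techniques can be sketched in a few lines. This is an illustrative sketch only: the type and function names are hypothetical, the threshold is assumed to be inclusive, and the keyword rules in the fallback are placeholders, not clinical guidance and not the project's actual rules.

```typescript
type Significance = "high" | "medium" | "low";

interface ExtractedPromise {
  text: string;
  confidence: number; // model-reported score in [0, 1]
}

const CONFIDENCE_THRESHOLD = 0.3; // value stated in the write-up

// Drop low-confidence extractions (assumes the threshold is inclusive).
function filterByConfidence(
  promises: ExtractedPromise[],
  threshold: number = CONFIDENCE_THRESHOLD
): ExtractedPromise[] {
  return promises.filter((p) => p.confidence >= threshold);
}

// Placeholder keyword rules, used only when the AI scorer is unavailable.
function fallbackSignificance(text: string): Significance {
  const t = text.toLowerCase();
  if (/\b(potassium|creatinine|biopsy|mammogram)\b/.test(t)) return "high";
  if (/\b(referral|imaging|mri|ct)\b/.test(t)) return "medium";
  return "low";
}

// Try the AI scorer first; fall back to the rules if it throws.
async function scoreSignificance(
  text: string,
  aiScorer: (text: string) => Promise<Significance>
): Promise<Significance> {
  try {
    return await aiScorer(text);
  } catch {
    return fallbackSignificance(text);
  }
}
```

The fallback means a model outage degrades the quality of the score without blocking the rest of the pipeline.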

Challenges I ran into

  1. MCP SDK session management: Prompt Opinion doesn't send mcp-session-id headers on subsequent requests, causing "Server not initialized" errors. I bypassed the MCP SDK transport entirely and built a direct JSON-RPC handler.
  2. Platform header mapping: Prompt Opinion sends x-inc-sd instead of the expected x-patient-id header. Required reverse-engineering the platform's header conventions and building a mapping layer.
  3. Gemini model availability: The gemini-3.1-flash-lite-preview model returned 404 errors with the @google-cloud/vertexai SDK in us-central1. Resolved by switching to the @google/genai unified SDK with location: "global".
  4. Extraction accuracy: Initial extraction F1 was 57.1% — many false positive/negative pairs were the same promise described differently. Improved fuzzy matching with clinical abbreviation normalization and class-flexible mode, reaching 81.2% F1.
  5. Agent-to-tool communication: MCP SDK strips HTTP headers during JSON-RPC processing, preventing FHIR context from reaching tool handlers. Solved by passing headers directly through the tool call chain.
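
Challenges 1, 2, and 5 share one idea: handle each JSON-RPC request statelessly and hand the HTTP headers to the tool yourself. The sketch below shows that idea under stated assumptions. `handleRpc`, `buildContext`, and `ToolContext` are hypothetical names, the value of x-inc-sd is assumed to be usable directly as the patient ID, and the `tools/list` response is simplified (a real MCP server also returns an input schema per tool).

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id?: number | string | null;
  method: string;
  params?: Record<string, unknown>;
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
}

type HttpHeaders = Record<string, string | undefined>;
interface ToolContext { patientId?: string }
type ToolHandler = (args: Record<string, unknown>, ctx: ToolContext) => unknown;

// Map platform headers onto the context the tools expect.
function buildContext(headers: HttpHeaders): ToolContext {
  const lower: HttpHeaders = {};
  for (const key of Object.keys(headers)) lower[key.toLowerCase()] = headers[key];
  return { patientId: lower["x-patient-id"] ?? lower["x-inc-sd"] };
}

// Stateless dispatch: no session ID is required between requests.
function handleRpc(
  req: JsonRpcRequest,
  headers: HttpHeaders,
  tools: Map<string, ToolHandler>
): JsonRpcResponse {
  const id = req.id ?? null;
  const ctx = buildContext(headers);
  switch (req.method) {
    case "tools/list":
      return { jsonrpc: "2.0", id, result: { tools: Array.from(tools.keys()).map((name) => ({ name })) } };
    case "tools/call": {
      const name = String(req.params?.name ?? "");
      const tool = tools.get(name);
      if (!tool) {
        return { jsonrpc: "2.0", id, error: { code: -32602, message: `Unknown tool: ${name}` } };
      }
      const args = (req.params?.arguments ?? {}) as Record<string, unknown>;
      const output = tool(args, ctx); // headers reach the tool via ctx
      return { jsonrpc: "2.0", id, result: { content: [{ type: "text", text: JSON.stringify(output) }] } };
    }
    default:
      return { jsonrpc: "2.0", id, error: { code: -32601, message: "Method not found" } };
  }
}
```

Because the context is rebuilt from the headers on every call, a client that never sends a session header still gets a working tool invocation.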

Accomplishments that I'm proud of

  • 81.2% F1 score validated against 21 clinical notes across 8 medical specialties (primary care, cardiology, oncology, endocrinology, surgery, psychiatry, emergency medicine, nephrology)
  • 84.4% recall: the extractor finds roughly 5 of every 6 clinical commitments written in the notes
  • 100% accuracy in oncology — where missed follow-ups carry the highest patient safety risk
  • Two published agents on the Prompt Opinion Marketplace with working A2A collaboration
  • Production deployment on Google Cloud Run with health checks, metrics dashboard, and clinical disclaimers
  • Building a complete FHIR-native MCP pipeline that goes from unstructured physician notes to structured, actionable FHIR Task resources
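
The F1 and recall figures depend on how an extracted promise is matched to a reference one. The sketch below shows one way such scoring could work: expand abbreviations, compare token sets, then compute precision, recall, and F1. The abbreviation table, the Jaccard measure, and the 0.5 threshold are all assumptions made for illustration; the project's actual matcher, including its class-flexible mode, is not shown here.

```typescript
// Hypothetical abbreviation table; a real one would be far larger.
const ABBREVIATIONS = new Map<string, string>([
  ["k", "potassium"],
  ["bmp", "basic metabolic panel"],
  ["echo", "echocardiogram"],
]);

function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => ABBREVIATIONS.get(token) ?? token)
    .join(" ");
}

// Jaccard similarity over unique tokens.
function jaccard(a: string, b: string): number {
  const tokensA = Array.from(new Set(a.split(" ")));
  const setB = new Set(b.split(" "));
  const shared = tokensA.filter((t) => setB.has(t)).length;
  const union = tokensA.length + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

interface Scores { tp: number; fp: number; fn: number; precision: number; recall: number; f1: number }

function scoreExtraction(predicted: string[], gold: string[], threshold = 0.5): Scores {
  const unmatched = gold.map(normalizeText);
  let tp = 0;
  for (const p of predicted.map(normalizeText)) {
    // Greedy one-to-one matching: each gold promise can be used once.
    const i = unmatched.findIndex((g) => jaccard(p, g) >= threshold);
    if (i >= 0) {
      tp += 1;
      unmatched.splice(i, 1);
    }
  }
  const fp = predicted.length - tp;
  const fn = gold.length - tp;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}
```

Without the normalization step, "recheck K" and "recheck potassium" would count as one false positive plus one false negative, which is the effect described in challenge 4.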

What I learned

  • The MCP ecosystem is still maturing: platform implementations vary significantly, and building production-grade MCP servers requires workarounds that aren't documented yet
  • A2A (Agent-to-Agent) protocol enables genuinely useful multi-agent workflows in healthcare: having a specialized order-drafting agent collaborate with a gap-detection agent produces better output than either alone
  • Clinical NLP is hard: the same promise can be expressed dozens of ways across specialties, and confidence calibration is essential to avoid alert fatigue
  • FHIR R4's resource model maps surprisingly well to the concept of "clinical promises": ServiceRequests, Observations, and Appointments naturally represent commitments and their fulfillment
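
To illustrate how naturally a missed promise maps onto FHIR, here is a sketch that turns one into a draft Task. The `MissedPromise` shape, the `toDraftTask` name, and the significance-to-priority mapping are assumptions for illustration; the Task fields used (`status`, `intent`, `priority`, `description`, `for`, `authoredOn`) are standard FHIR R4 elements.

```typescript
// Hypothetical input shape; not the project's actual type.
interface MissedPromise {
  patientId: string;
  description: string;
  dueDate: string; // ISO date the promise was due, e.g. "2024-01-31"
  significance: "high" | "medium" | "low";
}

// Subset of the FHIR R4 Task resource used by this sketch.
interface DraftTask {
  resourceType: "Task";
  status: "draft";        // not actionable until a clinician signs off
  intent: "proposal";
  priority: "routine" | "urgent";
  description: string;
  for: { reference: string };
  authoredOn: string;
}

function toDraftTask(promise: MissedPromise, now: Date): DraftTask {
  return {
    resourceType: "Task",
    status: "draft",
    intent: "proposal",
    // Assumed mapping: only high-significance gaps are marked urgent.
    priority: promise.significance === "high" ? "urgent" : "routine",
    description: `Missed follow-up (due ${promise.dueDate}): ${promise.description}`,
    for: { reference: `Patient/${promise.patientId}` },
    authoredOn: now.toISOString(),
  };
}
```

Keeping the status at "draft" reflects the review-and-sign workflow described earlier: the agent proposes, and the physician decides.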

What's next for Clinical Promise Keeper

  • EHR integration: Direct integration with Epic and Cerner via SMART on FHIR launch context
  • Longitudinal tracking: Track promises across multiple visits, not just single notes
  • Alert prioritization: Machine learning model trained on actual patient outcomes to rank promise urgency
  • Batch processing: Analyze entire patient panels to identify population-level care gaps
  • Specialty-specific models: Fine-tuned extraction for oncology, cardiology, and other high-risk specialties where my validation showed the highest impact
  • HIPAA compliance: This tool is designed as a clinical decision support system. All outputs are AI-generated recommendations that require clinician review before action. No protected health information (PHI) is stored or transmitted beyond the immediate processing context.
