Inspiration

Every year, 4.2 million people die within 30 days of surgery — more than HIV, TB, and malaria combined (Lancet 2019). The pattern shows in the vital trends 3–6 hours before the crash, but ward nurses watch eight patients across twelve-hour shifts. The danger isn't a single threshold being crossed; it's the slow, multivariate drift across SBP, HR, RR, urine output, and what the bedside nurse writes in the chart.

Two complementary papers landed the architectural direction:

  • TREWS (Adams et al, Nature Medicine 2022) — prospective implementation of a machine-learning early-warning system across five hospitals, achieving an 18% relative reduction in in-hospital mortality with a 5.7-hour median lead time before threshold.
  • COMPOSER-LLM (npj Digital Medicine, May 2025) — prospective trial of an LLM extracting sepsis signals from unstructured nursing notes alongside a deterministic detector: 72.1% sensitivity, 52.9% PPV, 0.0087 false alarms per patient-hour.

Vigil is what happens when you take those two ideas — predictive early-warning + LLM enrichment of free-text nursing observations — and make them invocable as an A2A agent any clinician can consult on Prompt Opinion's marketplace.

What it does

Vigil is a postoperative and postpartum sentinel agent with 18 A2A skills covering the full clinical-deterioration arc. The agent ships under the Path B Option 3 model — Independent A2A Agent — registered on Prompt Opinion's Marketplace.

Clinical screens (8 skills): MEWT vital thresholds, qSOFA + composite risk, CDC Adult Sepsis Event, NEWS2 (RCP 2017), KDIGO-staged AKI, CMQCC postpartum hemorrhage staging, drug-vs-physiology safety, and RCPCH/Monaghan paediatric early-warning (PEWS).

Composite + AI-only skills (4):

  • draft_sbar — multi-modal SBAR fusing structured vitals + labs + free-text nursing notes into a copy-paste-ready handoff with an LLM-ranked differential
  • read_nursing_signals — LLM extracts subjective deterioration phrases ("patient feels off", "increasingly restless") that vital-sign threshold engines literally cannot read
  • forecast_trajectory — least-squares regression projecting time-to-threshold-breach per vital with 95% CI, mapped to TREWS lead-time
  • explain — conversational follow-up that answers free-text clinical questions with cited guideline anchors, never deciding escalation

Workflow + ops (6): autonomous monitoring loop, on-demand cycle trigger, review-queue digest, per-patient watch status, hypothetical ROI estimation, and active-learning feedback recording.

Every clinical verdict cites the named guideline behind it (Subbe MEWS 2001, qSOFA Sepsis-3 JAMA 2016, RCP NEWS2 2017, KDIGO 2012, CMQCC v3.0, CDC ASE, RCPCH PEWS, Surviving Sepsis Campaign 2021) and carries a confidence tag (HIGH / MEDIUM / LOW) with a stated reason — TRIPOD+AI 2024 compliance for free.

How we built it

Architecture invariant from day one: deterministic rules first, LLM second. The rule engines (in backend/criteria/) drive every escalation decision. The LLM only enriches narrative, extracts free-text signals, projects trajectories, and ranks differentials. This invariant is verified in code — the A2A agent has no FHIR write capability whatsoever. Only the FastAPI proxy's approve_alert flow writes, and only after a clinician's POST.

The backend stack:

  • Python 3.11 + FastMCP (MCP tool layer) + a2a-sdk (A2A agent shell) + FastAPI (clinician-approve proxy)
  • Pluggable LLM provider via LLM_PROVIDER env (Ollama / Groq / Anthropic Claude / Google Gemini) — Llama-3 8B vs Mixtral comparable per AMIA 2024
  • HAPI FHIR R4 v7.2.0 + Postgres for the embedded clinical store
  • SHARP context propagation per Prompt Opinion's spec — the agent reads from whichever FHIR server PO injects per request

The trust layer (Phase 3 commit): Every clinician-approved FHIR write emits a four-resource attestation chain:

  1. Device (Vigil agent) with version from the AgentCard and a SHA-256 of the AgentCard JSON stamped as a property
  2. Communication with US Core profile
  3. Provenance linking the Communication to the Device (author) + Practitioner (verifier), with the AgentCard hash carried in signature.data under custom sigFormat application/x-vigil-agent-card-sha256
  4. AuditEvent per FHIR R4 profile

The validation layer (Phase 4): tests/validation/ ships two pytest harnesses with TRIPOD+AI 2024 reporting compliance:

  • A baseline harness asserts sensitivity ≥ 0.80, specificity ≥ 0.80, mean lead time ≥ 1 timepoint
  • A comparative harness pits Vigil's combined screen against NEWS2-only and qSOFA-only on the same synthetic cohort. Headline result: Vigil sensitivity 1.000, specificity 1.000, mean lead time 3 timepoints1.4 timepoints (~1.4h) ahead of either single-rule baseline.

Deployment: AWS EC2 c7i-flex.large + docker-compose with five services (HAPI, MCP, A2A, API proxy, Caddy reverse proxy) behind auto-TLS via Let's Encrypt. GitHub Actions deploys on every push to main. One-shot seeder service re-anchors the synthetic cohort's observation timestamps to now on every container start.

The result, end-to-end: 654 passing pytest items · ruff-clean · multi-LLM provider swap · 18 A2A skills enumerated on the live AgentCard at https://13-238-216-196.sslip.io/.well-known/agent-card.json · v1.0.0 tagged with GitHub Release notes.

Challenges we ran into

1. PO's chat agent threading. Prompt Opinion's General Chat agent reuses the same A2A task ID across multiple skill invocations in the same conversation. Once Vigil emitted TaskState.completed for the first skill, every follow-up call landed on a now-terminal task and got rejected with "Task is in terminal state: completed" — which is the spec-correct A2A behaviour. Fixed by adding a small middleware (PoCompatMiddleware._strip_stale_task_ids) that strips every taskId reference from inbound payloads, so Vigil's DefaultRequestHandler mints a fresh task per request. Spec-non-strict but necessary for the only A2A client we target.

2. FHIR-import compatibility. Our seeded nursing-note text was originally attached to Observation.note arrays. Prompt Opinion's data-import pipeline silently strips inline Observation.note on ingest, so read_nursing_signals returned "no notes available" on PO-uploaded patients. Fixed by emitting nursing notes also as standalone FHIR DocumentReference resources (LOINC 11506-3, US Core clinical-note category, base64-encoded text). Importers can drop one path or the other, not both. Production EHRs use DocumentReference for free-text notes anyway, so this is the more portable representation.

3. The autonomous-loop architectural compromise. PO's marketplace is pull-based per chat invocation — there's no callback channel into an inactive PO chat thread, so Vigil's autonomous monitoring can't tick against PO's FHIR. We resolved this by giving Vigil its own embedded HAPI server with a synthetic seeded cohort. The loop ticks against that cohort; per-request skills (screen_vitals, read_nursing_signals, etc.) read PO's FHIR via SHARP. Two surfaces, one agent — explicit in the chat reply via the vigil.list_recent_alerts footnote.

4. Skill routing without overlap. With 18 skills and overlapping clinical vocabulary, the keyword router needed careful ordering. The bare word kid once substring-matched kidney, routing AKI queries to the paediatric screen. Bare help would hijack help me check sepsis. Multi-word phrases and explicit ordering (specific → generic) fixed both.

5. AI Factor without violating the safety invariant. The hackathon rewards generative-AI ambition, but the clinical-safety story rejects autonomous escalation. We resolved this by adding additive AI capabilities — nursing-note NLP, predictive forecasting with confidence intervals, conversational explanation, LLM-ranked differentials — every one of which can be silently disabled and the deterministic core still produces a valid clinical reply. The LLM is a layer on top, never the verdict source.

Accomplishments that we're proud of

  • Eighteen A2A skills spanning the full clinical-deterioration arc — postop, postpartum, paediatric, sepsis, AKI, hemorrhage, drug safety — all citing named published guidelines inline.
  • Comparative validation showing Vigil beats NEWS2-only and qSOFA-only by 1.4 hours on the synthetic cohort lead time, with the assertion gated into pytest.
  • A four-resource FHIR attestation chain (Device + Communication + Provenance + AuditEvent) where every Communication carries the AgentCard SHA-256 in the Provenance signature — production-grade authorship transparency that's rare in healthcare-AI hackathon entries.
  • TRIPOD+AI 2024 compliance. Items 16 (interpretability via Shapley-style attribution), 19 (uncertainty via confidence tags + 95% CI), 22 (performance metrics in the validation harness), and 26 (intended use + version in Provenance writes) are all implemented.
  • A defensible AI-Factor thesis anchored in May-2025 peer-reviewed literature (COMPOSER-LLM in npj Digital Medicine, prospective LLM-vs-LLM comparison at AMIA 2024) rather than vendor decks.
  • 654 passing tests including unit, integration, two-tier validation, and route-level dispatch coverage. Ruff-clean. One pre-existing flake (test_llm_cache_ttl_expiry) documented and unrelated.
  • A v1.0.0 GitHub Release with full submission notes and a signed tag.
  • The clinician-in-the-loop boundary is verified in code. A grep of the A2A agent for FhirClient.post / FhirClient.put returns zero results.

What we learned

  • Deterministic-first is the right shape for clinical AI, even at the cost of an AI-Factor point. Mathur (Cleveland Clinic intensivist on the judging panel) is the heaviest authority in the room on this question, and the architecture matches what the published clinical-AI literature actually wants.
  • A2A protocol fidelity matters for marketplace discovery. The AgentCard at /.well-known/agent-card.json is the de facto manifest. Get it right and PO's runtime auto-discovers; get one field wrong and the agent's invisible.
  • SHARP context propagation works. Three HTTP headers (x-fhir-server-url, x-fhir-access-token, x-patient-id) carry per-request FHIR auth without a custom session layer. The pattern scales naturally to any A2A-aware client.
  • LLM enrichment is genuinely additive when the deterministic core is verbose. Vigil's score_risk chat reply has both the rule-derived rationale (qSOFA 2/3, MEWT 4 breaches) and a 1-2 sentence LLM patient-context interpretation. The clinician gets both; if the LLM call fails, the rule output still lands.
  • PO's chat agent is a layer, not a passthrough. It paraphrases consultee replies aggressively. Designing the structured A2A reply assuming PO will re-render means the careful clinical formatting must survive that paraphrase — which we addressed by leading every reply with a severity badge and citing guidelines inline so PO's summary can't drop them.
  • Synthetic cohorts validate the architecture, not the medicine. Our headline sensitivity 1.000 / specificity 1.000 numbers are illustrative for the hackathon; real validation needs MIMIC-IV or eICU-CRD via PhysioNet credentialing post-submission. The validation harness is the right shape for that day.

What's next for Vigil — Postop & Postpartum Sentinel

Near-term (post-hackathon, 1–3 months):

  • Real prospective validation against MIMIC-IV via PhysioNet credentialing — port the comparative harness to a real cohort with statistical-significance bands.
  • Clinical advisor on record — recruit an attending intensivist or obstetrician co-investigator. The single most leverage-y move from a feasibility-judge perspective.
  • SMART-on-FHIR agent-side authorization (replaces the interim API-key gate). Dynamic client registration + scoped token issuance per workspace.

Mid-term (3–9 months):

  • 510(k) pre-submission package (Q-Sub) referencing TREWS (K-cleared) and COMPOSER-LLM as predicates for Class II SaMD classification.
  • SQLCipher for the review-queue at-rest encryption; production BAA chain with cloud and LLM providers.
  • Pilot deployment at a partner hospital with IRB approval — single ward, single trajectory (postop), 90-day outcome metrics.
  • MIMIC-IV-derived fixtures replacing the synthetic seed cohort for the open-source distribution, with full provenance.

Longer-term (9–18 months):

  • Cross-patient cohort intelligence"three of your postop patients today show similar lactate trajectories — is there a unit-wide issue?" The signal is genuinely novel and rule-engines can't produce it.
  • Closed-loop active learning — chat-side vigil.feedback already collects labelled examples; an offline MLOps pipeline can route them into rule-pack tuning and LLM prompt versioning with proper experiment tracking.
  • EHR integration receipts with Epic, Cerner, MEDITECH — months of vendor coordination, but the SHARP context propagation pattern Vigil ships against is the same one those vendors implement under their SMART-on-FHIR layers.

Vigil's deployment posture is built for incremental hardening, not a rewrite. The hackathon submission is the v1.0.0 of what we plan to keep building.

Built With

Share this project:

Updates