TriageFlow

Multi-agent clinical triage that reasons like a clinician — including the patterns of patients medicine has historically missed.

Built on Prompt Opinion's Agent-to-Agent (A2A) infrastructure for the Agents Assemble: Healthcare AI Endgame hackathon. Track: A2A Agent.

Inspiration

Chest pain in women is the most-missed serious diagnosis in primary care.

Women presenting with pulmonary embolism are diagnosed later than men. Women presenting with myocardial infarction are more likely to be told it's anxiety. Perimenopausal women, in particular, present with "atypical" symptoms — pressure, fatigue, breathlessness — that don't match the textbook male-pattern crushing chest pain. The result is a measurable gap in time-to-diagnosis, time-to-imaging, and time-to-treatment along sex lines.

This is not a model bias problem. It is a workflow problem. Triage tools encode the patterns the literature was built on, and the literature was built on men.

TriageFlow encodes sex-specific clinical reasoning into the triage layer itself — and crucially, it does so as a deployable architectural commitment, not a clever prompt.

What it does

TriageFlow is a four-agent clinical workflow:

| Agent | Role |
| --- | --- |
| Triage Reasoner | Entry point. Reads patient FHIR data plus QuestionnaireResponse answers. Computes Wells PE score and HEART score with explicit point-by-point arithmetic. Recommends one of four dispositions: ED, Specialist, GP, Discharge. |
| ED Handoff Coordinator | Generates SBAR-format emergency department handoff packets for the receiving emergency physician. |
| Referral Letter Generator | Generates formal specialist referral letters with a single-sentence Clinical Question and specific answerable questions for the consultant. |
| Patient Education Generator | Generates plain-language take-home content at 6th–8th grade reading level with explicit return-to-ED criteria. |

The agents are wired via A2A. When Triage Reasoner recommends an ED disposition, it programmatically constructs a structured handoff payload — patient ID, working diagnosis, Wells score, key findings — and invokes the ED Handoff Coordinator. The same flow handles Specialist Referral and GP/Discharge paths.
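As a rough illustration of the handoff described above, the payload and routing step might look like the following. The skill names are the ones registered in this project; the field names, the HandoffPayload class, the build_handoff helper, and the Discharge-to-education mapping are assumptions made for this sketch, not Prompt Opinion's A2A API.

```python
from dataclasses import asdict, dataclass, field
from typing import List

# Skill IDs come from the write-up. The mapping itself is an assumption:
# in particular, routing "Discharge" to patient education is a guess.
SKILL_FOR_DISPOSITION = {
    "ED": "ed-handoff-coordinator",
    "Specialist": "referral-letter-generator",
    "GP": "patient-education-generator",
    "Discharge": "patient-education-generator",
}


@dataclass
class HandoffPayload:
    """Hypothetical payload shape: the four fields named in the text."""
    patient_id: str
    working_diagnosis: str
    wells_score: float
    key_findings: List[str] = field(default_factory=list)


def build_handoff(disposition: str, payload: HandoffPayload) -> dict:
    """Pick the downstream skill and return a JSON-serialisable message."""
    if disposition not in SKILL_FOR_DISPOSITION:
        raise ValueError(f"unknown disposition: {disposition!r}")
    return {
        "skill": SKILL_FOR_DISPOSITION[disposition],
        "payload": asdict(payload),
    }


message = build_handoff(
    "ED",
    HandoffPayload(
        patient_id="example-patient-id",  # hypothetical ID
        working_diagnosis="Suspected pulmonary embolism",
        wells_score=7.5,
        key_findings=["right calf soreness", "recent 15-hour flight"],
    ),
)
print(message["skill"])
```

Keeping the payload a plain dataclass means the same object can be logged for audit and serialised for the receiving agent without a second representation.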

The clinical case it handles

Demo patient: Maria Elena Reyes, 47-year-old perimenopausal woman, chest pain workup. Two variants of patient-reported answers produce two completely different paths through the same four-agent system.

Variant A (high-risk): Sharp pleuritic pain, right calf soreness for 3-4 days, a recent 15-hour flight, a recent tooth extraction followed by bed rest, combined oral contraceptive pills (OCPs), elevated D-dimer (720 ng/mL), normal troponin, and a normal ECG. Patient-reported answers from the QuestionnaireResponse are treated as authoritative clinical data. The Triage Reasoner computes a Wells score of 7.5 (high probability), recommends ED, and hands off to the ED Handoff Coordinator. Output: an SBAR packet with explicit perimenopausal atypical-presentation framing for the receiving emergency physician.

Variant B (lower-risk): Tightness rather than pleuritic pain, no calf symptoms, no immobility concerns, and no clotting history. The Triage Reasoner computes a Wells score of 0 (low probability), recommends GP follow-up, and hands off to the Patient Education Generator. Output: plain-language patient education content that includes the message "Your pain is real" and specific return-to-ED criteria.

Same patient. Same architecture. Two paths. That is the value of agentic triage over a static decision tree.

How we built it

Platform: Prompt Opinion. Each agent is configured via BYO Agents with a system prompt, A2A skill registration, and FHIR Context Extension enabled.

Data layer: A patient FHIR bundle (Patient; Encounter; Observations for vitals, labs, and ECG; Conditions for perimenopause and HTN; MedicationStatements for OCP and Lisinopril; DocumentReference for the triage intake note) is staged in Prompt Opinion's FHIR server. The QuestionnaireResponse is uploaded through the OAuth client_credentials flow using the SMART-on-FHIR scopes system/Patient.read and system/QuestionnaireResponse.write.
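To make the uploaded resource concrete, here is a minimal sketch of how a FHIR R4 QuestionnaireResponse could be assembled before upload. The resource fields (resourceType, status, subject, item, linkId, answer, valueBoolean, valueString) are standard FHIR R4; the helper name, the patient ID, the linkIds, and the question texts are hypothetical and only stand in for the project's real intake form.

```python
import json


def _answer(value):
    """Wrap a Python value in a FHIR value[x] entry (bool and str only)."""
    if isinstance(value, bool):
        return {"valueBoolean": value}
    return {"valueString": str(value)}


def build_questionnaire_response(patient_id, answers):
    """Build a QuestionnaireResponse dict.

    answers: iterable of (link_id, question_text, value) tuples.
    """
    return {
        "resourceType": "QuestionnaireResponse",
        "status": "completed",
        "subject": {"reference": f"Patient/{patient_id}"},
        "item": [
            {"linkId": link_id, "text": text, "answer": [_answer(value)]}
            for link_id, text, value in answers
        ],
    }


# Hypothetical linkIds and question wording, for illustration only.
qr = build_questionnaire_response(
    "example-patient-id",
    [
        ("Q3", "Any calf pain or swelling?", "right calf soreness for 3-4 days"),
        ("Q5", "Any long flight recently?", True),
    ],
)
print(json.dumps(qr, indent=2))
```

Stable linkIds matter more than the question wording: they are what a downstream citation can point back to.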

Reasoning layer: Each agent has a structured system prompt with explicit data-integration checklists, output formats, and critical constraints. The Triage Reasoner enforces a Data Integration Checklist section before any score is computed — forcing the agent to enumerate which Observations, Conditions, Medications, and QR answers it found, then cite each Wells criterion to its source.
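In TriageFlow the checklist is enforced inside the prompt. The same gate can also be expressed as a post-hoc check on structured agent output, which may be easier to test. The sketch below assumes a hypothetical checklist dictionary with one key per data category; the section names and helper functions are illustrative and are not taken from the project's prompts.

```python
# Hypothetical section names mirroring the four categories in the text.
REQUIRED_SECTIONS = ("observations", "conditions", "medications", "qr_answers")


class ChecklistIncomplete(Exception):
    """Raised when scoring is attempted before every section is enumerated."""


def missing_sections(checklist):
    """Return required sections the agent did not enumerate at all.

    An empty list still counts as enumerated: "looked, found nothing"
    is different from "never looked".
    """
    return [name for name in REQUIRED_SECTIONS if name not in checklist]


def require_checklist(checklist):
    """Gate: return the checklist unchanged, or refuse to proceed."""
    gaps = missing_sections(checklist)
    if gaps:
        raise ChecklistIncomplete("enumerate before scoring: " + ", ".join(gaps))
    return checklist


complete = {
    "observations": ["D-dimer 720 ng/mL", "troponin normal", "ECG normal"],
    "conditions": ["perimenopause", "HTN"],
    "medications": ["combined OCP", "Lisinopril"],
    "qr_answers": ["right calf soreness for 3-4 days"],
}
partial = {"observations": ["D-dimer 720 ng/mL"], "conditions": []}

print(missing_sections(complete))  # nothing missing
print(missing_sections(partial))   # medications and qr_answers never enumerated
```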

Sex-specific reasoning: Baked directly into the prompts. "When QR answers are present, they OVERRIDE absence-of-finding assumptions in the FHIR record." Patient-reported symptoms are treated as first-class clinical data.

A2A wiring: Each downstream agent registers an A2A skill (ed-handoff-coordinator, referral-letter-generator, patient-education-generator) with FHIR Context Extension. The Triage Reasoner constructs structured handoff payloads, and the receiving agent reads the FHIR context, patient record, and handoff payload to produce its output.

Architectural pattern: data-source independence

The Triage Reasoner's system prompt accepts patient-reported QuestionnaireResponse data from EITHER FHIR tool retrieval OR inline prompt injection — both paths are treated as equivalent first-class clinical data sources.

This is deliberate. Real clinical AI deployments rarely have a single canonical data path. Patient-reported data arrives via intake tablets that write to FHIR, via care coordinator notes, via referral letters with embedded structured forms, via mobile app responses synced asynchronously, and increasingly via voice-AI intake workflows. A clinical reasoning agent that depends on exactly one tool's exactly one query path is brittle in any real deployment.

TriageFlow's prompt-level design — "patient-reported answers may arrive via FHIR tools OR inline; treat both as authoritative" — makes the agent portable across integrating workflows. The same Triage Reasoner produces identical Wells score reasoning whether QR data is fetched by GetPatientData, fetched by a custom MCP server, attached to a multi-agent A2A payload, or written into the user prompt by an upstream intake system. What matters clinically is provenance and authority — which the audit substrate captures via FHIR resource IDs and QR linkIds in every output. The retrieval path is plumbing.
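One way to picture the pattern outside the prompt is a pair of adapters that reduce both arrival paths to the same shape. This is a sketch under assumptions: the inline format ("Q3: answer text", one per line) is invented for the example, and the function names are hypothetical. Only the FHIR item/linkId/answer structure is standard.

```python
import re


def from_fhir(qr):
    """Flatten a FHIR QuestionnaireResponse into answer records."""
    records = []
    for item in qr.get("item", []):
        for ans in item.get("answer", []):
            # Take the first value[x] field, whatever its FHIR type.
            value = next((v for k, v in ans.items() if k.startswith("value")), None)
            records.append({"link_id": item["linkId"], "value": value, "origin": "fhir"})
    return records


# Assumed inline convention: "<linkId>: <answer>" on its own line.
INLINE_LINE = re.compile(r"^\s*(Q\d+)\s*:\s*(.+?)\s*$")


def from_inline(text):
    """Parse inline answers out of prompt text; other lines are ignored."""
    records = []
    for line in text.splitlines():
        match = INLINE_LINE.match(line)
        if match:
            records.append(
                {"link_id": match.group(1), "value": match.group(2), "origin": "inline"}
            )
    return records


def clinical_view(records):
    """Drop the retrieval path and keep only what scoring depends on."""
    return {r["link_id"]: r["value"] for r in records}


fhir_qr = {
    "resourceType": "QuestionnaireResponse",
    "status": "completed",
    "item": [
        {"linkId": "Q3", "answer": [{"valueString": "right calf soreness for 3-4 days"}]}
    ],
}
inline_text = "Patient-reported answers:\nQ3: right calf soreness for 3-4 days\n"

same = clinical_view(from_fhir(fhir_qr)) == clinical_view(from_inline(inline_text))
print(same)
```

The origin field is kept on each record so the audit trail can still say how an answer arrived, even though scoring ignores it.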

Safety substrate and Feasibility

TriageFlow is designed to be deployable, not just demoable.

Deterministic clinical math, never LLM math. The Wells PE score and HEART score are reference-implemented as wells_scorer.py, a deterministic Python module published in the repository. The LLM-based Triage Reasoner extracts structured evidence — calf signs (boolean), recent surgery (boolean), HR > 100 (boolean) — from FHIR observations and QR answers. The scorer computes the math. No autonomous LLM arithmetic in the critical path. The scorer also encodes a safety floor that escalates to ED on isolated DVT signs regardless of other inputs. Three passing unit tests reproduce Variant A → Wells 7.5 → ED, Variant B → Wells 0 → GP, and the safety floor enforcement.
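The following is a minimal sketch of what a deterministic scorer of this kind could look like. It is not the repository's wells_scorer.py. The point values are the standard Wells PE criteria; the evidence key names, the choice of which criteria fire for Variant A (one combination consistent with the stated 7.5), and the disposition cut-off are assumptions. The cut-off in particular is a placeholder for illustration, not clinical guidance.

```python
# Standard Wells PE criteria and point values. Key names are hypothetical.
WELLS_POINTS = {
    "dvt_signs": 3.0,
    "pe_most_likely": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}


def wells_score(evidence):
    """Sum the points for every criterion flagged True in the evidence."""
    unknown = set(evidence) - set(WELLS_POINTS)
    if unknown:
        raise KeyError(f"unrecognised criteria: {sorted(unknown)}")
    return sum(
        (pts for name, pts in WELLS_POINTS.items() if evidence.get(name, False)),
        0.0,
    )


def wells_tier(score):
    """Three-tier interpretation: low (<2), moderate (2 to 6), high (>6)."""
    if score > 6:
        return "high"
    if score >= 2:
        return "moderate"
    return "low"


def disposition(evidence):
    """Placeholder routing rule plus the safety floor described in the text."""
    if evidence.get("dvt_signs", False):
        return "ED"  # safety floor: DVT signs escalate regardless of total
    # Assumed cut-off (two-tier "PE likely" threshold); not the project's rule.
    return "ED" if wells_score(evidence) > 4 else "GP"


# Assumed reconstruction of the two demo variants.
variant_a = {"dvt_signs": True, "pe_most_likely": True, "immobilization_or_surgery": True}
variant_b = {"dvt_signs": False, "pe_most_likely": False, "immobilization_or_surgery": False}
isolated_dvt = {"dvt_signs": True}

for name, ev in [("A", variant_a), ("B", variant_b), ("floor", isolated_dvt)]:
    print(name, wells_score(ev), wells_tier(wells_score(ev)), disposition(ev))
```

Because the scorer only accepts booleans, the LLM's job narrows to evidence extraction, and any arithmetic error becomes a unit-test failure rather than a prompt problem.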

Audit substrate by default. Every output traces every claim to a FHIR resource ID or QuestionnaireResponse linkId. Not a clinical readability feature — the audit substrate. Example from a live output: "Clinical signs of DVT: 3.0 pts (Source: QR Q3 — patient reports right calf soreness for 3-4 days following a long flight)." A regulator, malpractice attorney, or clinical informaticist auditing a disposition decision can trace it from action back to source data within seconds.
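A structured form of such a trace line makes it possible to check mechanically that every cited source actually exists. The sketch below is an assumption about how that could be modelled: the AuditEntry class, its field names, the simplified rendering, and the resource ID obs-unknown are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    claim: str
    points: float
    source_kind: str  # "FHIR" (resource ID) or "QR" (linkId)
    source_id: str
    detail: str

    def render(self):
        """Render a one-line trace similar in spirit to the live output."""
        return (
            f"{self.claim}: {self.points:.1f} pts "
            f"(Source: {self.source_kind} {self.source_id}, {self.detail})"
        )


def unresolved(entries, known_sources):
    """Return entries whose cited source is not among the known sources."""
    return [e for e in entries if (e.source_kind, e.source_id) not in known_sources]


entries = [
    AuditEntry(
        "Clinical signs of DVT", 3.0, "QR", "Q3",
        "patient reports right calf soreness for 3-4 days",
    ),
    AuditEntry("Heart rate > 100", 0.0, "FHIR", "obs-unknown", "not retrieved"),
]
known_sources = {("QR", "Q3")}

print(entries[0].render())
print([e.claim for e in unresolved(entries, known_sources)])
```

A non-empty result from unresolved is a useful release blocker: a claim that cites a source nobody can find is not auditable.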

21st Century Cures Act §3060 alignment (FDA non-device CDS). TriageFlow is designed to qualify as non-device CDS: it does not analyze medical images or signals; it displays and analyzes clinical information; it is intended for healthcare providers, not direct patient action; and, crucially, it enables independent review. Every recommendation cites the underlying FHIR resource, so a clinician can independently review the basis for any recommendation in seconds, by design.

Clinician-in-the-loop architecture. Triage Reasoner emits a recommendation, not a decision. ED Handoff emits a draft for sign-off. Referral Letter emits a draft for clinician approval. Patient Education emits content subject to clinician approval before patient release. The clinician is not in the loop because we added a checkbox — the architecture assumes their final action.
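The "draft until signed" idea can be sketched as a small type whose content cannot be released without an approver. This is an illustrative model only; the Draft class, its methods, and the clinician ID are hypothetical and are not part of the Prompt Opinion platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    """Agent output that stays a draft until a clinician signs it off."""
    kind: str  # e.g. "ed-handoff", "referral-letter", "patient-education"
    body: str
    approved_by: Optional[str] = None

    def approve(self, clinician_id):
        if not clinician_id:
            raise ValueError("approval requires a clinician identifier")
        self.approved_by = clinician_id

    def release(self):
        """Return the content only after sign-off; otherwise refuse."""
        if self.approved_by is None:
            raise PermissionError("draft requires clinician sign-off before release")
        return self.body


draft = Draft(kind="patient-education", body="Your pain is real.")
try:
    draft.release()
    blocked = False
except PermissionError:
    blocked = True

draft.approve("clinician-001")  # hypothetical clinician ID
released = draft.release()
print(blocked, released)
```

Putting the check in release rather than in a UI checkbox means no code path can reach the patient without passing through it.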

Least-privilege scopes + synthetic data. OAuth client uses only system/Patient.read + system/QuestionnaireResponse.write. No blanket write access. Synthetic patient data only. No PHI.

Real-world adoption context. Government hospitals in eastern India are actively piloting LLM-based clinical decision-support systems. TriageFlow's architectural pattern — deterministic safety substrate plus LLM-extracted evidence plus full audit trail plus clinician final authority — fits the deployment shape these pilots are converging on. This is not theoretical infrastructure.

Challenges

FHIR API discovery. Prompt Opinion's FHIR base URL pattern (/api/workspaces/{workspace_id}/fhir) is not the conventional SMART-on-FHIR pattern. It took significant debugging of "non-JSON responses" before we found the URL under Settings → General → URLs.

Scope discovery for QuestionnaireResponse upload. The default po_fhir scope alone doesn't grant resource-level FHIR permissions. The fix: explicit SMART-on-FHIR scopes (system/Patient.read system/QuestionnaireResponse.write) on the client_credentials request.
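For readers reproducing this, the request pieces can be sketched without touching the network. The two scopes and the /api/workspaces/{workspace_id}/fhir path pattern come from this write-up; the host name, the workspace ID, the credentials, and the choice to send credentials in the form body (rather than an HTTP Basic header or a signed assertion) are assumptions, and the token endpoint itself is deliberately not shown because it is not stated here.

```python
from urllib.parse import parse_qs, urlencode

# Scopes named in the write-up; nothing broader is requested.
SCOPES = ["system/Patient.read", "system/QuestionnaireResponse.write"]


def token_request_body(client_id, client_secret, scopes=SCOPES):
    """Form-encode an OAuth 2.0 client_credentials request with explicit scopes."""
    return urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": " ".join(scopes),  # OAuth scopes are space-delimited
        }
    )


def fhir_url(base_url, workspace_id, resource_type):
    """Build a resource URL using the workspace-scoped path pattern."""
    return f"{base_url.rstrip('/')}/api/workspaces/{workspace_id}/fhir/{resource_type}"


body = token_request_body("example-client", "example-secret")  # placeholder credentials
decoded = parse_qs(body)
upload_url = fhir_url("https://example.invalid/", "ws-123", "QuestionnaireResponse")

print(decoded["scope"][0])
print(upload_url)
```

Decoding the body back with parse_qs is a cheap way to confirm the scope string survived encoding before debugging a 403 from the server.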

Agent tool limitations vs prompt design. The default GetPatientData tool fetches Observations, Conditions, and MedicationStatements — but not QuestionnaireResponse resources. Rather than wait for tool changes, the prompts are designed to be resilient: QR data may arrive via FHIR tools OR inline in the user prompt. Either way, treat as authoritative. This is the data-source independence pattern described above.

Sex-specific reasoning under workflow pressure. The hard part wasn't getting the agent to compute Wells — it was getting it to NOT skip the calf-symptom point when the only evidence was a patient-reported QR answer rather than a structured Observation. Fix: an explicit prompt clause stating that QR answers OVERRIDE absence-of-finding assumptions. Without it, the agent defaulted to "calves non-tender: 0 pts." With it, the agent correctly awards 3.0 points and cites the QR answer as the source — flipping Wells from 4.5 to 7.5 and changing time-to-CT-PA decisions.
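The override rule is easy to state as code, which also makes the 4.5 to 7.5 flip reproducible. In this sketch a finding that is simply absent from the record is treated as unknown rather than negative, and a QuestionnaireResponse answer takes precedence whenever one exists. The function names, the data shapes, and the handling of an explicit FHIR-versus-QR contradiction (QR wins) are assumptions; the write-up only specifies the absence-of-finding case.

```python
DVT_SIGNS_POINTS = 3.0  # standard Wells value for clinical signs of DVT


def resolve_finding(name, fhir_findings, qr_answers):
    """Return (present, source) for one finding.

    fhir_findings: {finding: (resource_id, bool)} for documented findings only.
    qr_answers:    {finding: (link_id, bool)} for patient-reported answers.
    """
    if name in qr_answers:
        link_id, value = qr_answers[name]
        return value, f"QR {link_id}"
    if name in fhir_findings:
        resource_id, value = fhir_findings[name]
        return value, f"FHIR {resource_id}"
    # Nothing documented anywhere: score zero, but say so explicitly.
    return False, "no evidence found"


def wells_with_dvt(other_points, fhir_findings, qr_answers):
    """Add the DVT-signs criterion to the points from all other criteria."""
    present, source = resolve_finding("dvt_signs", fhir_findings, qr_answers)
    return other_points + (DVT_SIGNS_POINTS if present else 0.0), source


# No structured Observation for calf findings; the QR answer is the only evidence.
without_qr = wells_with_dvt(4.5, fhir_findings={}, qr_answers={})
with_qr = wells_with_dvt(4.5, fhir_findings={}, qr_answers={"dvt_signs": ("Q3", True)})

print(without_qr)
print(with_qr)
```

Returning the source alongside the boolean keeps the "0 pts" case honest: the output records that nothing was found, not that the calves were examined and normal.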

What we learned

  • Multi-agent systems aren't smarter than single agents — they're more inspectable. A clinician can audit which agent was invoked, why, and what it received. That's hard to achieve with a single monolithic prompt.
  • Prompt engineering for clinical reasoning is mostly forcing structure, not adding cleverness. "Output a Data Integration Checklist before any score" did more for output quality than any rewording of clinical instructions.
  • Sex-specific reasoning is a prompt-engineering problem as much as a data problem. Once you tell the agent that patient-reported answers OVERRIDE absence-of-finding assumptions, and that perimenopausal women present atypically, the reasoning shifts dramatically without changing models, adding examples, or fine-tuning.

What's next

  • Extend Triage Reasoner to additional chief complaints (abdominal pain, dyspnea, headache)
  • Clinician feedback loop: capture which dispositions clinicians override and why
  • Outcome tracking: connect to follow-up data (was the PE actually confirmed? Did the referral go through?)
  • Other underdiagnosed populations: elderly atypical sepsis, pediatric Kawasaki disease

We didn't build a chatbot. We built clinical infrastructure that respects how real clinicians think — and how real patients present, including the ones medicine has historically been worst at hearing.

Built With

  • a2a-protocol
  • agent-to-agent
  • clinical-decision-support
  • fda-cures-act
  • fhir-r4
  • gemini
  • healthcare-ai
  • heart-score
  • multi-agent-systems
  • oauth-2.0
  • prompt-opinion
  • python
  • sbar-handoff
  • sharp-extension
  • smart-on-fhir
  • wells-score
  • womens-health