PSE — Safety Substrate for Multi-Agent Healthcare AI

PSE is live in the marketplace now. Go check it out!

Inspiration

Healthcare is rapidly adopting AI agents: a prescriber agent writes orders, a scheduler books procedures, a records-sync agent reconciles external data, an audit agent reviews logs. Each operates in isolation, with its own context, its own assumptions, its own blind spots. *That isolation is the new failure mode no one is designing for.* A prescriber agent doesn't see what the records agent just changed. A scheduler can't see the prescriber's allergy review. A silent records-sync overwrites a documented allergy because some external feed disagrees. The harm comes not from any single agent's mistake, but from the cascade none of them could see.

My inspiration was the layer underneath the agents. Every healthcare action should be intercepted by a single coordination and safety substrate that sees the patient's full state, the resources available, and the recent verdicts from other agents, and reasons about them before anything executes.

What it does

The Patient State Engine (PSE) is an MCP server that sits between every agent and the patient. Every proposed clinical action (prescription, procedure scheduling, record update, lab order, discharge) is routed through validate_action, which returns a structured verdict:

APPROVED, REJECTED, CAUTION, or QUEUED_FOR_REVIEW
Risk score, confidence, full clinical reasoning
Concrete safer alternative when rejecting
Cost note when a generic-equivalent exists (e.g. warfarin $10/pack vs apixaban $450/pack)
Resource allocation naming a specific clinician, facility slot, and drug stock impact
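
A verdict with the fields listed above could be modelled roughly as follows. This is a minimal sketch, not the engine's actual schema: only the four status values come from this write-up, while the class name `Verdict`, every field name, and the example values are illustrative assumptions.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class Status(str, Enum):
    # The four verdict outcomes named in the write-up.
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    CAUTION = "CAUTION"
    QUEUED_FOR_REVIEW = "QUEUED_FOR_REVIEW"


@dataclass
class Verdict:
    # Field names are hypothetical; they mirror the bullet list, not real code.
    status: Status
    risk_score: float
    confidence: float
    reasoning: str
    safer_alternative: Optional[str] = None    # expected when status is REJECTED
    cost_note: Optional[str] = None            # set when a generic equivalent exists
    resource_allocation: Optional[dict] = None  # clinician, facility slot, stock impact

    def to_dict(self) -> dict:
        """Serialise to a plain dict with the status as a string."""
        d = asdict(self)
        d["status"] = self.status.value
        return d


# Illustrative values only; not output from the real engine.
example = Verdict(
    status=Status.REJECTED,
    risk_score=0.9,
    confidence=0.97,
    reasoning="Documented penicillin allergy; Augmentin contains amoxicillin.",
    safer_alternative="Non-penicillin option, subject to clinician review.",
)
```

Keeping the status as a closed enum means a calling agent can branch on four known outcomes instead of parsing free text.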

Beyond per-action safety, PSE includes:

Resource pool: 9 clinicians, 7 facilities, 13 test types, 28 drugs with class + cost. Real allocation, real depletion.
Cross-patient insights: a second LLM pass surfaces patterns the per-action validator can't see, such as outbreak clusters, agent misbehaviour, cost drift, and resource pressure.
Confidence-gated human review: low-confidence verdicts are routed to a clinician review queue rather than auto-approved.
Adversarial-robust: catches brand-name allergen disguise, allergy stripping, prerequisite bypass, record-tamper via event log.
SHARP / FHIR-context native: declares ai.promptopinion/fhir-context with six SMART-on-FHIR scopes — Patient, Condition, AllergyIntolerance, MedicationRequest, Observation, Procedure.
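
The confidence gate can be pictured as a small post-processing step on each verdict. The sketch below is an assumption about how such a gate might look: the function name `apply_confidence_gate`, the dict keys, and the 0.8 threshold are hypothetical, since the write-up does not state the actual cut-off.

```python
def apply_confidence_gate(verdict: dict, threshold: float = 0.8) -> dict:
    """Return a copy of the verdict, rerouted to human review if confidence is low.

    The threshold value is an assumed placeholder, not the engine's real setting.
    A verdict with no confidence field is treated as low confidence.
    """
    gated = dict(verdict)  # never mutate the caller's verdict
    if gated.get("confidence", 0.0) < threshold:
        gated["original_status"] = gated.get("status")
        gated["status"] = "QUEUED_FOR_REVIEW"
    return gated
```

Treating a missing confidence as zero makes the gate fail closed: a malformed verdict ends up in front of a clinician instead of being executed.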

How I built it

Engine: FastAPI server, in-memory state stores for 30 synthetic patients, clinicians, facilities, tests, pharmacy. Audit log persists to JSONL and re-hydrates on restart.
Reasoning: Pure-LLM, Claude Sonnet 4.6 via the Anthropic API. The system prompt enumerates the safety requirements (allergy cross-reactivity families, CKD dose review, procedural prerequisites, sensitive-record protection, resource feasibility, cost awareness, chain coordination). No rule engine.
MCP layer: FastMCP over streamable-HTTP at /mcp. Patches the initialize response so capabilities.extensions includes the ai.promptopinion/fhir-context declaration with required scopes. Honors X-Patient-ID, X-FHIR-Server-URL, X-FHIR-Access-Token headers per tool call.
Insights: separate LLM pass that takes the recent audit log + the cohort + the resource snapshot and returns structured outbreak / agent misbehaviour / cost drift / resource pressure signals.
Public reachability: Render primary + Cloudflare dev.
Frontend: single-file pse-demo.html. Five tabs (Story Mode, Live Demo, Resources, Review Queue, Insights). Story mode includes a "Without PSE / With PSE" toggle that runs the same actions twice for visceral contrast.
Adversarial agent: EvilPrescriberAgent with six attack classes, plus a dedicated demo script that probes the engine.
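
The audit log's persist-and-rehydrate behaviour is simple to sketch with the standard library. The helper names `append_audit` and `rehydrate_audit` and the entry fields are assumptions for illustration; the write-up only states that the log is written as JSONL and reloaded on restart.

```python
import json
from pathlib import Path


def append_audit(path: Path, entry: dict) -> None:
    """Append one audit entry as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def rehydrate_audit(path: Path) -> list[dict]:
    """Load all entries back into memory, skipping blank or corrupt lines."""
    if not path.exists():
        return []
    entries = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                # For example, a partial line left behind by a crash mid-write.
                continue
    return entries
```

One JSON object per line keeps writes append-only, so a restart needs only a single sequential read to rebuild the in-memory log.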

Challenges I ran into

Inverting my own architecture mid-build. My original design was a hybrid rules-first + LLM-second pipeline. Mid-hack I tore the rule engine out and made the LLM the sole reasoner. That gave a better narrative, but it meant rewriting the system prompt to be airtight on every safety axis the rules used to handle.
MCP extension field discovery. Prompt Opinion expects the FHIR extension under capabilities.extensions, not the SDK's experimental bucket. Pydantic v2 silently dropped attribute-set extras. I had to inject directly via __pydantic_extra__ for the JSON serialisation to include it.
Resource ID drift. When asked to allocate clinicians, the LLM kept using human names ("Dr. Patel") instead of resource IDs ("DR-CARDIO-1"). Fixed with a forgiving matcher on the server side plus a tighter prompt.
Demo recording vs free-tier quotas. Mid-recording I hit the Gemini free-tier daily quota on the platform agent's model. As Magnus and Pawan suggested, I solved it by switching to another Gemini account, using previous-session footage for the chat beats, and using the HTML demo for the rest.
Cloudflare tunnel lifetime. Quick tunnels have no uptime guarantee. Resolved by migrating to Render free tier.
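
The server-side "forgiving matcher" for resource ID drift might look something like the sketch below. It is one plausible approach, not the actual implementation: the function name `resolve_clinician`, the pool layout, the pairing of "Dr. Patel" with "DR-CARDIO-1", and the second pool entry are all made up for illustration.

```python
import re
from typing import Optional

# Hypothetical pool: resource ID -> display name.
POOL = {
    "DR-CARDIO-1": "Dr. Patel",
    "DR-NEPHRO-1": "Dr. Okafor",
}


def _norm(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def resolve_clinician(ref: str, pool: dict) -> Optional[str]:
    """Map an LLM-supplied reference (ID or human name) to a resource ID."""
    key = _norm(ref)
    if not key:
        return None
    # Pass 1: exact match on the ID or the name, ignoring case and punctuation.
    for rid, name in pool.items():
        if key == _norm(rid) or key == _norm(name):
            return rid
    # Pass 2: substring match in either direction, accepted only if unambiguous.
    hits = [
        rid for rid, name in pool.items()
        if key in _norm(name) or _norm(name) in key
    ]
    return hits[0] if len(hits) == 1 else None
```

Returning `None` on an ambiguous match is deliberate: for a clinical allocation, a failed lookup that triggers a retry is safer than silently picking the wrong clinician.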

Accomplishments that I'm proud of

Cross-patient insights actually find things. A UTI outbreak cluster. A rogue agent attempting six unsafe actions across patients. Cost-drift on brand-vs-generic anticoagulants. These aren't seeded patterns: *they emerged from real audit data the LLM reasoned over*.
Adversarial robustness. All six red-team attack classes are caught. Brand-name disguise (Augmentin = amoxicillin-clavulanate). Allergy-list stripping framed as an external feed sync. Prereq bypass with fake "ECG faxed externally" notes. All REJECTED.
The "Without PSE / With PSE" toggle. One click visualises the cascade failure and its prevention. The most-watched moment of the demo.
SHARP / FHIR-context done right. Six SMART scopes declared, all authorised via Prompt Opinion's user-consent flow, headers received and acknowledged per tool call.

What I learned

Multi-agent safety is a distinct failure mode. Not just "more bugs at scale" but a genuinely new category that no single agent's pipeline can prevent. This convinced me the coordination substrate has to exist.
Pure-LLM reasoning works when the prompt enumerates the invariants. Sonnet 4.6 gets clinical safety questions right at >0.95 confidence when the system prompt is explicit about allergy classes, CKD dose review, sensitive-field protection, and cross-reactivity families. Vague prompts produced vague verdicts.
The audit log is more than a record — it's the input to the next LLM pass. Cross-patient pattern detection is just running the LLM over the audit + cohort. I didn't expect the patterns to be as sharp as they are.
SHARP extension declaration is elegant. Server declares scopes → user authorises per-scope → headers flow per request. A clean trust model.

What's next for Patient State Engine

Real FHIR backend integration. Replace the in-memory cohort with a patient-ID resolution layer that maps FHIR patient references to the engine's canonical clinical reasoning view.
A2A agent path. Expose the same engine as an A2A-enabled agent for Path B of the challenge ecosystem, so other agents can consult it via Agent-to-Agent rather than only via tool call.
Streaming verdicts. Token-by-token reasoning trace in the response so agents see the engine "thinking" instead of waiting for a full block.
Pluggable formulary / cost catalog. Cost data should come from real pharmacy benefit managers, not hand-coded constants. Same for clinician pool and facility calendar — wire to actual scheduling systems.
Production HIPAA hardening. Audit log encryption, no PII in response strings, configurable retention.
Replay mode. Feed an existing audit JSONL back into the UI for post-incident analysis — judge the engine on a historical dataset.
Multi-tenant deployment. Per-organisation cohorts, formularies, and policy thresholds (e.g. confidence gate).

Operational posture & deployment path

PSE ships as a Python service (FastAPI engine + FastMCP server). For the hackathon judging window I ran a tiered deployment:

Primary endpoint (durable): a Render web service deployed from this repository via the committed render.yaml. Stable HTTPS URL, auto-restart on crash, and independent of any local machine: it survives laptop sleep, network changes, and operator absence.
Backup endpoint (development, fast iteration): a Cloudflare Quick Tunnel to a locally-running engine. Stands up a TLS-terminated public URL in seconds with no account required — useful for live debugging and verifying changes pre-deploy.

Render deployment specifics

Service configuration (see render.yaml at repo root):

Runtime: Python 3.11, free tier
Build: pip install -r requirements.txt
Start: bash start_render.sh — boots FastAPI on 127.0.0.1:8001 internally, polls /health until ready, then runs the MCP server bound to 0.0.0.0:$PORT (the public port Render injects). Internal REST stays local; only the MCP transport is publicly exposed.
Secrets: ANTHROPIC_API_KEY is set in the Render dashboard and never committed.
Cold start: Render's free tier spins the service down after 15 minutes of inactivity. First request after spin-down takes ~30 seconds; subsequent requests are warm. The Starter plan ($7/mo) eliminates cold starts, but it is not required for hackathon judging.
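
The "poll /health until ready" step in the start script is written in bash in the repository; the same idea is sketched here in Python for illustration. The function names, attempt count, and delay are assumptions, and only the internal URL `http://127.0.0.1:8001/health` follows from the description above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 30,
                       delay: float = 1.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Usage sketch: only start the public MCP server once the internal API is up.
# if wait_until_healthy(lambda: http_ok("http://127.0.0.1:8001/health")):
#     ...launch the MCP server...
```

Passing the probe in as a callable keeps the retry loop testable without a running server.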

Cloudflare Quick Tunnel (backup) specifics

cloudflared tunnel --url http://localhost:8002 exposes the local engine for development. Tunnel life is bounded by the local process and network connectivity; Cloudflare openly notes Quick Tunnels carry no uptime guarantee. The local stack is hardened with nohup + disown on both server processes (REST on :8001, MCP on :8002) and macOS caffeinate -d -i -s to suppress display, idle, and system sleep.

Deployment scope

The architecture cleanly separates application logic from deployment glue:

$$ T_{\text{deploy}} \;\approx\; 15 \text{ to } 25 \text{ minutes}, \quad \Delta_{\text{code, application}} \;=\; 0 \text{ lines}, \quad \Delta_{\text{code, deploy-config}} \;=\; 3 \text{ files}. $$

The three deploy-config files (render.yaml, start_render.sh, and a one-line $PORT fallback in server/mcp_server.py) are committed (4973601) and visible to judges in the public repository.
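
A one-line `$PORT` fallback of the kind mentioned above typically reads the platform-injected variable and defaults to the local development port. The sketch below is an assumption about its shape, not the committed code; the helper name `resolve_port` is hypothetical, and the default of 8002 is taken from the local MCP port described earlier.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ, default: int = 8002) -> int:
    """Use the platform-injected PORT if set and non-empty, else the local port."""
    raw = env.get("PORT")
    return int(raw) if raw else default
```

With this in place the same entry point runs unchanged on a laptop and on Render, which is why the application code itself needs no edits.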

Availability matrix

| Surface | Submission | Stage One verify | Stage Two judging |
| --- | --- | --- | --- |
| Marketplace listing | ✅ | ✅ | ✅ |
| GitHub repository | ✅ | ✅ | ✅ |
| Demo video on YouTube | ✅ | ✅ | ✅ |
| HTML demo (`pse-demo.html`) | ✅ | ✅ | ✅ |
| MCP tool invocation (Render primary) | ✅ | ✅ | ✅ (≤50s cold start) |
| MCP tool invocation (Cloudflare backup) | ✅ | ✅ | best-effort |
| `/insights` cross-patient LLM call | ✅ | ✅ | ✅ |

The demo video documents every claimed capability end-to-end. With the Render deployment, AI Factor, Potential Impact, and Feasibility all score against a live, durable, publicly-reachable service — not a laptop-anchored demo.

Built With

cloudflare
css
fast-api
html
javascript
mcp
promptopinion
pydantic
python
render

Updates

Supriya Rai started this project — May 11, 2026 04:24 PM EDT
