Neurophrax — A medication-safety reasoning layer for healthcare AI agents

Inspiration

Medication errors hospitalize over a million Americans every year, and the AHRQ estimates more than half are preventable. The safety check that should catch them — is this combination an interaction? a Beers violation? a duplicate antithrombotic class? a boxed warning? — lives across seven different databases. No clinician opens all seven, and no AI agent can reason across them without hallucinating.

We built Neurophrax to be the missing reasoning layer.

What it does

Neurophrax is an MCP server that gives any healthcare AI agent a single call for multi-source medication safety synthesis. It cross-references seven authoritative sources — DDInter 2.0 (~160k drug-drug interactions), AGS Beers Criteria 2023, the WHO ATC therapeutic class hierarchy, FDA boxed warnings, RxClass, OpenFDA labels, and the patient's live FHIR chart — and returns a prioritized risk list plus a FHIR RiskAssessment resource ready to persist back to the EHR.

This is the AI Factor: without a layer like this, an AI agent doing clinical reasoning either hallucinates drug facts or has to orchestrate a dozen API calls and reconcile their formats — neither is safe. With Neurophrax, the agent calls one tool and gets ground-truth multi-source synthesis it can build a recommendation on.

Examples of what the synthesis catches that single-source tools miss:

Warfarin + apixaban on the same chart → both are antithrombotics in ATC class B01A, even though they're in different L4 chemical subgroups (B01AA vs B01AF). DDInter alone doesn't flag this as a "duplicate," because they're different molecules. Beers alone doesn't flag it. Only class-level synthesis catches it.
Diphenhydramine in a 67-year-old → Beers 2023 flags this anticholinergic for fall and delirium risk. When the flag is cross-referenced against the patient's documented fall history, the severity escalates.
Ciprofloxacin in a warfarin patient with G6PD deficiency → Three independent signals stacked: FDA boxed warning (tendinopathy/CNS in elderly), known interaction (raises INR), and G6PD-deficient population risk (hemolysis). One tool call surfaces all three with citations.
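
The class-level check in the warfarin + apixaban example can be sketched in a few lines. This is a minimal illustration, not the Neurophrax implementation: the ATC_CODES lookup table and the shared_classes function are hypothetical names, and the table is a tiny hand-picked subset rather than the full WHO ATC index. ATC codes are hierarchical, so a level is just a prefix length.

```python
from collections import defaultdict

# ATC codes are hierarchical: level N is a fixed-length prefix of the full code.
ATC_PREFIX_LEN = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}

# Hypothetical lookup (illustrative subset): drug name -> ATC level-5 code.
ATC_CODES = {
    "warfarin": "B01AA03",
    "apixaban": "B01AF02",
    "sertraline": "N06AB06",
}


def shared_classes(drugs, level=3):
    """Group drugs by ATC prefix at `level`; return (duplicates, unresolved)."""
    width = ATC_PREFIX_LEN[level]
    groups = defaultdict(list)
    unresolved = []
    for drug in drugs:
        code = ATC_CODES.get(drug.lower())
        if code is None:
            # Surface unknown drugs instead of silently dropping them.
            unresolved.append(drug)
            continue
        groups[code[:width]].append(drug)
    duplicates = {cls: sorted(m) for cls, m in groups.items() if len(m) > 1}
    return duplicates, unresolved


chart = ["warfarin", "apixaban", "sertraline"]
print(shared_classes(chart, level=4))  # ({}, [])
print(shared_classes(chart, level=3))  # ({'B01A': ['apixaban', 'warfarin']}, [])
```

At level 4 the two anticoagulants land in different groups (B01AA and B01AF) and nothing is flagged; at level 3 they share B01A and the duplicate appears.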

The MCP server exposes nine tools to AI agents:

prescription_safety_brief — the composer that ties it all together (safety review + PubMed evidence per finding + active trials + ACIP vaccines)
medication_safety_review — pairwise interactions, Beers flags, duplicate therapeutic class, boxed warnings, pregnancy flags; prioritized by severity
drug_interaction_check, medication_profile, patient_summary — primitives the agent reaches for when the question is narrower
clinical_evidence_search (NCBI PubMed E-Utilities, filterable by study type and age), clinical_trial_search (ClinicalTrials.gov v2)
vaccine_recommendations (CDC ACIP), symptom_assessment (ICD-10 and SNOMED via NIH Clinical Tables)

Every patient-aware tool reads SHARP-on-MCP headers (x-fhir-server-url, x-fhir-access-token, x-patient-id), pulls the chart, and respects explicit-arg overrides for cases where the agent has the data inline.
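
As a rough sketch of how a tool might turn those headers into a chart context: the header names come from the text above, while FhirContext, resolve_context, and jwt_patient_claim are hypothetical names. Reading the patient from a JWT claim named "patient" is an assumption made for illustration, and the token is only decoded here, not verified.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FhirContext:
    server_url: str
    access_token: str
    patient_id: str


def jwt_patient_claim(token: str) -> Optional[str]:
    """Return the 'patient' claim if the token looks like a JWT (assumed claim name).

    The payload is decoded WITHOUT signature verification; a real server
    would have to validate the token before trusting any claim.
    """
    parts = token.split(".")
    if len(parts) != 3:
        return None  # opaque token, not a JWT
    padded = parts[1] + "=" * (-len(parts[1]) % 4)  # restore base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return None
    patient = claims.get("patient") if isinstance(claims, dict) else None
    return patient if isinstance(patient, str) else None


def resolve_context(headers: dict) -> Optional[FhirContext]:
    """Build a FHIR context from request headers, or None if it is incomplete."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    url = h.get("x-fhir-server-url")
    token = h.get("x-fhir-access-token")
    if not url or not token:
        return None
    # A patient claim inside the token wins over the explicit header.
    patient = jwt_patient_claim(token) or h.get("x-patient-id")
    if not patient:
        return None
    return FhirContext(url.rstrip("/"), token, patient)
```

Returning None for an incomplete context lets the tool fall back to whatever the agent passed inline rather than guessing at a patient.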

How we built it

Stack: Python 3.13, FastMCP, async httpx with parallel fan-out to the public NIH/FDA APIs listed under Data sources, Pydantic v2, SQLite for offline bundled datasets (~30 MB)
Deployment: GitHub → Render auto-deploy
Data sources: NIH RxNorm, OpenFDA, NIH Clinical Tables, RxClass, MedlinePlus Connect, DailyMed, NCBI PubMed, ClinicalTrials.gov v2, AGS Beers Criteria 2023 (bundled), DDInter 2.0 (bundled, ~160k pairs), CDC ACIP Adult Immunization Schedule (bundled)
SHARP-on-MCP compliance is end-to-end. Every patient-aware tool reads x-fhir-server-url, x-fhir-access-token, and x-patient-id. The JWT patient claim takes precedence over the explicit header, per the spec.
FHIR-native read and emit. Chart reads pull MedicationStatement with a MedicationRequest fallback (because many EHR FHIR servers populate provider intent, not patient attestation), Condition, and AllergyIntolerance in parallel with graceful per-resource degradation. Every safety review emits a FHIR RiskAssessment resource with one prediction per finding — ready to persist back without an adapter layer.
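
To make "one prediction per finding" concrete, here is a minimal sketch of the emit step using plain dicts. The finding shape (summary, severity, source), the severity labels, and the to_risk_assessment name are all assumptions for illustration; the resource fields and the risk-probability code system are from FHIR R4.

```python
from datetime import datetime, timezone

# FHIR R4 code system for RiskAssessment.prediction.qualitativeRisk.
RISK_SYSTEM = "http://terminology.hl7.org/CodeSystem/risk-probability"

# Assumed mapping from internal severity labels to FHIR risk codes.
SEVERITY_TO_RISK = {"major": "high", "moderate": "moderate", "minor": "low"}


def to_risk_assessment(patient_id: str, findings: list) -> dict:
    """Build a RiskAssessment dict with one prediction per finding."""
    predictions = []
    for finding in findings:
        # An unknown severity raises KeyError rather than being downgraded silently.
        code = SEVERITY_TO_RISK[finding["severity"]]
        predictions.append(
            {
                "outcome": {"text": finding["summary"]},
                "qualitativeRisk": {"coding": [{"system": RISK_SYSTEM, "code": code}]},
                "rationale": finding["source"],
            }
        )
    return {
        "resourceType": "RiskAssessment",
        "status": "final",
        "subject": {"reference": f"Patient/{patient_id}"},
        "occurrenceDateTime": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "prediction": predictions,
    }


resource = to_risk_assessment(
    "example-patient",
    [
        {"summary": "Duplicate antithrombotic class (B01A)", "severity": "major", "source": "WHO ATC"},
        {"summary": "Anticholinergic in older adult", "severity": "moderate", "source": "AGS Beers Criteria 2023"},
    ],
)
print(len(resource["prediction"]))  # 2
```

Because the result is already a FHIR-shaped dict, it could be posted to the server's RiskAssessment endpoint as JSON without a translation step.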

Challenges we ran into

Duplicate-class detection at the right granularity. We started at ATC L4 (chemical subgroup), which catches "two SSRIs" but misses "warfarin + apixaban" — they share L3 (antithrombotic, B01A) but split at L4. Moving to L3 catches the real clinical danger without false-positive noise on combination products. Validating the cutoff against AGS Beers + DDInter ground truth took a full afternoon.
False-negative safety in the empty-chart case. When a FHIR chart returns zero active medications, a naïve safety review reports "no findings." A downstream AI agent reads that as a clean bill of health and tells the clinician the patient is safe. We re-architected the response to expose chart_status: "empty_chart" and a ⛔ "NO REVIEW PERFORMED" banner that downstream agents can't politely soften.
Robustness across FHIR sandboxes. Different R4 servers index different search params. Our defensive fallback: if MedicationStatement?status=active returns nothing, retry without the filter and post-filter on the resource's own status field. The active/inactive distinction is preserved either way.
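
The status-filter fallback is easy to show in isolation. In this sketch the search callable stands in for an HTTP call to the MedicationStatement endpoint; it and the function name are hypothetical, and the only behavior taken from the text is "retry without the filter, then filter on the resource's own status."

```python
from typing import Callable

# Stand-in for a FHIR search call: takes query params, returns matching resources.
Search = Callable[[dict], list]


def active_medication_statements(search: Search, patient_id: str) -> list:
    """Return active MedicationStatements, tolerating servers that ignore `status`."""
    base = {"patient": patient_id}
    hits = search({**base, "status": "active"})
    if hits:
        return hits
    # Fallback: the server may not index `status`, so fetch unfiltered
    # and apply the same active/inactive cut locally.
    return [r for r in search(base) if r.get("status") == "active"]


# A fake server that returns nothing whenever a status filter is present.
resources = [{"id": "a", "status": "active"}, {"id": "b", "status": "stopped"}]


def no_status_index(params: dict) -> list:
    return [] if "status" in params else resources


print(active_medication_statements(no_status_index, "example-patient"))
# [{'id': 'a', 'status': 'active'}]
```

The cost is one extra request on servers that legitimately have no active medications, which seems an acceptable price for not missing them on servers that simply ignore the filter.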

Accomplishments we're proud of

Real clinical risks caught by class-level synthesis that no single-source tool would catch in one call
Native FHIR R4 read and emit, including a RiskAssessment resource that round-trips to the chart
Graceful degradation everywhere — return_exceptions=True on the parallel chart fetch, unparseable-entry tracking instead of silent drops, unresolved-RxCUI surfacing as warnings
Zero PHI: every tool path assumes synthetic test data and respects the principle of least context
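
The return_exceptions=True pattern mentioned above looks roughly like this. The fetch_chart name and the fetcher callables are hypothetical; the point of the sketch is that a failed resource becomes a recorded warning and an explicit None, not an empty list that could be mistaken for "nothing on the chart."

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_chart(fetchers: dict) -> tuple:
    """Run all resource fetchers in parallel; failures degrade per resource.

    `fetchers` maps a resource name to a zero-argument coroutine function.
    """
    names = list(fetchers)
    results = await asyncio.gather(
        *(fetchers[name]() for name in names), return_exceptions=True
    )
    chart, warnings = {}, []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            chart[name] = None  # unknown, which is different from "fetched and empty"
            warnings.append(f"{name}: fetch failed ({type(result).__name__})")
        else:
            chart[name] = result
    return chart, warnings


async def ok() -> list:
    return [{"id": "m1"}]


async def boom() -> list:
    raise TimeoutError("slow server")


chart, warnings = asyncio.run(fetch_chart({"MedicationStatement": ok, "Condition": boom}))
print(chart)     # {'MedicationStatement': [{'id': 'm1'}], 'Condition': None}
print(warnings)  # ['Condition: fetch failed (TimeoutError)']
```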

What we learned

The worst failure mode for a clinical safety tool isn't a false positive. It's a silent false negative — a confident "no issues" against an empty or partial chart. Half our hardening work this hackathon was making the response impossible to misinterpret as "patient is safe."
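
A minimal sketch of that idea, assuming a dict-shaped response: the chart_status value "empty_chart" and the banner text come from the writeup, while build_review, the "ok" status value, and the other field names are hypothetical.

```python
NO_REVIEW_BANNER = "⛔ NO REVIEW PERFORMED"


def build_review(medications: list, findings: list) -> dict:
    """Wrap findings so an empty chart can never read as 'no issues found'."""
    if not medications:
        return {
            "chart_status": "empty_chart",
            "review_performed": False,
            "banner": NO_REVIEW_BANNER,
            # None, not []: there are no findings because nothing was reviewed.
            "findings": None,
        }
    return {
        "chart_status": "ok",  # assumed label for a populated chart
        "review_performed": True,
        "banner": None,
        "findings": findings,
    }


print(build_review([], [])["banner"])            # ⛔ NO REVIEW PERFORMED
print(build_review(["warfarin"], [])["findings"])  # []
```

An empty findings list is only ever returned alongside review_performed set to True, so a consuming agent has to ignore two explicit fields to conclude that an unreviewed patient is safe.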