Inspiration

Sarah is 47. Over 18 months she saw four doctors: a cardiologist flagged elevated LDL and borderline hypertension, an orthopedist treated her frozen shoulder, her GP prescribed an SSRI for anxiety and sleep disruption, and her gynecologist noted irregular periods and said, "That's just perimenopause. It's normal."

Four doctors. Four separate diagnoses. No one connected them. Not because the evidence was missing, but because no tool existed to surface it at the point of care.

Every one of Sarah's symptoms traces back to estrogen decline. But medical knowledge is fragmented across specialties, and clinicians treating a 47-year-old woman in 2026 may still be working from what they learned in training, not from last month's meta-analysis.

AXIOM was built for Sarah's doctors.


What It Does

AXIOM is an MCP server, a Superpower in the PromptOpinion ecosystem, that gives any agent the ability to retrieve, rank, and trust-verify medical evidence without exposing protected health information.

It exposes five composable tools: query_evidence, search_pubmed, ingest_to_chroma, get_new_research, and check_retractions. Any agent on the PromptOpinion platform can invoke these tools; AXIOM handles the evidence infrastructure so the agent doesn't have to.

The pipeline runs in three stages:

PHI Gate → Evidence Retrieval → Quality-Ranked Results

PHI compliance is handled at the platform level. PromptOpinion's SHARP extension scopes patient data and manages FHIR credentials before any context reaches AXIOM. By the time AXIOM's tools are invoked, the clinical context has already been gated.

Evidence retrieval is two-layered. A local ChromaDB vector store pre-seeded with perimenopause and hormonal transition research handles fast semantic search, while the NCBI Entrez API queries PubMed live for the latest indexed literature. Results are re-ranked by a composite scoring algorithm: semantic similarity × study-type hierarchy boost × recency decay. A 2024 meta-analysis will typically outrank a 2018 case report, even when the case report has higher raw similarity.
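As a rough illustration of the composite score described above, here is a minimal Python sketch. The boost values, the five-year half-life, and the names Hit, composite_score, and rank are assumptions made for this example; the writeup does not give AXIOM's actual weights or function names.

```python
from dataclasses import dataclass

# Hypothetical boosts for illustration only; AXIOM's real weights are not stated here.
STUDY_TYPE_BOOST = {
    "meta-analysis": 1.5,
    "rct": 1.3,
    "cohort": 1.1,
    "case-report": 0.8,
    "editorial": 0.6,
}
HALF_LIFE_YEARS = 5.0  # assumed: the recency factor halves every five years


@dataclass
class Hit:
    pmid: str
    similarity: float  # cosine similarity from the vector store, in [0, 1]
    study_type: str
    year: int


def composite_score(hit: Hit, current_year: int) -> dict:
    """Return the final score plus each factor, so the ranking can be audited."""
    boost = STUDY_TYPE_BOOST.get(hit.study_type, 1.0)  # unknown types are neutral
    age = max(0, current_year - hit.year)
    recency = 0.5 ** (age / HALF_LIFE_YEARS)  # exponential decay with age
    return {
        "pmid": hit.pmid,
        "similarity": hit.similarity,
        "boost": boost,
        "recency": recency,
        "score": hit.similarity * boost * recency,
    }


def rank(hits, current_year):
    """Score every hit and sort from highest to lowest composite score."""
    scored = [composite_score(h, current_year) for h in hits]
    return sorted(scored, key=lambda s: s["score"], reverse=True)
```

With these assumed weights, a 2024 meta-analysis at similarity 0.70 ranks above a 2018 case report at similarity 0.85 when current_year is 2026. Returning the individual factors alongside the score is one way to keep the weighting inspectable.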

Before anything returns to the calling agent, every PMID is checked against PubMed retraction notices. Retracted articles are permanently excluded. Articles with expressions of concern receive a score penalty and surface with a warning.
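The exclusion and penalty rules can be sketched as a pure function over already-scored results. This assumes the retraction status of each PMID has been resolved beforehand; how AXIOM looks up retraction notices is not detailed here, and the status labels, the 0.5 penalty, and the function name are hypothetical.

```python
RETRACTED = "retracted"
CONCERN = "expression_of_concern"
CONCERN_PENALTY = 0.5  # assumed multiplier; the real penalty is not given in the writeup


def apply_retraction_status(results, status_by_pmid):
    """Drop retracted PMIDs; penalize and flag expressions of concern.

    results: list of dicts with at least "pmid" and "score".
    status_by_pmid: dict mapping PMID to a status label; missing PMIDs are treated as clean.
    """
    kept = []
    for result in results:
        status = status_by_pmid.get(result["pmid"], "ok")
        if status == RETRACTED:
            continue  # retracted articles never reach the calling agent
        out = dict(result)  # copy so the caller's data is not mutated
        if status == CONCERN:
            out["score"] = result["score"] * CONCERN_PENALTY
            out["warning"] = "Expression of concern on record"
        kept.append(out)
    # Re-sort because penalties may have changed the order.
    return sorted(kept, key=lambda r: r["score"], reverse=True)
```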

To demonstrate the tools working as an end-to-end clinical workflow, we built an AXIOM Agent: an LLM-powered orchestration layer whose structured system instructions direct the model to reason over a patient presentation, decide which MCP tools to invoke and in what order, and synthesize the returned evidence into a clinical narrative. Those instructions also reinforce PHI compliance at the orchestration layer by requiring that all queries be expressed as pathophysiological concepts rather than patient-specific language: no names, no MRNs, and no raw clinical quotes ever enter the evidence retrieval pipeline. The agent knows when to run check_retractions before trusting results from query_evidence, when to follow a local knowledge base miss with a live search_pubmed call, and how to present ranked evidence without overstating certainty.
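In the real agent the LLM chooses the tool order, but the sequencing it is instructed to follow can be approximated deterministically. The sketch below is only that approximation: the three tools are passed in as plain callables, their signatures and return shapes are assumptions, and min_hits is a hypothetical threshold for what counts as a local miss.

```python
def gather_evidence(query, query_evidence, search_pubmed, check_retractions, min_hits=3):
    """Local store first, live PubMed on a miss, retraction check before returning.

    Assumed tool shapes (not confirmed by the writeup):
      query_evidence(query)    -> list of dicts with a "pmid" key
      search_pubmed(query)     -> list of dicts with a "pmid" key
      check_retractions(pmids) -> iterable of PMIDs that are retracted
    """
    hits = list(query_evidence(query))

    # Treat too few local hits as a miss and top up from live PubMed,
    # skipping any PMIDs the local store already returned.
    if len(hits) < min_hits:
        seen = {h["pmid"] for h in hits}
        hits += [h for h in search_pubmed(query) if h["pmid"] not in seen]

    # Nothing is trusted until its retraction status has been checked.
    retracted = set(check_retractions([h["pmid"] for h in hits]))
    return [h for h in hits if h["pmid"] not in retracted]
```

The query passed in is expected to be a pathophysiological concept, for example "estrogen decline and LDL elevation", never patient-specific text.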


How We Built It

  • FastMCP as the MCP server framework, with five tools published to the PromptOpinion marketplace: query_evidence, search_pubmed, ingest_to_chroma, get_new_research, and check_retractions
  • PromptOpinion SHARP extension for FHIR context propagation via X-FHIR-Server-URL / X-FHIR-Access-Token headers, declared through custom MCP capability negotiation
  • ChromaDB with cosine similarity and a SQLite persistent backend for the local evidence store
  • Sentence Transformers (all-MiniLM-L6-v2) for local embeddings with no external embedding API dependency
  • Biopython / NCBI Entrez API for live PubMed search and retraction verification
  • A custom composite scoring module implementing study-type hierarchy (meta-analysis through editorial), exponential recency decay, and retraction-aware re-ranking
  • Gemini as the demo orchestration agent, configured with structured system instructions governing tool selection, invocation order, PHI handling, and evidence synthesis behavior

Challenges We Ran Into

Composite scoring architecture. Semantic similarity alone is a poor proxy for evidence quality: a 2018 case report can outscore a 2023 RCT on raw cosine similarity. The scoring module weights study-design hierarchy and recency decay on top of similarity while keeping those weights transparent and auditable rather than hidden inside a black box. That required careful calibration to avoid over-penalizing older foundational research while still surfacing current evidence.

PromptOpinion SHARP header integration. Propagating FHIR context through the MCP capability negotiation layer required extending FastMCP's initialization options to declare the SHARP extension scopes. The platform's FHIR credentials arrive as custom request headers (X-FHIR-Server-URL, X-FHIR-Access-Token, X-Patient-ID) rather than through standard OAuth flows, which meant writing a custom context extraction layer and a token refresh handler that works within MCP's request lifecycle.
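A context extraction layer of this kind might look like the following minimal sketch. Only the three header names come from the description above; FhirContext, extract_fhir_context, and the choice to treat the patient ID as optional are assumptions, and the token refresh handler is not shown.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FhirContext:
    server_url: str
    access_token: str
    patient_id: Optional[str] = None  # assumed optional for this sketch


def extract_fhir_context(headers: Mapping[str, str]) -> FhirContext:
    """Pull the SHARP FHIR values out of a request's headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}

    server_url = normalized.get("x-fhir-server-url")
    access_token = normalized.get("x-fhir-access-token")
    if not server_url or not access_token:
        raise ValueError("Missing X-FHIR-Server-URL or X-FHIR-Access-Token header")

    return FhirContext(
        server_url=server_url.rstrip("/"),  # avoid double slashes when building URLs
        access_token=access_token,
        patient_id=normalized.get("x-patient-id"),
    )
```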


Accomplishments That We're Proud Of

The retraction-checking pipeline is the piece we're most proud of. Clinical AI that surfaces retracted research creates a patient safety problem.

The AXIOM Agent's system instructions are the other piece worth highlighting. Getting an LLM to behave reliably as a clinical evidence orchestrator, knowing which tools to call, in what order, and with what constraints on how uncertainty is communicated, required the same discipline as writing a clinical protocol. The agent reasons about when the evidence is strong enough to surface and when to keep looking.

The composite scoring module encodes something evidence-based medicine has formalized for decades: study hierarchy, built directly into retrieval ranking rather than applied as a post-hoc filter.


What We Learned

Medical evidence retrieval is a trust problem before it is a search problem. The bottleneck is knowing whether to trust what you found. Recency, study design, and retraction status each carry independent signal that raw semantic similarity cannot capture. Building AXIOM meant encoding that signal explicitly rather than treating embedding distance as a sufficient proxy for quality.

Building the demo agent clarified something important about the MCP architecture. The tools are necessary infrastructure, but the orchestration layer determines whether a clinician can act on what comes back. How an agent sequences tool calls, handles ambiguous results, and frames uncertainty matters as much as the tools themselves. A Superpower is only as useful as the agent that wields it, and AXIOM's tools were designed with that constraint in mind.


What's Next for AXIOM Medical Evidence MCP

The next version replaces the flat ChromaDB vector store with a Graph RAG architecture. Medical knowledge is not flat. Estrogen decline connects to lipid metabolism, which connects to cardiovascular risk, which connects to frozen shoulder via inflammatory pathways. A knowledge graph where hormonal state is a first-class node, with typed relationships between conditions, mechanisms, and evidence, would let AXIOM surface those connections rather than returning independent ranked documents.

Any agent invoking AXIOM's tools would then be querying a graph rather than a flat index: following hormonal substrates, traversing mechanistic edges, and returning evidence that reflects the biological relationships underneath a patient's presentation. That is the architecture that gets Sarah's four diagnoses into one conversation.
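To make the idea of typed edges concrete, here is a toy Python sketch of such a traversal. It is not the planned Graph RAG design: the node names mirror the chain described above, while the relation labels, the EDGES structure, and the trace function are placeholders invented for illustration.

```python
from collections import deque

# Toy graph: node -> list of (relation, target). Relation labels are placeholders.
EDGES = {
    "estrogen_decline": [("affects", "lipid_metabolism")],
    "lipid_metabolism": [("contributes_to", "cardiovascular_risk")],
    "cardiovascular_risk": [("linked_via_inflammation", "frozen_shoulder")],
}


def trace(start, graph):
    """Breadth-first walk returning every (source, relation, target) edge reachable from start."""
    seen = {start}
    queue = deque([start])
    triples = []
    while queue:
        node = queue.popleft()
        for relation, target in graph.get(node, []):
            triples.append((node, relation, target))
            if target not in seen:  # guard against cycles
                seen.add(target)
                queue.append(target)
    return triples
```

Calling trace("estrogen_decline", EDGES) walks the whole chain, so a query that starts at the hormonal node reaches the cardiovascular and orthopedic findings in one pass. In a fuller design each edge would presumably also carry the evidence that supports it.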
