Inspiration

A few months ago I was reading about AI scribes hallucinating drug names in clinical transcriptions. Not edge cases - documented incidents. Then I looked at the broader numbers: studies showing AI health chatbots hallucinate in up to 48% of responses, replacing peer-reviewed sources with confident-sounding fiction.

The thing that struck me wasn't that AI gets medical content wrong. It's that there's no systematic gate between AI-generated health content and the people acting on it. Symptom checkers, wellness apps, and hospital scribes routinely ship AI output to patients and clinicians with no quality layer in between.

That's what MedCred is built to fix.


What It Does

MedCred is a medical AI content integrity monitor. You paste AI-generated health content - a drug recommendation, a treatment protocol, a symptom explanation - and MedCred runs it through a hybrid verdict stack before returning a PASS or FAIL with full evidence.

The system has two deliberate layers working together:


Layer 1 - Deterministic Safety Gate

MedCred runs OpenFDA extraction first and feeds the results into the judge as evidence. The gate itself is applied after the verdict, so its hard rules take precedence over whatever the judge decided:

  1. Extracts all drug names from the content
  2. Looks each one up via OpenFDA
  3. Checks against a deprecated drug list and high-risk pattern rules
  4. If a drug doesn't exist in FDA records → automatic FAIL, no override

This layer handles the obvious failures - fabricated drug names, deprecated medications, dangerous dosing patterns - with hard rules that LLM confidence cannot override.

Example: Content mentioning synaptol fails on OpenFDA evidence fed into the judge; the safety gate then enforces FAIL regardless of the judge's confidence score.
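
A minimal sketch of how a post-verdict gate like this could look, assuming drug names have already been extracted and the OpenFDA lookup is passed in as a plain function. The names apply_safety_gate, GateResult, and DEPRECATED_DRUGS are hypothetical and not taken from MedCred's code, and the deprecated list holds placeholder data only.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

# Placeholder entry for illustration; the real deprecated list is not shown in this writeup.
DEPRECATED_DRUGS = {"exampledrug-withdrawn"}


@dataclass
class GateResult:
    verdict: str                      # "PASS" or "FAIL" after the gate is applied
    overridden: bool                  # True if the gate replaced the judge's verdict
    reasons: list[str] = field(default_factory=list)


def apply_safety_gate(
    judge_verdict: str,
    drug_names: Iterable[str],
    exists_in_fda: Callable[[str], bool],
) -> GateResult:
    """Enforce hard rules after the judge has produced its verdict."""
    reasons = []
    for name in drug_names:
        key = name.strip().lower()
        if not exists_in_fda(key):
            reasons.append(f"{key}: no FDA record")
        elif key in DEPRECATED_DRUGS:
            reasons.append(f"{key}: deprecated")
    if reasons:
        # Any hit forces FAIL, whatever the judge said.
        return GateResult("FAIL", overridden=(judge_verdict != "FAIL"), reasons=reasons)
    return GateResult(judge_verdict, overridden=False)


# Stand-in for an OpenFDA lookup: a fixed set of known names.
known = {"lisinopril", "metformin"}
result = apply_safety_gate("PASS", ["Lisinopril", "Synaptol"], lambda n: n in known)
print(result.verdict, result.reasons)
```

Passing the lookup in as a callable keeps the gate itself deterministic and easy to test without network access.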


Layer 2 - Phoenix-Informed Judge

For content that clears the gate (real drugs, plausible dosing), MedCred handles subtler failures that rules can't catch - fabricated PMIDs, inflated efficacy percentages, non-existent drug interactions stated as fact.

Here's how it works:

  1. Phoenix MCP call - Before the judge runs, MedCred calls @arizeai/phoenix-mcp get-spans (persistent MCP session on Analyze; ADK McpToolset on the batch path) to query FAIL span history
  2. Drug-aware pattern matching - It extracts drug tokens and citation signals from the current content and matches them against prior failure patterns in Phoenix traces
  3. Specific injection into judge prompt - Not generic fail counts. Specific context like: "content mentioning lisinopril: 8/8 matching prior FAIL spans - treat invented or unsafe claims as high risk"
  4. Gemini judge synthesizes - With this context, external API evidence, and OpenFDA results, the judge produces a verdict with explanation

Example: A citation fraud case - real drugs, plausible dosing, but a fabricated PMID and a 91% success claim - clears the gate entirely. Both drugs exist. No dangerous dosing. Only the Phoenix-injected pattern context catches it. This is a named demo preset in the Analyze UI.
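
Steps 2 and 3 above can be sketched roughly as follows. This assumes the spans returned by get-spans have already been reduced to simple dicts with "text" and "verdict" keys; that shape is an assumption for illustration and not Phoenix's actual span schema. The function name build_pattern_context and the sample spans are invented.

```python
import re


def build_pattern_context(drug_tokens, prior_spans):
    """Summarize how often prior spans mentioning each drug ended in FAIL."""
    lines = []
    for drug in drug_tokens:
        pattern = re.compile(rf"\b{re.escape(drug)}\b", re.IGNORECASE)
        matching = [s for s in prior_spans if pattern.search(s["text"])]
        if not matching:
            continue  # no history for this drug, so nothing to inject
        fails = sum(1 for s in matching if s["verdict"] == "FAIL")
        lines.append(
            f"content mentioning {drug}: {fails}/{len(matching)} matching prior spans ended in FAIL"
        )
    return "\n".join(lines)


# Invented sample history; the first entry imitates citation-fraud content.
spans = [
    {"text": "Lisinopril cures hypertension in 91% of patients (PMID 00000000)", "verdict": "FAIL"},
    {"text": "Take lisinopril 10 mg daily as prescribed", "verdict": "PASS"},
    {"text": "Metformin is a first-line therapy for type 2 diabetes", "verdict": "PASS"},
]
context = build_pattern_context(["lisinopril", "ibuprofen"], spans)
print(context)
```

The resulting string would be appended to the judge prompt alongside the OpenFDA and external API evidence.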


External Verification Chain

Every evaluation hits four real APIs:

Source               What it checks
DailyMed             Drug label / SPL lookup
PubMed               Literature validation for efficacy claims
ClinicalTrials.gov   Trial ID verification (fake NCT IDs)
MedlinePlus          Consumer health topic matching

Each source returns SUPPORTED, CONTRADICTS, NOT_FOUND, or INCONCLUSIVE - visible in the UI and fed into the judge context.
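
One way to model these four statuses and turn per-source results into judge context is sketched below. EvidenceStatus and format_evidence are hypothetical names, and the sample results are made up for illustration.

```python
from enum import Enum


class EvidenceStatus(str, Enum):
    SUPPORTED = "SUPPORTED"
    CONTRADICTS = "CONTRADICTS"
    NOT_FOUND = "NOT_FOUND"
    INCONCLUSIVE = "INCONCLUSIVE"


def format_evidence(results: dict[str, EvidenceStatus]) -> str:
    """Render per-source results as one line each for the judge prompt."""
    return "\n".join(f"{source}: {status.value}" for source, status in results.items())


# Made-up results for a single evaluation.
results = {
    "DailyMed": EvidenceStatus.SUPPORTED,
    "PubMed": EvidenceStatus.NOT_FOUND,
    "ClinicalTrials.gov": EvidenceStatus.NOT_FOUND,
    "MedlinePlus": EvidenceStatus.INCONCLUSIVE,
}
print(format_evidence(results))
```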


Self-Improvement Loop

Every evaluation is traced to Phoenix with span annotations. Every 5 evaluations, the self-improve loop runs:

  • Reads recent FAIL patterns from Phoenix trace history
  • Rewrites the judge system prompt via Gemini
  • Validates the candidate prompt on a held-out labeled slice before promotion, comparing baseline and candidate agreement %
  • Promotes the candidate only if its agreement meets or exceeds the baseline
  • Rejects worse prompts and logs both agreement figures
  • Reports full 100-case agreement live via POST /calibration/run?mode=fast

The current system achieves 99% agreement with ground truth on the 100-case labeled eval set (40 PASS / 40 FAIL / 20 edge).
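
The promotion rule can be sketched in a few lines. This is a toy version under stated assumptions: the judges are stand-in functions rather than Gemini calls, the hold-out slice has four invented cases, and agreement and should_promote are hypothetical names.

```python
def agreement(predict, labeled_cases):
    """Fraction of cases where the predicted verdict matches the label."""
    hits = sum(1 for content, label in labeled_cases if predict(content) == label)
    return hits / len(labeled_cases)


def should_promote(baseline_predict, candidate_predict, holdout):
    """Promote the candidate only if it meets or exceeds the baseline."""
    base = agreement(baseline_predict, holdout)
    cand = agreement(candidate_predict, holdout)
    return cand >= base, base, cand


# Toy hold-out slice; real runs would call the judge once per prompt version.
holdout = [("a", "PASS"), ("b", "FAIL"), ("c", "PASS"), ("d", "FAIL")]


def baseline(content):
    # Gets 3 of 4 right: wrongly passes case "d".
    return "PASS" if content in {"a", "c", "d"} else "FAIL"


def over_learned(content):
    # A prompt that flags everything as FAIL: 2 of 4 right.
    return "FAIL"


promote, base, cand = should_promote(baseline, over_learned, holdout)
print(promote, base, cand)
```

Here the over-learned candidate scores 0.5 against a 0.75 baseline, so it is rejected, which is the failure mode the gate exists to catch.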


How I Built It

Architecture Overview


Backend

  • Runtime: Google ADK - SequentialAgent on the batch path with McpToolset for Phoenix MCP intake; eval_pipeline.py on the Analyze/Theater path with persistent @arizeai/phoenix-mcp get-spans before every judge run
  • LLM: Gemini via Vertex AI (global) - 3.1 Flash Lite for ADK/MCP orchestration and extraction, 2.5 Flash for the medical judge, 2.5 Pro for corrections and self-improve
  • Grounding: Vertex AI Agent Builder datastore backed by GCS - curated medical corpus (drug monographs, clinical guidelines, dangerous pattern references)
  • Framework: FastAPI, Python 3.13, uv
  • Persistence: PostgreSQL via Neon, with in-memory fallback

Observability

  • Arize Phoenix Cloud - all evaluations traced with OpenInference instrumentation (openinference-instrumentation-google-adk)
  • @arizeai/phoenix-mcp - persistent stdio MCP session on Analyze; ADK McpToolset on batch path; get-spans with REST/ADK fallbacks
  • Span annotations - verdict scores, drug matches, citation fraud signals
  • Prompt versioning - create_or_update_prompt pushes validated prompt updates to Phoenix

Frontend

  • Next.js 14, TypeScript, Tailwind CSS, custom UI components
  • SSE streaming - pipeline runs visibly, step by step, in real time
  • 3-state Analyze flow - form → live pipeline progress → structured result
  • History page - accuracy trend chart showing self-improvement effect across runs
  • Theater mode - batch publisher monitoring with per-item forensics rail
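
On the backend side, step-by-step streaming comes down to emitting one Server-Sent Events frame per pipeline step. A minimal sketch, assuming hypothetical step names and a hard-coded placeholder verdict; in a FastAPI app a generator like this would typically be wrapped in a streaming response with the text/event-stream media type.

```python
import json
from typing import Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: event name, JSON data line, terminating blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def pipeline_events(content: str) -> Iterator[str]:
    # Hypothetical step names; each yield would follow the real work for that step.
    for step in ("openfda", "phoenix_context", "external_apis", "judge", "safety_gate"):
        yield sse_event("step", {"name": step, "status": "done"})
    # Placeholder verdict; the real value would come from the judge and gate.
    yield sse_event("result", {"verdict": "FAIL"})


frames = list(pipeline_events("sample content"))
print(frames[0], end="")
```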

Deployment

  • Backend: Railway
  • Frontend: Vercel
  • Database: Neon (serverless Postgres)

Challenges I Ran Into

The gate-dominance problem was the hardest design challenge. The deterministic gate is so effective on obvious cases that Phoenix pattern history had little to do on fake-drug content. Synaptol fails because OpenFDA has no record of it - OpenFDA evidence feeds the judge, and the gate enforces FAIL after the verdict. That meant my "the agent learns from its own failures" story had no demo case to stand on for subtle fraud.

I had to deliberately design a content type where the gate couldn't help. The citation fraud case - real drugs, plausible dosing, but a fabricated PMID and inflated 91% success claim - clears the gate entirely. Only the Phoenix-injected pattern context, built from prior FAIL spans flagging citation fraud signals, causes the judge to fail it. That's now a named preset in the UI.

The self-improve loop was the second hard problem. The first version updated the judge prompt on every 5th eval with no verification. A prompt that over-learns recent failures can start flagging everything as FAIL. The fix was adding a calibration A/B gate - the candidate prompt only promotes if it meets or exceeds the baseline on a held-out labeled slice. That changed the loop from self-modification into actual self-improvement.


Accomplishments I'm Proud Of

The citation fraud detection path genuinely surprised me during testing. With an empty Phoenix history, the judge passed the citation fraud content. After pre-seeding with real FAIL span history and running the same content again, it failed - correctly, with specific citation fraud signals called out in the verdict. That's the Phoenix loop working exactly as it should: the system got smarter from its own traces without me changing a single line of judge logic.

The calibration gate is the other one. Knowing that every prompt update has to beat a measured baseline before going live made the self-improve loop feel real rather than cosmetic.

59 tests passing. 100-case labeled eval set. 99% agreement. A live /calibration/run endpoint that returns the agreement number in real time. For a solo build, I'm satisfied with that.


What I Learned

Observability without a feedback loop is just logging.

The difference between Phoenix as a debugging tool and Phoenix as a core product component is whether the agent reads its own traces before making decisions. Wiring get-spans into the Analyze critical path - so the judge prompt changes based on real trace history - is what made the integration meaningful rather than decorative.

I also learned that deterministic rules and learned patterns aren't competing approaches. They're complementary layers. The gate handles what rules can express. Phoenix handles what only patterns can reveal. Getting that boundary right took more design iteration than any single feature.


What's Next for MedCred

The paste-and-analyze audit is a working product wedge. The real next steps:

  • Live publisher ingestion - persistent connection to health app APIs and RSS feeds for continuous monitoring, not just on-demand evaluation
  • Physician-labeled calibration expansion - grow the gold set toward clinical credibility, push agreement % higher with domain expert annotation
  • Phoenix monitors API integration - automated alerting when a publisher's content quality degrades below threshold
  • Multi-tenant reviewer queue - clinical safety teams triage FAIL verdicts directly in the dashboard, with human-in-the-loop annotation flowing back into Phoenix

The infrastructure is production-grade. The problem is real and growing. This is the beginning of what medical AI infrastructure should look like.
