High Level Workflow Diagram
Detailed Workflow Diagram
Architecture
7 Evaluators Trace
Self-healing Candidates
Auto-deployed self-healed prompt version
Failure Datasets
Traces & Spans for Evaluators and Chains

IRIS - Inference Risk and Integrity Supervisor [GCP Agent Builder + Arize]

Hosted URL: https://iris-clinical-797220546460.us-central1.run.app/

Repo: https://github.com/adityashukla8/iris

Demo video: https://youtu.be/vlp-cjAwNNU

Note for Judges: When trying out the live dashboard site, please reset the session by clicking "Reset for New Demo" to start with a fresh session state (currently, session management isn't implemented).

From Idea to Impact

The problem - Clinical AI agents are being deployed in operating rooms and ICUs with inadquate runtime safety layer

Clinical AI is entering the highest-stakes environments in medicine. Agents are recommending drugs, dosing regimens, and care plans directly to providers. When they hallucinate, they do so silently - no alarm fires, no circuit breaker trips, and no one knows until a nurse catches it or, worse, until they do not.

The evidence is unambiguous:

91.8% of clinicians surveyed reported encountering medical AI hallucinations; 84.7% said those hallucinations were capable of causing direct patient harm (MIT Media Lab et al., medRxiv 2025, n=70)
83% of cases where leading LLMs repeated or elaborated on a single planted clinical error - fabricated lab values and conditions propagating into unsafe recommendations (Communications Medicine, Nature, 2025, n=300 vignettes)
$42B annual global cost of medication errors, with AI agents now recommending drugs and doses in the same clinical environments (WHO - Medication Without Harm)

Beyond catching individual failures: there is no mechanism to detect that the same type of failure has happened 15 times this week, identify the root cause, fix it, and verify the fix works - all without a prompt engineer filing a ticket. Clinical AI deployments accumulate failures invisibly. The prompt that caused harm today will cause the same harm tomorrow.

The Solution

IRIS is a self-healing observability layer that sits between clinical AI agents and their users.

Every agent output is submitted to IRIS, which evaluates it across 7 clinical safety evaluators in real time, traces the result to Arize Phoenix, and - when a recurring failure pattern is detected

autonomously rewrites the failing system prompt
validates the fix against real captured failures
deploys the improved version back to Phoenix.

The source agent gets safer over time without human intervention.

The thesis: hospitals need a purpose-built trust and safety supervisor for clinical AI, not generic observability tooling.

The difference is clinical grounding (IRIS knows that metformin 10,000 mg is dangerous and 2,000 mg is not, because it checks RxNorm and OpenFDA) combined with a closed remediation loop (not just "here is what went wrong" but "here is the fixed prompt, here is proof it works, here is the before/after score").

What It Does [ADK + Vertex AI + Arize Phoenix MCP]

IRIS evaluates every clinical AI output the moment it is produced and closes the improvement loop automatically:

Real-time evaluation

Any clinical agent submits its output via POST /event. IRIS runs 7 safety evaluators concurrently using Gemini 2.5 Flash on Vertex AI, each grounded in trusted clinical sources:

Evaluator	What it catches	Knowledge source
`factual_hallucination`	Drug names not in RxNorm; impossible values; invented procedures	RxNorm API + LLM judge
`dosage_boundary`	Doses outside FDA-approved range; missing renal adjustment	OpenFDA API + LLM judge
`drug_interaction`	Dangerous drug-drug combinations in the regimen	OpenFDA + LLM judge
`allergy_contraindication`	Drug from a class the patient is allergic to	Allergy context + LLM judge
`attribution`	Claims not traceable to the patient record	LLM judge
`context_gap`	Clinical question answered without required patient variables	LLM judge (dynamic inference)
`surgical_phase`	Recommendations inappropriate for the current surgical phase	Phase taxonomy + LLM judge

Results are written as OTel span attributes and REST annotations to Arize Phoenix Cloud in real time. The live dashboard receives a Server-Sent Events stream of every alert.

Autonomous self-healing

When a failure cluster crosses the configured threshold, IRIS fires an event-driven scan:

The pattern_detector ADK LlmAgent reads recent spans and annotations through the Arize Phoenix MCP server (get-spans, get-span-annotations) to identify the failure cluster - which agent, which query type, which failure pattern.
A diagnosis step (Gemini 2.5 Flash) extracts root cause from real captured failure traces and logs them to a Phoenix dataset.
The mutation engine generates textual gradients per failing example, then synthesizes a targeted constraint for the system prompt.
The validation gate re-runs the candidate prompt against 5 real captured failures using Gemini 2.5 Flash at temperature=0.0, seed=42. The candidate must improve the mean safety score by the configured improvement threshold to pass.
The validated prompt is deployed to Phoenix as a new versioned version tagged production. The agent fetches it on the next run.

Closing the visible loop - Live Agent Mode

For the demo purposes, the dashboard's simulation modal has two modes:

Synthetic Response mode: it runs pre-scripted unsafe outputs - a failure baseline.
Live Generation mode: IRIS generates each clinical answer in real time using Gemini 2.5 Flash under the current production prompt fetched from Phoenix (real-world scenario).

The Agent Improvement panel on the Overview tab shows both runs side by side: pass rate before and after the heal, critical delta, and the prompt hash that changed.

MCP probe

A second ADK LlmAgent - mcp_probe - is exposed at POST /mcp/chat. It gives operators a conversational interface to their Phoenix observability data: spans, prompts, datasets, experiments, evaluations. 10 read-only Phoenix MCP tools.

Features

Feature	Detail
Google ADK agents	Two ADK LlmAgents - `pattern_detector` introspects its own Phoenix observability data through the MCP server; `mcp_probe` exposes 10 read-only Phoenix MCP tools via conversational chat
Gemini 2.5 Flash + Pro on Vertex AI	Flash drives all 7 evaluator judges, the mutation engine, and live generation; Pro drives the pattern detector (required for Phoenix MCP tool schemas)
TextGrad-style mutation engine	Per-example textual gradients generated by Gemini, synthesized into a targeted system prompt constraint
Counterfactual validation gate	Candidate prompt re-evaluated against real captured failures with Gemini at temperature=0.0, seed=42 before deployment
Live Generation mode	Clinical answers generated in real time by Gemini under the current production prompt; improvement is measured, not assumed
Automated Cloud Run deploy	CI/CD via Cloud Build; single warm instance for SSE state and MCP spawn consistency
Arize Phoenix OTel tracing	Per-evaluator span attributes and REST annotations written to Phoenix Cloud on every event
Phoenix prompt versioning	Every system prompt version is content-hashed, tagged production or candidate, and stored in Phoenix; live agent fetches the current tag at run time
Phoenix MCP integration	`@arizeai/phoenix-mcp@4.0.8` with before/after ADK tool callbacks - 40+ bloat attributes stripped from responses for 50-80% context size reduction
Before/after comparison panel	Pass rate delta, critical delta, and prompt hash change shown side by side - sourced from Phoenix prompt tags
Event-driven scan trigger	Autonomous heal fires when a failure cluster crosses threshold; no cron job needed
Human-in-the-loop option	HEALING_AUTO_APPROVE=false queues candidates for manual approval via the dashboard
Clinical grounding	RxNorm API validates every drug name deterministically; OpenFDA provides approved dosage ranges and interaction data
9-scenario simulation suite	Covers drug dosage errors, hallucinated drugs, missed allergies, drug interactions, context gaps, surgical phase violations
Real-time dashboard	SSE alerts, evaluator heatmap, severity timeline, trace inspector with full prompt visibility, MCP chat

How It Was Built

Evaluation layer

Seven EvalPlugin subclasses in core/evaluators/ each receive an IrisEvent and return an EvalResult with score (0-10), severity, passed flag, rationale, and flagged claims. core/evaluators/service.py runs all applicable evaluators as concurrent asyncio.gather tasks. The worst-severity result drives the alert.

factual_hallucination is a hybrid: it first calls the RxNorm API to verify every drug name the agent mentioned (a deterministic CRITICAL if any drug is unknown), then uses Gemini 2.5 Flash as an LLM judge for the broader hallucination check. dosage_boundary and drug_interaction query OpenFDA drug labels for the specific drugs mentioned in the output.

All LLM judge calls use temperature=0.0, seed=42. This was essential for demo reproducibility - without seeding, the same borderline output would score 6.8 one run and 7.2 the next, making before/after comparison meaningless.

ADK agents and MCP integration

pattern_detector is a Gemini 2.5 Pro LlmAgent with McpToolset pointing at @arizeai/phoenix-mcp@4.0.8 (launched via npx). Pro is required here - Gemini 2.5 Flash emits MALFORMED_FUNCTION_CALL against Phoenix MCP's complex tool schemas.

A before_tool_callback clamps the limit argument on get-spans calls (the LLM sometimes inflates it beyond the MCP server's allowed range, which wastes a full tool round-trip with a too_big error). An after_tool_callback strips 40+ bloat attributes from MCP responses before they enter the LLM context - Phoenix stores full Gemini request/response JSON on every ADK span, which for 10 spans runs to 200K+ characters and approaches Gemini's context limit.

The pattern detector has a deterministic clustering fallback: if the MCP call fails (Phoenix Cloud outage, network hiccup), it clusters by (agent, query_type, severity) directly from recent_traces in memory.

Self-healing pipeline

Everything between pattern detection and prompt deployment is deterministic Python - no ADK agent involved. This was a deliberate design choice: the healing pipeline must be reliable and auditable. The ADK agent is right for open-ended MCP introspection; a deterministic pipeline is right for the critical path of rewriting a prompt that will run against patients.

The mutation engine uses a TextGrad-style two-pass approach: Gemini generates a textual gradient per failing example ("what instruction would have prevented this specific failure"), then synthesizes the per-example gradients into a single targeted constraint. This produces specific, actionable constraints ("Always verify gentamicin dosing against renal function before recommending") rather than generic additions.

Validation re-runs the candidate prompt against 5 real captured failures using RESPONDER_PROMPT - the exact same template used in live mode, with no safety scaffold. The gate measures what the candidate actually produces under live conditions, not what it produces with extra hand-holding. This alignment between validation and live mode was what made the improvement measurable.

Infrastructure

FastAPI with SSE for real-time alert streaming. Cloud Run with --min-instances=1 (state is in-memory; cold start + npx MCP spawn on first judge visit would be unacceptable) and --max-instances=1 (in-memory deques would shard across instances otherwise). Cloud Build CI/CD triggers on every push to main.

Data Sources

Source	How IRIS uses it
RxNorm API (NLM)	Validates every drug name mentioned in a clinical AI output; CRITICAL if any name is not in the RxNorm concept database
OpenFDA drug labels API	Provides FDA-approved dosage ranges and drug-drug interaction data for the specific drugs mentioned
Arize Phoenix Cloud	Source of truth for all evaluation spans and annotations; the pattern detector reads its own observability data through Phoenix MCP to identify failure clusters
Synthetic patient records	Nine demo scenarios use synthetic FHIR-style patient records with plausible clinical parameters (creatinine clearance, allergies, current medications, surgical phase) - no real patient data

Google Cloud Services

Service	Purpose
Vertex AI	Hosts all Gemini calls - Gemini 2.5 Flash for the 7 evaluators, live agent generation, and mutation engine; Gemini 2.5 Pro for the ADK pattern detector agent
Cloud Run	Serverless container for the FastAPI service; single warm instance for SSE state and MCP spawn consistency
Cloud Build	CI/CD pipeline - builds Docker image, pushes to Artifact Registry, deploys to Cloud Run on every push to main
Artifact Registry	Docker image storage
Secret Manager	Stores PHOENIX_API_KEY and PHOENIX_CLIENT_URL; injected at deploy time by Cloud Build

Phoenix Services

Service	How IRIS uses it
OTel spans (OpenInference)	Every `POST /event` produces a span with all 7 evaluator scores, severities, rationales, and flagged claims as span attributes
REST span annotations	Structured per-evaluator annotations via `/v1/span_annotations` written after each evaluation - gives Phoenix's annotation UI per-evaluator pass/fail cards
Prompt versioning + tagging	Every system prompt version is stored in Phoenix under `{agent}-system` (e.g. `care-advisor-v2-system`); healed prompts are tagged `production` or `candidate`; the live agent fetches the current `production` tag at run time
Datasets	Real failing traces are logged to a Phoenix dataset per heal cycle for reproducibility and audit
Experiments	Heal validation runs are logged as Phoenix experiments - before/after scores, per-example results, improvement delta all visible in the Phoenix UI
Phoenix MCP server	The `pattern_detector` ADK LlmAgent and `mcp_probe` agent connect via `@arizeai/phoenix-mcp@4.0.8`; 10 read-only tools expose spans, annotations, prompts, datasets, experiments, and evaluations to the LLM

Challenges

MCP response bloat

Phoenix stores the full Gemini request/response JSON on every ADK-generated span as gcp.vertex.agent.llm_request and gcp.vertex.agent.llm_response. For 10 spans this is 200K+ characters (~50K tokens), which approaches Gemini's context limit before the agent has written a single word of analysis. The solution was an after_tool_callback in mcp_filter.py that strips 40+ named attributes and all attribute key prefixes in the bloat set from every MCP response before it enters the LLM context - a 50-80% size reduction with no loss of information the agent needs.

Determinism in evaluation and validation

Early demo runs produced different severity verdicts on identical inputs because LLM judges at non-zero temperature fluctuate near decision boundaries. A score of 6.8 on one run and 7.2 on the next (pass threshold is 7.0) made the demo unreliable and the before/after comparison meaningless. Setting temperature=0.0, seed=42 on all evaluator and validation generation calls made scores reproducible across runs.

Event-driven scan trigger timing

The auto-scan trigger was silently failing to fire even after 3 critical events from the same agent. Two bugs: the current event was not yet in recent_traces when the threshold check ran (dispatch_alert is called before appendleft), so the count was always one short; and the cluster count was scoped per query_type, so 3 failures of different query types from the same agent never crossed the threshold. Fixed by counting across all query types and adding +1 for the current event not yet in recent_traces.

Accomplishments

A fully working closed loop: clinical AI output submitted, evaluated in real time across 7 dimensions, failure cluster detected from its own observability data via Phoenix MCP, root cause diagnosed, system prompt mutated, validated against real captured failures, deployed to Phoenix, live agent generates under the improved prompt, before/after improvement measured and displayed. End to end, from first critical alert to improved pass rate, with no human intervention.
Clinical grounding that goes beyond LLM judgment: RxNorm validation catches hallucinated drug names deterministically; OpenFDA provides actual FDA-approved dosage ranges for the specific drug and patient renal function. The evaluators know clinical facts, not just fluency.
The mutation engine produces targeted constraints from real failures ("always verify gentamicin dosing against CrCl before recommending") rather than generic additions. The validation gate uses seed=42 deterministic generation so improvement scores are reproducible.
MCP integration depth: the pattern detector agent uses Phoenix MCP not just to read spans but to read its own evaluation annotations, identify which evaluator flagged which agent, and determine whether a failure cluster is genuine or a one-off. This is the self-improvement-from-own-observability-data criterion made concrete.
The MCP filter callback design: rather than limiting what the agent can query, IRIS lets the agent request whatever it needs and strips the irrelevant bloat on the way in. The agent gets full context; the LLM context window gets protected.

What's Next for IRIS

Persistence layer - Replace the in-memory state (recent_traces deque, healing history) with SQLite behind the existing state.py API, then Firestore for multi-instance scale. This removes the single-instance Cloud Run constraint and makes audit trails durable across restarts.

Multi-agent fleet view - A /agents endpoint and fleet dashboard table showing per-agent health, active prompt hash, recent pass rate, and time since last heal. The per-agent namespacing is already in place; the fleet view surfaces it.

Statistical heal validation - Current validation is N=5 mean delta. Replace with paired bootstrap confidence intervals and automatic rollback if the live pass rate after deployment does not match the validation estimate within a tolerance band.

SDK packaging - Publish iris-sdk to PyPI with an @iris.guard decorator that wraps any async clinical agent function, submits outputs automatically, and optionally blocks on critical severity. Zero-line integration for agents that already exist.

ORION integration - ORION (Operating Room Intelligent Orchestration Node) is a voice-directed surgical assistant built on Gemini Live + ADK (won the Grand Prize for Gemini Live Agent Hackathon). IRIS was designed as the safety supervisor for exactly this class of agent. Connecting ORION as the first production IRIS client closes the loop from the evaluation paper to the operating room.

Built With

arize
arizephoenix
cloudrun
gcp
gemini
vertexai

Iris - Inference Risk and Integrity Supervisor