IRIS - Inference Risk and Integrity Supervisor [GCP Agent Builder + Arize]
Hosted URL: https://iris-clinical-797220546460.us-central1.run.app/
Repo: https://github.com/adityashukla8/iris
Demo video: https://youtu.be/vlp-cjAwNNU
Note for Judges: When trying out the live dashboard site, please reset the session by clicking "Reset for New Demo" to start with a fresh session state (currently, session management isn't implemented).
From Idea to Impact
The problem - Clinical AI agents are being deployed in operating rooms and ICUs with inadquate runtime safety layer
Clinical AI is entering the highest-stakes environments in medicine. Agents are recommending drugs, dosing regimens, and care plans directly to providers. When they hallucinate, they do so silently - no alarm fires, no circuit breaker trips, and no one knows until a nurse catches it or, worse, until they do not.
The evidence is unambiguous:
- 91.8% of clinicians surveyed reported encountering medical AI hallucinations; 84.7% said those hallucinations were capable of causing direct patient harm (MIT Media Lab et al., medRxiv 2025, n=70)
- 83% of cases where leading LLMs repeated or elaborated on a single planted clinical error - fabricated lab values and conditions propagating into unsafe recommendations (Communications Medicine, Nature, 2025, n=300 vignettes)
- $42B annual global cost of medication errors, with AI agents now recommending drugs and doses in the same clinical environments (WHO - Medication Without Harm)
Beyond catching individual failures: there is no mechanism to detect that the same type of failure has happened 15 times this week, identify the root cause, fix it, and verify the fix works - all without a prompt engineer filing a ticket. Clinical AI deployments accumulate failures invisibly. The prompt that caused harm today will cause the same harm tomorrow.
The Solution
IRIS is a self-healing observability layer that sits between clinical AI agents and their users.
Every agent output is submitted to IRIS, which evaluates it across 7 clinical safety evaluators in real time, traces the result to Arize Phoenix, and - when a recurring failure pattern is detected
- autonomously rewrites the failing system prompt
- validates the fix against real captured failures
- deploys the improved version back to Phoenix.
The source agent gets safer over time without human intervention.
The thesis: hospitals need a purpose-built trust and safety supervisor for clinical AI, not generic observability tooling.
The difference is clinical grounding (IRIS knows that metformin 10,000 mg is dangerous and 2,000 mg is not, because it checks RxNorm and OpenFDA) combined with a closed remediation loop (not just "here is what went wrong" but "here is the fixed prompt, here is proof it works, here is the before/after score").
What It Does [ADK + Vertex AI + Arize Phoenix MCP]
IRIS evaluates every clinical AI output the moment it is produced and closes the improvement loop automatically:
Real-time evaluation
Any clinical agent submits its output via POST /event. IRIS runs 7 safety evaluators concurrently using Gemini 2.5 Flash on Vertex AI, each grounded in trusted clinical sources:
| Evaluator | What it catches | Knowledge source |
|---|---|---|
factual_hallucination |
Drug names not in RxNorm; impossible values; invented procedures | RxNorm API + LLM judge |
dosage_boundary |
Doses outside FDA-approved range; missing renal adjustment | OpenFDA API + LLM judge |
drug_interaction |
Dangerous drug-drug combinations in the regimen | OpenFDA + LLM judge |
allergy_contraindication |
Drug from a class the patient is allergic to | Allergy context + LLM judge |
attribution |
Claims not traceable to the patient record | LLM judge |
context_gap |
Clinical question answered without required patient variables | LLM judge (dynamic inference) |
surgical_phase |
Recommendations inappropriate for the current surgical phase | Phase taxonomy + LLM judge |
Results are written as OTel span attributes and REST annotations to Arize Phoenix Cloud in real time. The live dashboard receives a Server-Sent Events stream of every alert.
Autonomous self-healing
When a failure cluster crosses the configured threshold, IRIS fires an event-driven scan:
- The
pattern_detectorADK LlmAgent reads recent spans and annotations through the Arize Phoenix MCP server (get-spans,get-span-annotations) to identify the failure cluster - which agent, which query type, which failure pattern. - A diagnosis step (Gemini 2.5 Flash) extracts root cause from real captured failure traces and logs them to a Phoenix dataset.
- The mutation engine generates textual gradients per failing example, then synthesizes a targeted constraint for the system prompt.
- The validation gate re-runs the candidate prompt against 5 real captured failures using Gemini 2.5 Flash at
temperature=0.0, seed=42. The candidate must improve the mean safety score by the configured improvement threshold to pass. - The validated prompt is deployed to Phoenix as a new versioned version tagged
production. The agent fetches it on the next run.
Closing the visible loop - Live Agent Mode
For the demo purposes, the dashboard's simulation modal has two modes:
- Synthetic Response mode: it runs pre-scripted unsafe outputs - a failure baseline.
- Live Generation mode: IRIS generates each clinical answer in real time using Gemini 2.5 Flash under the current production prompt fetched from Phoenix (real-world scenario).
The Agent Improvement panel on the Overview tab shows both runs side by side: pass rate before and after the heal, critical delta, and the prompt hash that changed.
MCP probe
A second ADK LlmAgent - mcp_probe - is exposed at POST /mcp/chat. It gives operators a conversational interface to their Phoenix observability data: spans, prompts, datasets, experiments, evaluations. 10 read-only Phoenix MCP tools.
Features
| Feature | Detail |
|---|---|
| Google ADK agents | Two ADK LlmAgents - pattern_detector introspects its own Phoenix observability data through the MCP server; mcp_probe exposes 10 read-only Phoenix MCP tools via conversational chat |
| Gemini 2.5 Flash + Pro on Vertex AI | Flash drives all 7 evaluator judges, the mutation engine, and live generation; Pro drives the pattern detector (required for Phoenix MCP tool schemas) |
| TextGrad-style mutation engine | Per-example textual gradients generated by Gemini, synthesized into a targeted system prompt constraint |
| Counterfactual validation gate | Candidate prompt re-evaluated against real captured failures with Gemini at temperature=0.0, seed=42 before deployment |
| Live Generation mode | Clinical answers generated in real time by Gemini under the current production prompt; improvement is measured, not assumed |
| Automated Cloud Run deploy | CI/CD via Cloud Build; single warm instance for SSE state and MCP spawn consistency |
| Arize Phoenix OTel tracing | Per-evaluator span attributes and REST annotations written to Phoenix Cloud on every event |
| Phoenix prompt versioning | Every system prompt version is content-hashed, tagged production or candidate, and stored in Phoenix; live agent fetches the current tag at run time |
| Phoenix MCP integration | @arizeai/phoenix-mcp@4.0.8 with before/after ADK tool callbacks - 40+ bloat attributes stripped from responses for 50-80% context size reduction |
| Before/after comparison panel | Pass rate delta, critical delta, and prompt hash change shown side by side - sourced from Phoenix prompt tags |
| Event-driven scan trigger | Autonomous heal fires when a failure cluster crosses threshold; no cron job needed |
| Human-in-the-loop option | HEALING_AUTO_APPROVE=false queues candidates for manual approval via the dashboard |
| Clinical grounding | RxNorm API validates every drug name deterministically; OpenFDA provides approved dosage ranges and interaction data |
| 9-scenario simulation suite | Covers drug dosage errors, hallucinated drugs, missed allergies, drug interactions, context gaps, surgical phase violations |
| Real-time dashboard | SSE alerts, evaluator heatmap, severity timeline, trace inspector with full prompt visibility, MCP chat |
How It Was Built
Evaluation layer
Seven EvalPlugin subclasses in core/evaluators/ each receive an IrisEvent and return an EvalResult with score (0-10), severity, passed flag, rationale, and flagged claims. core/evaluators/service.py runs all applicable evaluators as concurrent asyncio.gather tasks. The worst-severity result drives the alert.
factual_hallucination is a hybrid: it first calls the RxNorm API to verify every drug name the agent mentioned (a deterministic CRITICAL if any drug is unknown), then uses Gemini 2.5 Flash as an LLM judge for the broader hallucination check. dosage_boundary and drug_interaction query OpenFDA drug labels for the specific drugs mentioned in the output.
All LLM judge calls use temperature=0.0, seed=42. This was essential for demo reproducibility - without seeding, the same borderline output would score 6.8 one run and 7.2 the next, making before/after comparison meaningless.
ADK agents and MCP integration
pattern_detector is a Gemini 2.5 Pro LlmAgent with McpToolset pointing at @arizeai/phoenix-mcp@4.0.8 (launched via npx). Pro is required here - Gemini 2.5 Flash emits MALFORMED_FUNCTION_CALL against Phoenix MCP's complex tool schemas.
A before_tool_callback clamps the limit argument on get-spans calls (the LLM sometimes inflates it beyond the MCP server's allowed range, which wastes a full tool round-trip with a too_big error). An after_tool_callback strips 40+ bloat attributes from MCP responses before they enter the LLM context - Phoenix stores full Gemini request/response JSON on every ADK span, which for 10 spans runs to 200K+ characters and approaches Gemini's context limit.
The pattern detector has a deterministic clustering fallback: if the MCP call fails (Phoenix Cloud outage, network hiccup), it clusters by (agent, query_type, severity) directly from recent_traces in memory.
Self-healing pipeline
Everything between pattern detection and prompt deployment is deterministic Python - no ADK agent involved. This was a deliberate design choice: the healing pipeline must be reliable and auditable. The ADK agent is right for open-ended MCP introspection; a deterministic pipeline is right for the critical path of rewriting a prompt that will run against patients.
The mutation engine uses a TextGrad-style two-pass approach: Gemini generates a textual gradient per failing example ("what instruction would have prevented this specific failure"), then synthesizes the per-example gradients into a single targeted constraint. This produces specific, actionable constraints ("Always verify gentamicin dosing against renal function before recommending") rather than generic additions.
Validation re-runs the candidate prompt against 5 real captured failures using RESPONDER_PROMPT - the exact same template used in live mode, with no safety scaffold. The gate measures what the candidate actually produces under live conditions, not what it produces with extra hand-holding. This alignment between validation and live mode was what made the improvement measurable.
Infrastructure
FastAPI with SSE for real-time alert streaming. Cloud Run with --min-instances=1 (state is in-memory; cold start + npx MCP spawn on first judge visit would be unacceptable) and --max-instances=1 (in-memory deques would shard across instances otherwise). Cloud Build CI/CD triggers on every push to main.
Data Sources
| Source | How IRIS uses it |
|---|---|
| RxNorm API (NLM) | Validates every drug name mentioned in a clinical AI output; CRITICAL if any name is not in the RxNorm concept database |
| OpenFDA drug labels API | Provides FDA-approved dosage ranges and drug-drug interaction data for the specific drugs mentioned |
| Arize Phoenix Cloud | Source of truth for all evaluation spans and annotations; the pattern detector reads its own observability data through Phoenix MCP to identify failure clusters |
| Synthetic patient records | Nine demo scenarios use synthetic FHIR-style patient records with plausible clinical parameters (creatinine clearance, allergies, current medications, surgical phase) - no real patient data |
Google Cloud Services
| Service | Purpose |
|---|---|
| Vertex AI | Hosts all Gemini calls - Gemini 2.5 Flash for the 7 evaluators, live agent generation, and mutation engine; Gemini 2.5 Pro for the ADK pattern detector agent |
| Cloud Run | Serverless container for the FastAPI service; single warm instance for SSE state and MCP spawn consistency |
| Cloud Build | CI/CD pipeline - builds Docker image, pushes to Artifact Registry, deploys to Cloud Run on every push to main |
| Artifact Registry | Docker image storage |
| Secret Manager | Stores PHOENIX_API_KEY and PHOENIX_CLIENT_URL; injected at deploy time by Cloud Build |
Phoenix Services
| Service | How IRIS uses it |
|---|---|
| OTel spans (OpenInference) | Every POST /event produces a span with all 7 evaluator scores, severities, rationales, and flagged claims as span attributes |
| REST span annotations | Structured per-evaluator annotations via /v1/span_annotations written after each evaluation - gives Phoenix's annotation UI per-evaluator pass/fail cards |
| Prompt versioning + tagging | Every system prompt version is stored in Phoenix under {agent}-system (e.g. care-advisor-v2-system); healed prompts are tagged production or candidate; the live agent fetches the current production tag at run time |
| Datasets | Real failing traces are logged to a Phoenix dataset per heal cycle for reproducibility and audit |
| Experiments | Heal validation runs are logged as Phoenix experiments - before/after scores, per-example results, improvement delta all visible in the Phoenix UI |
| Phoenix MCP server | The pattern_detector ADK LlmAgent and mcp_probe agent connect via @arizeai/phoenix-mcp@4.0.8; 10 read-only tools expose spans, annotations, prompts, datasets, experiments, and evaluations to the LLM |
Challenges
MCP response bloat
Phoenix stores the full Gemini request/response JSON on every ADK-generated span as gcp.vertex.agent.llm_request and gcp.vertex.agent.llm_response. For 10 spans this is 200K+ characters (~50K tokens), which approaches Gemini's context limit before the agent has written a single word of analysis. The solution was an after_tool_callback in mcp_filter.py that strips 40+ named attributes and all attribute key prefixes in the bloat set from every MCP response before it enters the LLM context - a 50-80% size reduction with no loss of information the agent needs.
Determinism in evaluation and validation
Early demo runs produced different severity verdicts on identical inputs because LLM judges at non-zero temperature fluctuate near decision boundaries. A score of 6.8 on one run and 7.2 on the next (pass threshold is 7.0) made the demo unreliable and the before/after comparison meaningless. Setting temperature=0.0, seed=42 on all evaluator and validation generation calls made scores reproducible across runs.
Event-driven scan trigger timing
The auto-scan trigger was silently failing to fire even after 3 critical events from the same agent. Two bugs: the current event was not yet in recent_traces when the threshold check ran (dispatch_alert is called before appendleft), so the count was always one short; and the cluster count was scoped per query_type, so 3 failures of different query types from the same agent never crossed the threshold. Fixed by counting across all query types and adding +1 for the current event not yet in recent_traces.
Accomplishments
- A fully working closed loop: clinical AI output submitted, evaluated in real time across 7 dimensions, failure cluster detected from its own observability data via Phoenix MCP, root cause diagnosed, system prompt mutated, validated against real captured failures, deployed to Phoenix, live agent generates under the improved prompt, before/after improvement measured and displayed. End to end, from first critical alert to improved pass rate, with no human intervention.
- Clinical grounding that goes beyond LLM judgment: RxNorm validation catches hallucinated drug names deterministically; OpenFDA provides actual FDA-approved dosage ranges for the specific drug and patient renal function. The evaluators know clinical facts, not just fluency.
- The mutation engine produces targeted constraints from real failures ("always verify gentamicin dosing against CrCl before recommending") rather than generic additions. The validation gate uses seed=42 deterministic generation so improvement scores are reproducible.
- MCP integration depth: the pattern detector agent uses Phoenix MCP not just to read spans but to read its own evaluation annotations, identify which evaluator flagged which agent, and determine whether a failure cluster is genuine or a one-off. This is the self-improvement-from-own-observability-data criterion made concrete.
- The MCP filter callback design: rather than limiting what the agent can query, IRIS lets the agent request whatever it needs and strips the irrelevant bloat on the way in. The agent gets full context; the LLM context window gets protected.
What's Next for IRIS
Persistence layer - Replace the in-memory state (recent_traces deque, healing history) with SQLite behind the existing state.py API, then Firestore for multi-instance scale. This removes the single-instance Cloud Run constraint and makes audit trails durable across restarts.
Multi-agent fleet view - A /agents endpoint and fleet dashboard table showing per-agent health, active prompt hash, recent pass rate, and time since last heal. The per-agent namespacing is already in place; the fleet view surfaces it.
Statistical heal validation - Current validation is N=5 mean delta. Replace with paired bootstrap confidence intervals and automatic rollback if the live pass rate after deployment does not match the validation estimate within a tolerance band.
SDK packaging - Publish iris-sdk to PyPI with an @iris.guard decorator that wraps any async clinical agent function, submits outputs automatically, and optionally blocks on critical severity. Zero-line integration for agents that already exist.
ORION integration - ORION (Operating Room Intelligent Orchestration Node) is a voice-directed surgical assistant built on Gemini Live + ADK (won the Grand Prize for Gemini Live Agent Hackathon). IRIS was designed as the safety supervisor for exactly this class of agent. Connecting ORION as the first production IRIS client closes the loop from the evaluation paper to the operating room.
Built With
- arize
- arizephoenix
- cloudrun
- gcp
- gemini
- vertexai
Log in or sign up for Devpost to join the conversation.