ORION — Operating Room Intelligent Orchestration Node [Live Agents + UI Navigation]

From Idea to Impact

The Problem - The surgeon's hands are locked in. They cannot type, click, or interact with any computer system.

Robotic surgery solves many physical limitations of open procedures, but creates a paradox: the surgeon gains precision but loses access. For hours at a time, their hands are locked on the da Vinci controls inside a sterile field. Every critical piece of information — patient labs, CT scans, drug safety, complication protocols — is one broken scrub or one distracted circulating nurse away.

The evidence:

  • WHO checklists are skipped under pressure — implementing the checklist reduces mortality by 47% and complications by 36%, yet consistent execution remains elusive in real OR conditions
  • Blood loss is underestimated by over 50% — surgeons are wrong by more than 25% in 95% of cases, delaying transfusion decisions at the moment they matter most
  • Operative notes get written from memory 15+ days later — vs. 28 minutes with real-time voice templates, sacrificing accuracy when it counts most for medicolegal and continuity-of-care purposes
  • 1 in 20 drug administrations has an error — 80% preventable with a simple cross-check against the patient's allergies and current medications
  • Critical View of Safety is rarely confirmed — only 23.1% of laparoscopic cholecystectomies have CVS documented before bile duct division, the single step that prevents the majority of bile duct injuries

These aren't edge cases. They are systematic, evidence-backed gaps that occur in operating rooms every day — and they are all solvable with voice-directed AI.


The Solution - a surgical co-pilot that listens, understands, thinks & ACTS on the surgeon's behalf

Surgeon's voice input → Live Agent's response → UI navigation

ORION is a voice-activated surgical co-pilot that listens continuously throughout the procedure. The surgeon speaks naturally — no button press, no sterile field break — and ORION responds in under a second with the right information on the console display and a calm, brief spoken confirmation.

The core insight: the Gemini Live API's native audio dialog model supports simultaneous PCM audio input, audio output, function calling, and real-time image streaming in a single bidirectional session.

That combination — listen, see, think, and speak simultaneously — is exactly what an OR co-pilot requires, making ORION the surgeon's hands on screen.

ORION maps each OR domain to a specialist: pre-op briefing, safety timeout, blood loss tracking, drug safety, anatomy guidance, complication protocols, operative documentation, SBAR handoff, and visual field analysis. The root orchestrator routes intent to the right agent in real time. The surgeon never thinks about routing — they just talk.


What It Does [Live Agents + UI Navigation]

ORION is a real-time surgical co-pilot that listens continuously throughout the procedure and responds to natural voice commands:

| Intent | Example Commands | Response / Action |
| --- | --- | --- |
| Patient data on demand | "Show allergies", "Display all labs" | Clinical cards appear instantly on the console |
| CT imaging navigation | "Jump to the tumor", "Next 5 slices" | Opens the CT-view panel, navigates to the requested slice or landmark |
| 3D anatomy reference | "Show the bronchus", "Rotate the model left", "Spin it on Y axis", "Reset the anatomy view" | Live anatomical context rendered in the 3D model panel |
| WHO Safety Timeout | "Run the timeout" | Guided checklist with verbal confirmation of all items |
| Pre-op briefing | "Brief me on this case" | 50-word structured summary from the patient record |
| Blood loss tracking | "Blood loss 200 mL", "How much blood have we lost?" | Running EBL with threshold alerts at 15%, 25%, 40% |
| Drug safety checks | "Is cefazolin safe for this patient?" | Allergy cross-check with alternatives if contraindicated |
| Complication protocols | "I have bleeding 1000 mL, how do I handle this complication?" | Step-by-step SCAT protocol read aloud, anatomy highlighted |
| Surgical phase checklist | "What phase are we in?", "Show vascular dissection checklist" | Phase checklist tile with steps and warnings |
| Anatomy guidance | "What's at risk here?", "What's the danger zone for this phase?" | Phase-aware anatomical pearls, CT landmark jump |
| Live visual analysis | "What do you see?", "Enter visual assistance mode", "Is there bleeding?" | Surgical video + full screen capture streamed to Gemini; the Visual Assistant reads the operative field, identifies structures, and reads external monitors and the EMR |
| Intraoperative documentation | "Log CVS confirmed", "Note: specimen removed" | Timestamped event log entry |
| Capture surgical photo | "Document this view", "Capture a photo" | Timestamped event log entry with captured image |
| Operative report | "Generate the report" | Narrative summary from the session log |
| SBAR handoff | "Prepare handoff" | Structured Situation, Background, Assessment, Recommendation sign-out checklist for shift changes |
| Hide selected/all panels | "Hide patient data", "Hide everything" | Hides the respective panel or all panels |

All outputs appear simultaneously as voice responses and visual cards on the surgical console. The surgeon never types, clicks, or breaks scrub.
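The blood-loss thresholds above (15%, 25%, 40%) can be illustrated with a minimal running-EBL sketch. This is not ORION's actual implementation; the 70 mL/kg estimated-blood-volume approximation is an assumption for illustration:

```python
# Minimal sketch of a running estimated-blood-loss (EBL) tracker.
# Assumption: estimated blood volume ≈ 70 mL per kg body weight
# (a common adult approximation); ORION's real formula may differ.

ALERT_THRESHOLDS = (0.15, 0.25, 0.40)  # fractions of estimated blood volume

class EBLTracker:
    def __init__(self, weight_kg: float):
        self.estimated_blood_volume = 70 * weight_kg  # mL
        self.total_ebl = 0.0
        self._fired = set()

    def log_loss(self, ml: float) -> list[str]:
        """Add a blood-loss entry and return any newly crossed alerts."""
        self.total_ebl += ml
        alerts = []
        for t in ALERT_THRESHOLDS:
            if t not in self._fired and self.total_ebl >= t * self.estimated_blood_volume:
                self._fired.add(t)
                alerts.append(f"EBL has reached {int(t * 100)}% of estimated blood volume")
        return alerts

tracker = EBLTracker(weight_kg=80)   # EBV = 5600 mL
tracker.log_loss(200)                # below all thresholds — no alert
alerts = tracker.log_loss(700)       # total 900 mL crosses 15% (840 mL)
```

Firing each threshold only once keeps the voice channel quiet until the next clinically meaningful level is crossed.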


Features

| Features | Features |
| --- | --- |
| ✅ Live Agents audio interaction | ✅ Barge-in handled naturally |
| ✅ Context-aware native audio dialog | ✅ UI Navigation: visual UI understanding & interaction |
| ✅ Custom voice persona | ✅ Grounding: prompt hardening & before/after tool callbacks |
| ✅ Live video streaming & screen share (1 fps send_realtime) | ✅ Error handling caught mid-stream |
| ✅ Multimodal: simultaneous input | ✅ Automated deployment |
| ✅ Transcription: input and output audio | ✅ ADK multi-agent & multi-tool orchestration |

How It Was Built

AI Core — Gemini Live API + Google ADK

The entire system runs on gemini-live-2.5-flash-native-audio via Vertex AI's Live API. This is the only model that supports simultaneous PCM audio input + audio output + function calling + image streaming in a single bidirectional session — exactly what a real-time OR environment demands.

Google ADK (v1.26.0) structures the intelligence as a nine-agent hierarchy:

  • ORION_Orchestrator (root) — receives all voice input, applies wake-word filtering, calls 22 direct tools for single-action commands, and routes to specialist agents via transfer_to_agent() for complex multi-step protocols
  • 8 specialist sub-agents: Briefing_Agent, Timeout_Agent, Report_Agent, Complication_Advisor, EBL_Tracker, Drug_Checker, Anatomy_Spotter, Handoff_Agent
  • Screen_Advisor (Visual Assistant) — ORION's visual intelligence layer; activated system-wide when vision commands are issued, receives both the live surgical video feed (320×240 at 1 fps) and a full screen capture stream (768×768 at 1 fps via getDisplayMedia)

Transport Layer — FastAPI WebSocket

Each browser connection runs two concurrent async tasks:

  • upstream_task — receives 16 kHz PCM audio chunks and JPEG image frames, buffers audio to 100ms chunks, forwards everything to Vertex AI via LiveRequestQueue
  • downstream_task — receives ADK events, serializes with model_dump_json(by_alias=True), and streams JSON to the browser
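The upstream buffering works out to simple arithmetic: at 16 kHz, 16-bit mono PCM, 100 ms of audio is 1600 samples = 3200 bytes. A minimal sketch of the coalescing loop, with plain asyncio queues standing in for the WebSocket and ADK's LiveRequestQueue (names are illustrative):

```python
import asyncio

# Sketch of the upstream buffering: 16 kHz * 2 bytes/sample * 0.1 s = 3200 bytes.
# Arbitrary-size WebSocket chunks are coalesced to 100 ms before forwarding.
CHUNK_BYTES = 16_000 * 2 // 10  # 3200 bytes per 100 ms

async def upstream_task(incoming: asyncio.Queue, live_queue: asyncio.Queue):
    buffer = bytearray()
    while True:
        data = await incoming.get()
        if data is None:                    # sentinel: connection closed
            if buffer:
                await live_queue.put(bytes(buffer))  # flush the remainder
            break
        buffer.extend(data)
        while len(buffer) >= CHUNK_BYTES:
            await live_queue.put(bytes(buffer[:CHUNK_BYTES]))
            del buffer[:CHUNK_BYTES]

async def demo() -> list[int]:
    incoming, live = asyncio.Queue(), asyncio.Queue()
    for _ in range(5):
        await incoming.put(b"\x00" * 1000)  # five 1000-byte mic chunks
    await incoming.put(None)
    await upstream_task(incoming, live)
    sizes = []
    while not live.empty():
        sizes.append(len(live.get_nowait()))
    return sizes

sizes = asyncio.run(demo())  # 5000 bytes → one full 3200-byte chunk + remainder
```

Coalescing to 100 ms keeps the request rate to Vertex AI bounded regardless of how the browser slices its microphone buffers.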

Grounding & Safety Layer

Every tool call passes through ADK before/after callbacks. Argument whitelists validate field names, landmark names, phase names, and structure names before any tool executes. The model is instructed never to state clinical values from memory — it always calls the tool.

Frontend — Surgical Console

Vanilla HTML/CSS/JS with no framework. Four-panel dynamic tile layout that expands/contracts as panels show and hide. Real-time routing log, live transcript, agent chip highlighting, tool call metrics. Three.js r128 for 3D GLB rendering. CT PNG slices rendered on canvas.

Infrastructure & CI/CD

Cloud Run + Cloud Build CI/CD. Every push to main automatically builds, pushes to Artifact Registry, and deploys. GCS hosts the CT slices, 3D model, and surgical videos.


Data Sources

| Asset | Source | License |
| --- | --- | --- |
| CT imaging (133 slices) | LIDC-IDRI-0001, The Cancer Imaging Archive | CC BY 3.0 |
| 3D anatomy model | NIH 3D Print Exchange / Sketchfab | - |
| Surgical videos | Open-access VATS lobectomy recordings | Per source license |
| Patient record | Synthetic FHIR-compliant demo data — no real clinical information | N/A |
| Drug database | Hardcoded pharmacology rules for 10 common intraoperative drugs | N/A |
| Complication protocols | Structured SCAT protocols derived from open surgical literature | N/A |

Google Cloud Services

| Service | Purpose |
| --- | --- |
| Vertex AI | Hosts gemini-2.5-flash-preview-native-audio-dialog — live audio, function calling, and image streaming |
| Cloud Run | Serverless container hosting for the FastAPI WebSocket backend |
| Cloud Build | CI/CD pipeline — auto-builds and deploys on every push to main |
| Artifact Registry | Stores Docker images built by Cloud Build |
| Cloud Storage (GCS) | Hosts CT scan slices, 3D anatomy GLB model, and surgical videos |

Challenges

Learning the Gemini Live API / ADK (the expected unknowns)

  • Multi-agent live sessions — In a run_live() session, ALL agents in the hierarchy must use a native audio model. gemini-2.5-flash (text-only) is silently accepted at definition time but causes runtime failures — discovered only after all agent code was written.
  • Sub-agent audio routing — Early builds filtered audio events by event.author === 'ORION_Orchestrator'. This silenced all sub-agent responses. ADK's multi-agent live flow has sub-agents generate the audio; audio must be forwarded from all authors.
  • FIRST_EXCEPTION vs FIRST_COMPLETED — Using asyncio.FIRST_COMPLETED killed multi-turn sessions after the first turn completed. FIRST_EXCEPTION (matching ADK's own implementation) was the fix.
  • transfer_to_agent is not a callable tool — Early versions defined it as a tool in the root agent's tools=[]. This caused ValueError: tool 'transfer_to_agent' not found. It's an ADK internal mechanism, not a user-defined tool.
  • getDisplayMedia() permission dialogs — Calling it on every Screen_Advisor activation triggered a browser permission dialog each time the agent was routed to. Solved by acquiring the stream once and keeping it alive across activations (activate/deactivate/teardown API), with getDisplayMedia() called only on first use.
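The FIRST_COMPLETED pitfall above can be reproduced with two toy tasks: a short-lived one (a single finished turn) and a long-lived session loop. A minimal illustration, not ORION's actual code:

```python
import asyncio

# Toy reproduction of the FIRST_COMPLETED vs FIRST_EXCEPTION pitfall.
# FIRST_COMPLETED returns as soon as the short task finishes — a naive
# teardown then cancels the still-running session. FIRST_EXCEPTION keeps
# waiting until everything finishes or something actually raises.

async def short_turn():
    await asyncio.sleep(0.01)  # one turn completes

async def long_session():
    await asyncio.sleep(0.05)  # the session should outlive single turns

async def wait_with(condition) -> int:
    tasks = {asyncio.create_task(short_turn()), asyncio.create_task(long_session())}
    done, pending = await asyncio.wait(tasks, return_when=condition)
    for t in pending:
        t.cancel()             # what a naive teardown would do
    return len(pending)        # number of tasks killed prematurely

killed_first_completed = asyncio.run(wait_with(asyncio.FIRST_COMPLETED))
killed_first_exception = asyncio.run(wait_with(asyncio.FIRST_EXCEPTION))
```

With no exception raised, FIRST_EXCEPTION behaves like ALL_COMPLETED, which is exactly what a multi-turn session needs.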

Architecture and Design Challenges

  • Zombie sessions — When downstream_task caught an exception and returned normally, upstream_task kept the WebSocket open indefinitely. The UI showed "active" but ORION had stopped responding. Fixed by re-raising after browser notification, triggering FIRST_EXCEPTION and clean teardown with auto-reconnect.
  • Screenshare deactivation on sub-agent routing — Vision mode deactivated whenever routing changed to a non-vision agent (e.g., Complication_Advisor). Fixed by adding complication and anatomy tools directly to Screen_Advisor so it handles those queries without transferring.
  • Continuous video cost — Sending 1 fps surgical video frames continuously throughout the session consumed significant Gemini input token budget. Vision mode is now system-managed: both streams activate only when Screen_Advisor is the active agent.
  • Tool call deduplication — The Live API occasionally fires duplicate function call events within milliseconds. A 4-second deduplication cache (Map<key, timestamp>) prevents double-execution of display tools.
  • CT/3D discoverability — The model didn't know that navigate_ct() and reset_3d_view() also show their respective panels (not just navigate/reset). Explicit examples in the root agent instruction resolved this.
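The deduplication cache above lives in the browser as a JavaScript `Map<key, timestamp>`; the same idea in Python, with an injectable clock so the 4-second window is easy to exercise (illustrative sketch, not the shipped code):

```python
import time

# Python mirror of the frontend's 4-second tool-call deduplication cache.
class DedupCache:
    def __init__(self, window_s: float = 4.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock  # injectable for testing
        self._seen: dict[str, float] = {}

    def should_execute(self, tool: str, args_key: str) -> bool:
        """False if the same tool + args fired within the dedup window."""
        key = f"{tool}:{args_key}"
        now = self.clock()
        last = self._seen.get(key)
        self._seen[key] = now
        return last is None or now - last > self.window_s

t = [0.0]
cache = DedupCache(clock=lambda: t[0])
first = cache.should_execute("navigate_ct", "tumor")  # first call executes
t[0] = 0.003
dup = cache.should_execute("navigate_ct", "tumor")    # millisecond duplicate dropped
t[0] = 5.0
later = cache.should_execute("navigate_ct", "tumor")  # window elapsed — executes
```

Keying on tool name plus serialized arguments means a legitimate repeat command (after the window) still goes through, while millisecond-level duplicate events from the Live API do not.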

Accomplishments

  • Built a fully working real-time multi-modal AI agent that can listen, see, think, and speak simultaneously in an environment (the OR) where latency and reliability matter more than almost anywhere else
  • Nine-agent hierarchy routing correctly across all surgical domains — briefing, timeout, blood loss, drug safety, complications, anatomy, documentation, handoff, and visual analysis — all from natural speech
  • Visual Assistant (Screen_Advisor) streams both surgical video and full screen capture to Gemini simultaneously, enabling the model to read external monitors, EMR screens, and operative fields without any API integration with hospital systems
  • A grounding layer (ADK before/after callbacks + argument whitelists) that prevents hallucination on clinical data — the model cannot state a lab value it didn't retrieve from a tool
  • Full Cloud Run deployment with automated CI/CD — push to main, service is live in ~3 minutes
  • Zero-click surgical console: the entire UI is driven by voice. The surgeon can navigate CT scans, rotate 3D models, run WHO protocols, capture photos, and generate operative reports without touching anything

What's Next

  • Real EHR integration — Replace synthetic patient data with FHIR-compliant live patient record pull.
  • Validated drug database — Replace the hardcoded pharmacology rules with a live API-backed formulary (e.g., FDA DailyMed) with real allergy cross-checks against the patient's current medications.
  • Post-op workflow — Extend ORION's session log into a structured FHIR operative note that can be pushed to the EHR directly at case close, solving the 15.6-day documentation delay problem end-to-end.

Built With

  • adk
  • cloudbuild
  • cloudrun
  • fastapi
  • geminiliveapi
  • googleartifactregistry
  • html
  • javascript
  • python
  • vertexai
  • websocket