Inspiration

Every clinician managing a monitored patient faces the same impossible task: simultaneously tracking 8+ vital parameters, interpreting trends, managing drugs and fluids, and responding to rapidly evolving clinical events — often for hours, often across multiple patients.

The numbers are alarming. 23–30% of critical monitor alarms are missed or acknowledged late during high-workload phases. Alarm fatigue causes clinicians to ignore 85–99% of monitor alarms in ICU and OR settings. ICU nurses manage 3–4 patients simultaneously, each generating 150+ data points per hour. PACU nurses miss early respiratory depression in 1 in 5 postoperative patients within the first hour.

And the existing solutions don't help. Proprietary monitor integrations (HL7, DICOM, vendor SDKs) cost $50K–$250K per system, require 6–18 months of IT deployment, and don't work across monitor brands. Low-resource facilities, field hospitals, and transport teams are locked out entirely.

We asked a simple question. Every monitor in every hospital displays the same thing: numbers on a screen and waveforms on a trace. Why can't AI just read the screen the way a human does?

That insight — that the camera is the universal medical device interface — became the foundation of Jarvis OR Guardian.

What it does

Jarvis OR Guardian is a multimodal AI copilot that watches the patient monitor so clinicians can watch the patient — across operating rooms, ICUs, emergency departments, PACUs, labor & delivery suites, procedural sedation areas, and transport environments.

Point any phone or tablet camera at any patient monitor. Select your clinical setting. Jarvis becomes your ambient clinical copilot.

Every 1–5 seconds, Jarvis:

  1. Captures a frame from the camera (or accepts manual vitals entry as fallback)
  2. Extracts all visible vital signs via Gemini 2.5 Flash multimodal vision
  3. Calculates derived indicators — MAP (mean arterial pressure) and Shock Index (HR ÷ SBP)
  4. Compares against the patient's baseline and 9 configurable alarm thresholds
  5. Tracks trends across 4 dedicated graphs — Hemodynamic, Respiratory, Temperature, and Fluid Balance & Shock Index
  6. Monitors fluid balance (blood loss, IV fluids, transfusions, urine output) and drug administration
  7. Alerts through 4 severity tiers (NONE → WATCH → CONCERN → CRITICAL) with voice alerts, vibration, and visual overlay
  8. Guides with differential diagnosis, immediate checks, and suggested actions
  9. Predicts clinical risks — hypotension, ICU admission, sepsis, acute kidney injury
  10. Exports a complete clinical monitoring report as a one-click PDF
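
The derived indicators in step 3 are simple arithmetic. As a sketch, using the standard clinical formulas (the function names are ours, not from the codebase):

```python
def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """Standard estimate: MAP ≈ DBP + (SBP - DBP) / 3."""
    return dbp + (sbp - dbp) / 3.0

def shock_index(hr: float, sbp: float) -> float:
    """Shock Index = HR / SBP; values above ~0.9 suggest early shock."""
    return hr / sbp

# Example: HR 120, BP 90/60
map_val = mean_arterial_pressure(90, 60)   # 70.0
si = shock_index(120, 90)                  # ≈ 1.33, elevated
```

A Shock Index above ~0.9 with a still-normal blood pressure is exactly the "early shock before overt hypotension" signal described below.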

Clinicians can also speak naturally to Jarvis through Gemini Live bidirectional voice — reporting events, asking clinical questions, and receiving spoken responses in real time, all fused with the full patient context.

What sets Jarvis apart from a simple threshold alarm

  • Jarvis considers baseline-relative deviations (HR 110 is normal for one patient, alarming for another)
  • Jarvis tracks trends (a slowly declining MAP over 8 minutes is more concerning than a transient dip)
  • Jarvis calculates Shock Index — detecting early shock before overt hypotension
  • Jarvis fuses clinical context ("spinal anesthesia placed" explains a BP drop; blood loss logged explains rising heart rate)
  • Jarvis provides differential diagnosis and action steps, not just a beep

How we built it

The core insight: camera as sensor

We realized that instead of fighting proprietary monitor APIs, we could use the phone camera as a universal sensor. A single JPEG frame sent to Gemini 2.5 Flash can OCR vital sign numbers, interpret ECG morphology, read SpO₂ pleth quality, detect alarm states on the monitor, and reason about clinical meaning — all in one inference call.
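
A minimal sketch of that single inference call using the google-genai SDK. The JSON schema keys, prompt wording, and `frame.jpg` filename are illustrative assumptions, not the production prompt:

```python
def build_vitals_prompt() -> str:
    """Instruction asking Gemini for machine-parseable JSON (keys illustrative)."""
    return (
        "Read every visible vital sign on this patient monitor. "
        "Respond with raw JSON only, no markdown fences, using keys: "
        "hr, spo2, sbp, dbp, rr, etco2, temp_c. Use null for unreadable values."
    )

if __name__ == "__main__":
    # Requires `pip install google-genai` and a GEMINI_API_KEY in the environment.
    from google import genai
    from google.genai import types

    client = genai.Client()
    with open("frame.jpg", "rb") as f:
        jpeg = f.read()
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[types.Part.from_bytes(data=jpeg, mime_type="image/jpeg"),
                  build_vitals_prompt()],
    )
    print(resp.text)
```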

Architecture

The system is deliberately simple:

| Layer | Technology | Why |
|---|---|---|
| AI Vision & Reasoning | Gemini 2.5 Flash (multimodal) | Best-in-class vision OCR with structured JSON output, ~3s latency |
| Backend | FastAPI + Uvicorn | Async Python, 4 endpoints, completely stateless |
| Frontend | Vanilla HTML / CSS / JS | Zero build step, no npm, no framework — just a single-page clinical dashboard |
| Camera | getUserMedia API | Native browser, rear camera, no SDK |
| Voice I/O | Gemini Live API | Real-time bidirectional native audio via WebSocket |
| Charts | Chart.js (4 graphs) | Hemodynamic, Respiratory, Temperature, Fluid/Shock Index trends |
| PDF Export | jsPDF (client-side) | One-click clinical monitoring report, no server round-trip |
| Deployment | Docker → Google Cloud Run | Single container, auto-scaling, health endpoint |

The entire codebase is ~5,100 lines across 4 core files. No external AI SDKs beyond google-genai. The backend is stateless — all clinical state lives in the browser, making deployment trivial and eliminating the need for a database.

Key technical decisions

  • ROI bounding box: The clinician drags a box over the monitor region on their phone screen. We crop to that region before sending to Gemini — better OCR accuracy, lower token cost, less visual noise.
  • Structured JSON prompts: We engineered Gemini prompts to return machine-parseable JSON, not narrative text. This lets us pipe extracted vitals directly into trend buffers, alarm engines, and chart updates.
  • Manual entry fallback: When the camera isn't available (bad angle, ICU rounds, transport), clinicians can type vitals directly. The same trend analysis, alarm detection, and clinical reasoning pipeline processes manual data identically.
  • Client-side state: All vitals history, trend data, fluid balance, drug logs, and clinical events live in the browser. This means the backend is fully stateless — no database, no session management, trivially deployable to Cloud Run.
  • 4-tier alert engine: We don't just threshold-alarm. The alert engine considers baseline deviation percentages, multi-parameter correlation (HR up + MAP down = more alarming than either alone), and trend trajectory to escalate through NONE → WATCH → CONCERN → CRITICAL.
  • Setting-adaptive reasoning: Selecting "ICU" vs "OR" vs "ED" changes event button categories, risk model parameters, and the clinical reasoning context sent to Gemini. A ventilated ICU patient gets different clinical guidance than a post-spinal anesthesia OR patient.
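
The baseline-relative, multi-parameter escalation can be sketched as a pure function. The percentage thresholds and tier rules below are illustrative stand-ins, not the production values:

```python
def deviation_pct(current: float, baseline: float) -> float:
    """Deviation from the patient's own baseline, as a percentage."""
    return abs(current - baseline) / baseline * 100.0

def alert_tier(vitals: dict, baseline: dict,
               mild_pct: float = 15.0, critical_pct: float = 35.0) -> str:
    """Escalate on how many parameters deviate from baseline, not on
    absolute values. Thresholds here are illustrative."""
    mild = critical = 0
    for name, value in vitals.items():
        base = baseline.get(name)
        if base is None or value is None:
            continue
        pct = deviation_pct(value, base)
        if pct >= critical_pct:
            critical += 1
        elif pct >= mild_pct:
            mild += 1
    if critical:
        return "CRITICAL"
    if mild >= 2:
        return "CONCERN"   # correlated deviations, e.g. HR up + MAP down
    if mild == 1:
        return "WATCH"
    return "NONE"
```

With a baseline of HR 70 / MAP 85, a single mild deviation (HR 85) yields WATCH, while HR 85 plus MAP 70 together yield CONCERN, matching the multi-parameter correlation rule above.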

The clinical monitoring dashboard

The dashboard packs a complete clinical monitoring system into a single screen:

  • 8 vital cards with color-coded status (HR, SpO₂, NIBP, MAP, EtCO₂, RR, Temp, Shock Index)
  • 4 time-series trend graphs (Hemodynamic, Respiratory, Temperature, Fluid Balance & Shock Index)
  • Fluid balance tracker with auto-calculated net balance
  • Drug administration log across 12 categories with timestamps
  • Clinical events timeline with setting-adaptive quick-log buttons (50+ event types)
  • 9 configurable safety alarm thresholds with visual alarm badges
  • Clinical insight cards with 7-section structured reasoning (trend interpretation, physiology, differentials, checks, actions)
  • Clinical risk prediction bars (Hypotension, ICU Admission, Sepsis, AKI)
  • AI copilot chat with streaming responses fused with full patient context
  • Gemini Live voice for hands-free bidirectional conversation
  • One-click PDF export of the complete clinical monitoring session

Patient intake

Before monitoring begins, clinicians fill a structured intake form covering patient demographics, clinical setting selection, procedure details, ASA physical status, comorbidities (12 condition checkboxes), medication status (7 categories), comprehensive anesthesia technique selection (30+ options across Neuraxial, Peripheral Regional, General, Sedation/MAC, and Combined categories), monitoring plan, and baseline vitals. All of this context feeds into every subsequent Gemini analysis and clinical reasoning call.
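
One way such intake context could be carried and serialized into each reasoning call is sketched below; the field names, subset of fields, and string formatting are assumptions, not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PatientContext:
    """Illustrative subset of the intake fields fused into each Gemini call."""
    age: int
    setting: str                      # "OR", "ICU", "ED", "PACU", ...
    asa_status: str                   # e.g. "ASA III"
    comorbidities: list = field(default_factory=list)
    anesthesia_technique: str = ""
    baseline_vitals: dict = field(default_factory=dict)

    def as_prompt_context(self) -> str:
        """Flatten the intake into a context line prepended to each prompt."""
        return (f"{self.age}y patient, setting={self.setting}, {self.asa_status}, "
                f"comorbidities={', '.join(self.comorbidities) or 'none'}, "
                f"technique={self.anesthesia_technique or 'n/a'}, "
                f"baseline={self.baseline_vitals}")
```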

Challenges we ran into

1. OCR accuracy across monitor brands

Different monitors (Philips, GE, Dräger, Mindray) display vitals in wildly different layouts, fonts, colors, and screen arrangements. Early testing showed Gemini sometimes misread values when the image included too much visual noise — alarm banners, waveform traces, and adjacent bed information.

Solution: The ROI bounding box lets clinicians crop to just the vital sign region. We also added a pre-flight camera check — before monitoring begins, Gemini analyzes the frame and reports which parameters it can and can't read, along with quality issues (glare, blur, angle). This sets expectations upfront and lets the clinician reposition.

2. Structured output reliability

Gemini occasionally returned values wrapped in markdown code fences, or added narrative text around the JSON. This broke our parsing pipeline.

Solution: We built a robust parse_gemini_json function that strips markdown fences and extracts clean JSON, plus prompt engineering that explicitly requests "raw JSON, no markdown fences." The combination brought parsing reliability to near 100%.
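
A minimal sketch of what a parse_gemini_json along these lines might look like; the exact production logic may differ:

```python
import json
import re

def parse_gemini_json(text: str) -> dict:
    """Strip markdown code fences Gemini sometimes adds, then parse the
    first JSON object found in the response."""
    # Remove ```json ... ``` fences if present
    text = re.sub(r"```(?:json)?", "", text).strip()
    # Fall back to the outermost {...} if narrative text surrounds the JSON
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```

This handles both failure modes described above: fenced output and JSON embedded in narrative text.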

3. Latency vs. safety

Each Gemini vision round trip takes 3–8 seconds. For a clinical safety system, that's an eternity. We needed to make clear that Jarvis is a copilot, not a real-time safety interlock.

Solution: We implemented client-side threshold alarms that fire instantly (no API call needed) when vitals cross safety boundaries. The Gemini analysis provides the higher-order clinical reasoning — differentials, trend interpretation, context-aware guidance — while the alarm system provides immediate alerts. The two systems complement each other.
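
The instant client-side check can be sketched as a pure function (shown in Python for brevity; the production check runs in the browser's JavaScript, and the limit structure here is an assumption):

```python
def instant_alarms(vitals: dict, limits: dict) -> list:
    """Pure threshold check that runs locally on every sample, with no
    network round trip; the limit values below are illustrative."""
    breaches = []
    for name, (low, high) in limits.items():
        value = vitals.get(name)
        if value is None:
            continue
        if value < low or value > high:
            breaches.append(f"{name}={value} outside [{low}, {high}]")
    return breaches
```

Because this never waits on Gemini, a breach fires in the same event-loop tick as the vitals update, while the slower vision call supplies the reasoning behind it.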

4. Alert fatigue in the AI system

We were reproducing the exact problem we set out to solve — too many alerts. Early versions triggered WATCH-level alerts constantly, desensitizing the clinician.

Solution: We implemented baseline-relative thresholds (not just absolute values), multi-parameter correlation (single mild deviation = WATCH; two simultaneous deviations = CONCERN), and trend-aware escalation (a value that's been stable at a mild deviation doesn't re-alert). The 4-tier cascade with distinct audio, visual, and haptic responses for each tier gives clinicians proportional urgency cues.

5. Voice I/O in a clinical environment

Clinical environments are noisy. We needed voice interaction that works hands-free during procedures when the clinician's hands are in the sterile field.

Solution: We integrated Gemini Live API for native bidirectional audio via WebSocket — the clinician speaks naturally and Jarvis responds with Gemini's native voice. For critical voice alerts (CRITICAL tier), we use browser TTS as a reliable fallback that doesn't require a network round trip.

6. Building a complete clinical system in a hackathon

The scope was enormous — patient intake, camera management, vision analysis, manual entry, 4 trend graphs, fluid balance, drug logging, 50+ clinical event types across 7 settings, alarm systems, risk prediction, chat, voice, PDF export, and PWA support.

Solution: No framework. Vanilla HTML/CSS/JS with zero build step. FastAPI with 4 endpoints. Every feature was built to be independently functional — if camera fails, manual entry works; if voice fails, chat works; if risk prediction hasn't run, trends and alarms still function. This modularity let us build incrementally without any feature blocking another.

Accomplishments that we're proud of

  • Zero-integration monitoring — works on any patient monitor in any hospital without touching IT infrastructure
  • Complete clinical workflow — from patient intake to PDF report export, covering the full monitoring session lifecycle
  • Setting-adaptive intelligence — the same system adapts its reasoning, events, alarms, and risk models for OR, ICU, ED, PACU, L&D, Procedural Sedation, and Transport environments
  • ~5,100 lines of code across 4 files — a complete multimodal AI clinical monitoring system with no AI SDK dependencies beyond google-genai
  • Deployed and live on Google Cloud Run — accessible from any device with a browser

What we learned

  • The camera-as-sensor paradigm is surprisingly powerful. Gemini 2.5 Flash can reliably extract vital signs, interpret waveform morphology, and detect monitor alarm states from a phone camera JPEG. This eliminates the entire integration problem that has stalled clinical AI monitoring for decades.
  • Clinical AI needs context, not just data. A heart rate of 110 means very different things for a 25-year-old undergoing appendectomy vs. a 78-year-old with heart failure in the ICU. Fusing patient baseline, clinical setting, comorbidities, medications, events, and trends into every analysis call dramatically improves the quality of clinical reasoning.
  • Client-side state is an underrated architectural pattern. By keeping all clinical data in the browser, we avoided database complexity, session management, and HIPAA-related server storage concerns — while making the backend trivially deployable and stateless.
  • Prompt engineering is clinical engineering. The quality of Gemini's clinical reasoning depends heavily on how we structure the prompt — what context we include, how we format the JSON schema, and how we calibrate alert thresholds in the prompt instructions.

What's next for Jarvis OR Guardian

  • Multi-center validation — testing OCR accuracy against manual vital recording across monitor brands in real OR, ICU, and PACU settings
  • Multi-patient ICU dashboard — allowing nurses to monitor 3–4 patients simultaneously with cross-patient deterioration alerting
  • Edge inference — on-device Gemini Nano for offline capability and sub-second latency (critical for transport and field medicine)
  • Automated Aldrete scoring in PACU for discharge readiness prediction
  • LMIC deployment — offline-capable version for low-resource settings where 5.7 billion people lack access to safe surgical care
  • Clinical trial — IRB-approved study measuring alert response time improvement and missed-deterioration rates

Built With

  • Gemini 2.5 Flash (multimodal vision + reasoning)
  • Gemini Live API (native bidirectional audio)
  • FastAPI + Uvicorn
  • Vanilla HTML / CSS / JavaScript
  • Chart.js
  • jsPDF
  • Docker
  • Google Cloud Run
  • Python
