Inspiration

Healthcare workers in rural clinics, home-care settings, and resource-limited hospitals often work alone — without a senior colleague to ask "Does this wound look infected?" or "Is this ECG concerning?" A single nurse in a remote village might be the only trained hand for 50 km, performing IV cannulations, wound dressings, and vital assessments with no real-time backup. Meanwhile, patients themselves struggle to understand their medications — "Can I take this with food?" "What are the side effects?" We built MediSense because we believed Gemini's multimodal Live API could become that missing colleague — one who can see through the camera, hear the nurse's voice, and respond instantly with clinical guidance, all in a natural, hands-free conversation.

What it does

MediSense is a real-time multimodal AI copilot for healthcare that operates in two modes:

Nurse Mode — A clinical assistant that watches live camera/screen feeds while conversing naturally via voice. It guides procedures step-by-step (IV cannulation, wound dressing, vitals assessment, catheter care, blood glucose monitoring), automatically verifying each step through the camera and flagging issues ("Flashback not visible in chamber", "Tourniquet applied too loosely"). It computes clinical risk scores (NEWS2, qSOFA), raises urgent alerts for critical findings, generates AI-powered differential diagnoses, produces SBAR handover notes, and maintains a timestamped clinical log — all hands-free.

Patient Mode — A gentle medicine companion that identifies pills through the camera, explains dosages and side effects in plain language, checks drug interactions, and provides reassurance — always reminding patients to consult their doctor for any changes.

Both modes support live voice conversation, camera/screen sharing, image uploads (X-rays, lab reports), AI-generated medical illustrations, and a full clinical dashboard with ESR trends, vital signs charts, lab results, and visit timelines.

How we built it

The core is Gemini 2.5 Flash Native Audio via the Multimodal Live API — a persistent bidirectional stream that processes voice audio (16kHz PCM, 500ms chunks) and camera frames (768×768 JPEG, 2-second intervals) simultaneously, responding with natural speech (24kHz Aoede voice). We use function calling within the Live stream so Gemini can autonomously log clinical notes, raise urgent alerts, update procedure checklists, and trigger image generation — all mid-conversation without breaking the audio flow.
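The pacing above can be made concrete with a little arithmetic. This is a minimal sketch of the stream cadence, not our production sender; the constants come straight from the numbers quoted (16kHz 16-bit mono PCM in 500ms chunks, one 768×768 JPEG every 2 seconds):

```python
# Stream cadence sketch: 16 kHz 16-bit mono PCM in 500 ms chunks,
# one camera frame every 2 s. Constants mirror the figures quoted above.

AUDIO_SAMPLE_RATE = 16_000   # Hz, input PCM
BYTES_PER_SAMPLE = 2         # 16-bit mono
AUDIO_CHUNK_MS = 500         # one audio chunk every 500 ms
FRAME_INTERVAL_MS = 2_000    # one 768x768 JPEG frame every 2 s

def audio_chunk_bytes() -> int:
    """Size of one 500 ms PCM chunk in bytes."""
    return AUDIO_SAMPLE_RATE * BYTES_PER_SAMPLE * AUDIO_CHUNK_MS // 1000

def frames_per_audio_chunk() -> float:
    """Cadence ratio: video frames sent per audio chunk window."""
    return AUDIO_CHUNK_MS / FRAME_INTERVAL_MS

print(audio_chunk_bytes())       # 16000 bytes per chunk
print(frames_per_audio_chunk())  # 0.25 -> one frame every 4 audio chunks
```

In other words, four audio chunks go out for every video frame, which is why the receive side has to tolerate the two modalities arriving out of step.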

For medical illustrations, we integrated Gemini 2.5 Flash Preview Image Generation (Nano Banana) as both an AI-callable tool and a user-triggered feature, producing labeled anatomical diagrams, wound care guides, and injection technique visuals on demand.
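Exposing image generation as an AI-callable tool means declaring it as a function the model can invoke. The sketch below shows the general shape of such a declaration (a JSON-schema-style parameters object, as Gemini function calling uses); the specific fields `topic` and `style` are illustrative assumptions, not our exact production schema:

```python
# Illustrative function declaration for the generate_visual_aid tool.
# Field names under "properties" are assumptions for this sketch.

GENERATE_VISUAL_AID = {
    "name": "generate_visual_aid",
    "description": (
        "Generate a labeled medical illustration (anatomy, wound care, "
        "injection technique) and return it inline in the chat."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "What to illustrate, e.g. 'IV insertion angle'",
            },
            "style": {
                "type": "string",
                "enum": ["anatomical_diagram", "step_guide"],
                "description": "Illustration style (assumed enum).",
            },
        },
        "required": ["topic"],
    },
}
```

Registering the same declaration for both the AI-callable path and the user-triggered button keeps the two entry points behaviorally identical.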

The backend is Flask + Flask-SocketIO (Python), managing async Gemini Live sessions with automatic context-window compression (80k-token sliding window), session resumption handles for network recovery, and frame-drop handling to prevent memory overflow. The frontend is vanilla JavaScript with Tailwind CSS and Chart.js, structured as a 3-page SPA (Setup → Chat → Dashboard). Five realistic synthetic patient records with full clinical histories (vitals, labs, imaging, visit notes, medications, allergies) provide rich demonstration data. Deployed on Google Cloud Run with a Dockerized gunicorn setup.

Challenges we ran into

Multimodal stream synchronization was the hardest problem. Audio, video frames, and text all flow through one Gemini Live connection, but arrive at different cadences — audio every 500ms, video every 2s, text sporadically. Getting barge-in (interrupting the AI mid-sentence) to work cleanly without audio artifacts required careful queue management and a SessionBridge pattern with overflow detection.
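The overflow-detection side of that pattern reduces to a bounded queue that evicts the oldest frame rather than backing up, plus a flush on barge-in. This is a simplified stand-in for the real SessionBridge class, just to show the queue discipline:

```python
# Simplified SessionBridge sketch: a bounded frame queue that drops the
# oldest frame on overflow (and counts drops), plus a flush for barge-in.

from collections import deque

class SessionBridge:
    def __init__(self, max_frames: int = 4):
        self.frames = deque(maxlen=max_frames)  # deque evicts oldest when full
        self.dropped = 0

    def push_frame(self, frame: bytes) -> None:
        if len(self.frames) == self.frames.maxlen:
            self.dropped += 1        # overflow detected: oldest frame will go
        self.frames.append(frame)

    def clear_on_barge_in(self) -> None:
        """On user interruption, flush queued frames so stale content
        isn't processed after the AI stops speaking."""
        self.frames.clear()

bridge = SessionBridge(max_frames=2)
for i in range(5):
    bridge.push_frame(bytes([i]))
print(len(bridge.frames), bridge.dropped)  # 2 3
```

Dropping old frames is safe here because each 2-second frame supersedes the last; dropping audio chunks is not, which is why the two streams need separate queue policies.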

Context window limits were a real constraint for clinical use. A 30-minute nursing session generates massive context (patient records + conversation + procedure steps + clinical notes). We built automatic sliding-window compression that trims context at 100k tokens down to 80k while preserving the most recent and most critical information.
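The trimming logic is roughly this. A minimal sketch, assuming per-turn token counts are supplied by the caller and that critical turns (patient record, alerts) are pinned; the real system measures tokens itself and preserves chronological interleaving:

```python
# Sliding-window compression sketch: past TRIM_AT tokens, keep pinned
# (critical) turns plus the most recent turns until under TARGET.

TRIM_AT = 100_000
TARGET = 80_000

def compress(turns):
    """turns: list of dicts {'tokens': int, 'text': str, 'pinned': bool}."""
    total = sum(t["tokens"] for t in turns)
    if total <= TRIM_AT:
        return turns
    kept = [t for t in turns if t["pinned"]]          # critical info survives
    budget = TARGET - sum(t["tokens"] for t in kept)
    recent = []
    for t in reversed([t for t in turns if not t["pinned"]]):
        if t["tokens"] <= budget:                     # newest first
            recent.append(t)
            budget -= t["tokens"]
        else:
            break
    # Placing pinned turns first is a simplification of the real ordering.
    return kept + list(reversed(recent))
```

The key property is that trimming is deterministic and biased toward recency, so the model never loses the current procedure step or the pinned patient record mid-session.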

Function calling inside a Live stream behaves differently from standard Gemini API calls. Tool responses must be sent back through the same stream, and a single unhandled error in a tool handler can crash the entire audio session. Getting generate_visual_aid to work (it calls a separate Gemini model mid-stream) required async orchestration between two simultaneous Gemini connections.
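The orchestration pattern looks roughly like this: tool calls that need a second model connection run as background tasks so the audio receive loop is never blocked, and every handler is wrapped so a tool failure becomes an error response rather than a crashed session. `call_image_model` and the queue here are simplified stand-ins for the real stream plumbing:

```python
# Async tool-dispatch sketch: run slow tools as background tasks and
# route results (or caught errors) back toward the Live stream via a queue.

import asyncio

async def call_image_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the second Gemini connection
    return f"image for: {prompt}"

async def handle_tool_call(name, args, out_queue):
    try:
        if name == "generate_visual_aid":
            result = await call_image_model(args["topic"])
        else:
            result = {"error": f"unknown tool {name}"}
    except Exception as exc:            # a tool error must not kill the stream
        result = {"error": str(exc)}
    await out_queue.put((name, result))  # response re-enters the same stream

async def main():
    out = asyncio.Queue()
    # Dispatch without awaiting inline, so audio keeps flowing meanwhile.
    task = asyncio.create_task(
        handle_tool_call("generate_visual_aid", {"topic": "IV angle"}, out))
    name, result = await out.get()
    await task
    print(name, "->", result)

asyncio.run(main())
```

The catch-all around the handler is the load-bearing part: before we added it, any exception raised while a tool ran would take the whole audio session down with it.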

Clinical safety guardrails required extensive prompt engineering. The AI must be helpful enough to guide a procedure but never overstep — never prescribe, never adjust dosages, always escalate. Balancing helpfulness with safety for both nurse and patient modes was an iterative process.

Accomplishments that we're proud of

1) Hands-free procedure verification: A nurse can perform an IV cannulation while MediSense watches through the camera and automatically marks each step as verified/warned/flagged — no touching the screen required.

2) Sub-second voice response: The entire loop (nurse speaks → Gemini processes audio + video → AI responds in natural speech) happens in under a second, enabling truly natural clinical conversation.

3) Autonomous clinical reasoning: The AI proactively computes NEWS2 and qSOFA scores from loaded patient vitals, raises urgent alerts when thresholds are breached, and generates structured differential diagnoses and SBAR handover notes — without being asked.

4) Dual-mode architecture: Switching between Nurse and Patient modes swaps the entire system prompt, UI, and available features, making one platform serve two very different user populations.

5) AI-generated medical illustrations: A nurse can say "Show me the proper angle for IV insertion" and receive a labeled anatomical diagram seconds later, inline in the chat.
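The risk scoring behind (3) is concrete enough to show. Here is a qSOFA calculator following the published bedside criteria (one point each for respiratory rate ≥ 22/min, systolic BP ≤ 100 mmHg, and altered mentation, i.e. GCS < 15; a score of 2 or more flags high sepsis risk). NEWS2 works the same way with more parameters and banded thresholds:

```python
# qSOFA per the standard published criteria: three binary checks,
# one point each; a total of 2+ triggers an urgent alert.

def qsofa(resp_rate: float, systolic_bp: float, gcs: int) -> int:
    """Respiratory rate >= 22/min, systolic BP <= 100 mmHg, and
    altered mentation (Glasgow Coma Scale < 15) score one point each."""
    score = 0
    if resp_rate >= 22:
        score += 1
    if systolic_bp <= 100:
        score += 1
    if gcs < 15:
        score += 1
    return score

def should_alert(score: int) -> bool:
    """A qSOFA of 2 or more raises an urgent sepsis alert."""
    return score >= 2

print(qsofa(resp_rate=24, systolic_bp=95, gcs=15))  # 2 -> alert
```

Because the inputs are just the vitals already loaded into the session, the model can call a tool like this proactively whenever new readings arrive.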

What we learned

1) Multimodal Live API is transformative for healthcare: The ability to process voice + camera simultaneously with sub-second latency fundamentally changes what's possible. Healthcare workers need their hands free — this API makes hands-free AI assistance real.

2) Clinical safety requires defense in depth: System prompts alone aren't enough. We needed prominent UI disclaimers, auto-escalation thresholds, allergy highlighting at every layer, and mode-specific guardrails.

3) Synthetic patient data is essential for healthcare demos: Realistic patients with trending labs, imaging reports, and visit histories make the difference between a toy demo and a convincing clinical tool.

What's next for MediSense

1) EHR Integration: Connect to real Electronic Health Record systems (FHIR/HL7) to pull live patient data instead of synthetic records.

2) Offline-first mode: Cache procedure checklists and basic clinical guidance for areas with intermittent connectivity, syncing clinical logs when back online.

3) Wearable device integration: Stream vitals directly from pulse oximeters, BP monitors, and glucometers instead of manual entry.

Built With

Gemini 2.5 Flash Native Audio (Multimodal Live API), Gemini 2.5 Flash Preview Image Generation, Flask, Flask-SocketIO, Python, JavaScript, Tailwind CSS, Chart.js, gunicorn, Docker, Google Cloud Run