Inspiration
Physicians in developing countries such as Pakistan and India, and across Sub-Saharan Africa, are often the only doctors serving thousands of patients. Despite this crushing load, they spend 35–40% of their time on documentation — typing the same structured data into Electronic Health Record (EHR) systems, over and over, instead of treating patients.
We asked: what if an AI agent could listen to a doctor describe a patient, see the uploaded scans and lab reports, reason through the clinical picture, and then act on the EHR system automatically — all in real time?
That question became MediNova.
What it does
MediNova is a multimodal, multi-agent clinical intelligence system built entirely on Amazon Nova. A doctor speaks a patient case description in natural language. The system simultaneously:
- Transcribes and understands the voice input via Amazon Nova 2 Sonic (speech-to-speech, with crossmodal text/voice switching)
- Retrieves similar past cases using Amazon Nova Multimodal Embeddings — searching a knowledge base using both the spoken description and an uploaded ECG/X-ray image as a unified cross-modal query
- Reasons through differential diagnoses, flags drug interactions, and identifies missing workup items using Amazon Nova 2 Lite with extended thinking enabled at medium budget — the reasoning trace is surfaced to the doctor in real time
- Automatically fills the EHR encounter form on the hospital's web portal using Amazon Nova Act, with >90% task reliability at scale
- Reads the clinical summary back to the doctor via Nova 2 Sonic voice output, completing the full voice-in, voice-out loop
The result: a complete clinical encounter — from spoken description to structured EHR entry — in under 60 seconds, with the AI's full reasoning chain visible and auditable.
How we built it
Architecture
MediNova uses a three-agent orchestration pattern built on the Strands Agents SDK deployed to Amazon Bedrock AgentCore:
Agent 1 — Intake Agent (Nova 2 Sonic + Nova Multimodal Embeddings)
- Receives bidirectional audio stream via LiveKit integration (full-duplex, voice activity detection included)
- Processes crossmodal input: spoken case description + uploaded images (ECG, X-ray, lab PDFs)
- Generates unified embeddings via Nova Multimodal Embeddings and queries an Amazon OpenSearch Service vector index of 500+ synthetic past cases — retrieving the top-3 similar cases by image+text combined similarity
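
To make the retrieval step concrete, here is a minimal sketch of how the top-3 k-NN query against the past-cases index might be built. The vector field name (`case_vector`) and the query shape are illustrative assumptions, not the exact schema we'd publish; the unified vector itself would come from a Nova Multimodal Embeddings call over the audio transcript plus the uploaded image.

```python
# Sketch of Agent 1's retrieval step. "case_vector" is a hypothetical
# field name for the stored Nova Multimodal Embeddings vectors; the
# actual index mapping would be defined at ingestion time.

def build_knn_query(query_vector: list[float], k: int = 3) -> dict:
    """Build an OpenSearch k-NN query for the top-k similar past cases."""
    return {
        "size": k,
        "query": {
            "knn": {
                "case_vector": {  # hypothetical vector field name
                    "vector": query_vector,
                    "k": k,
                }
            }
        },
    }

# The real query_vector would be the unified text+image embedding.
query = build_knn_query([0.12, -0.03, 0.41], k=3)
```

The same query body works against both OpenSearch Service and OpenSearch Serverless vector collections, which is why the retrieval code doesn't care which flavor backs the index.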
Agent 2 — Reasoning Agent (Nova 2 Lite)
- Model ID: `us.amazon.nova-2-lite-v1:0`
- Extended thinking enabled at `medium` budget — exposes the step-by-step reasoning trace
- Uses built-in web grounding tool to pull live clinical guidelines (CDC, WHO) as context during reasoning
- Uses built-in code interpreter to run basic statistical risk scoring (HEART score, Wells criteria)
- Returns structured JSON: primary diagnosis, differential, risk flags, recommended workup, proposed EHR note
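
The risk scores the code interpreter computes are plain arithmetic. As an illustration, here is the standard HEART score for chest-pain risk stratification (the function name and signature are ours, not part of any Nova API):

```python
def heart_score(history: int, ecg: int, age_years: int,
                num_risk_factors: int, troponin_x_normal: float) -> tuple[int, str]:
    """HEART score (0-10) for chest-pain risk stratification.

    history, ecg: clinician-assigned sub-scores, each 0-2.
    num_risk_factors: count of cardiovascular risk factors.
    troponin_x_normal: troponin as a multiple of the upper normal limit.
    """
    age = 0 if age_years < 45 else (1 if age_years < 65 else 2)
    risk = 0 if num_risk_factors == 0 else (1 if num_risk_factors <= 2 else 2)
    trop = 0 if troponin_x_normal <= 1 else (1 if troponin_x_normal <= 3 else 2)
    total = history + ecg + age + risk + trop
    band = "low" if total <= 3 else ("moderate" if total <= 6 else "high")
    return total, band

# 58-year-old, moderately suspicious history, non-specific ECG changes,
# three risk factors, normal troponin:
score, band = heart_score(2, 1, 58, 3, 0.8)  # -> (6, "moderate")
```

Running this in the model's code interpreter rather than asking the LLM to "do the math" in prose keeps the arithmetic deterministic and auditable.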
Agent 3 — Automation Agent (Nova Act)
- Receives structured output from the reasoning agent
- Opens the EHR web interface and autonomously fills: Chief Complaint, HPI, Assessment & Plan, and Orders fields
- Deployed via Nova Act IDE extension → Amazon ECR → Bedrock AgentCore Runtime
- Human-in-the-loop escalation built in: any field with confidence < 0.85 is flagged for doctor confirmation before submission
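
The escalation gate itself is a simple threshold check over the automation agent's per-field confidence. A minimal sketch (field names and dict shape are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.85  # threshold from our human-in-the-loop policy

def fields_needing_review(filled_fields: dict[str, dict]) -> list[str]:
    """Return the EHR field names whose fill confidence falls below threshold."""
    return [name for name, field in filled_fields.items()
            if field["confidence"] < CONFIDENCE_THRESHOLD]

# Example: one confidently filled field, one flagged for the doctor.
fields = {
    "Chief Complaint": {"value": "chest pain, 2h onset", "confidence": 0.97},
    "Orders": {"value": "troponin, 12-lead ECG", "confidence": 0.62},
}
needs_review = fields_needing_review(fields)  # -> ["Orders"]
```

Only flagged fields block submission; everything above threshold is committed automatically, which keeps the doctor's confirmation step short.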
Stack
- Orchestration: Strands Agents SDK with `swarm`, `use_agent`, and `think` tools
- Voice: Amazon Nova 2 Sonic (`amazon.nova-2-sonic-v1:0`) + LiveKit Agents framework
- Reasoning: Amazon Nova 2 Lite with `reasoningConfig: enabled`, `maxReasoningEffort: medium`
- Multimodal search: Nova Multimodal Embeddings + Amazon OpenSearch Serverless
- UI automation: Amazon Nova Act (playground → VS Code extension → Bedrock AgentCore)
- Storage: Amazon S3 for scan uploads, Amazon S3 Vectors for embedding store
- Backend: Python 3.12, FastAPI
- Frontend: React + LiveKit JS SDK for the voice interface
Challenges we ran into
Cross-modal embedding alignment: Getting meaningful similarity scores when the query is a spoken sentence but the indexed content contains both text and image embeddings required careful normalization of the embedding space. Nova Multimodal Embeddings handles this natively through its unified vector space, but tuning the retrieval threshold for clinical relevance (where a false positive is dangerous) required careful calibration.
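
The calibration boils down to ranking by similarity and then refusing to return anything below a clinical-relevance floor, even if it is in the top-k. A pure-Python sketch (the 0.75 threshold is a placeholder, not our tuned value):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: dict[str, list[float]],
             threshold: float = 0.75, k: int = 3) -> list[str]:
    """Top-k cases by similarity, dropping anything below the relevance floor.

    Returning fewer than k results is deliberate: in a clinical setting,
    a weak match is worse than no match.
    """
    scored = sorted(((cosine(query_vec, v), cid) for cid, v in index.items()),
                    reverse=True)
    return [cid for score, cid in scored[:k] if score >= threshold]

index = {"case-a": [1.0, 0.0], "case-b": [0.0, 1.0], "case-c": [0.9, 0.1]}
hits = retrieve([1.0, 0.0], index)  # -> ["case-a", "case-c"]
```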
Extended thinking latency vs. UX: When Nova 2 Lite's extended thinking is enabled at medium budget, the reasoning trace can take 8–12 seconds on complex cases. We solved this by streaming the thinking text to the UI in real time — so the doctor watches the AI think rather than staring at a loading spinner. This turned a UX problem into a feature.
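
The streaming itself is just server-sent-event framing around the model's thinking chunks. A minimal sketch, where `reasoning_chunks()` stands in for the model's streamed thinking output:

```python
from typing import Iterator

def reasoning_chunks() -> Iterator[str]:
    """Stand-in for the streamed extended-thinking text from the model."""
    yield "Chest pain with exertion: considering ACS vs. PE vs. GERD..."
    yield "ST depression in V4-V6 raises suspicion for ischemia..."

def sse_stream(chunks: Iterator[str]) -> Iterator[str]:
    """Wrap each thinking chunk in server-sent-event framing for the UI."""
    for chunk in chunks:
        yield f"data: {chunk}\n\n"

events = list(sse_stream(reasoning_chunks()))
```

In FastAPI, this generator would be returned via a `StreamingResponse` with `media_type="text/event-stream"`, so the React frontend renders each chunk the moment it arrives.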
Nova Act reliability on dynamic EHR forms: Real EHR interfaces are notoriously inconsistent — fields appear conditionally, dropdowns load asynchronously, and session timeouts are aggressive. We used Nova Act's notebook-style builder in the IDE extension to test and harden each step individually, and built in retry logic with human escalation for steps that failed twice.
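
The retry-then-escalate policy is deliberately simple. A sketch of the wrapper we put around each automation step (the dict result shape is ours):

```python
from typing import Callable, Any

def run_with_escalation(step: Callable[[], Any], max_retries: int = 2) -> dict:
    """Run a browser-automation step; escalate to a human after two failures."""
    for _attempt in range(max_retries):
        try:
            return {"status": "ok", "result": step()}
        except Exception:
            continue  # e.g. field not rendered yet, stale session
    return {"status": "escalated"}  # surfaced to the doctor for manual entry

# A step that always fails is escalated after exactly two attempts:
attempts = {"n": 0}
def flaky() -> str:
    attempts["n"] += 1
    raise RuntimeError("field not found")

result = run_with_escalation(flaky)  # -> {"status": "escalated"}
```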
Voice + reasoning latency in a single pipeline: Chaining Nova Sonic (real-time stream) → Nova 2 Lite (reasoning, ~10s) → Nova Act (browser automation, ~20s) → Nova Sonic (output) creates a ~35 second end-to-end pipeline. We parallelized the intake and EHR pre-loading steps to bring perceived latency under 20 seconds for the doctor.
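
The parallelization is standard `asyncio.gather` over the two independent steps. A sketch with placeholder coroutines standing in for the real retrieval and Nova Act calls:

```python
import asyncio

async def retrieve_similar_cases() -> list[str]:
    await asyncio.sleep(0)  # placeholder for the embedding + search call
    return ["case-102", "case-337", "case-410"]

async def preload_ehr_session() -> str:
    await asyncio.sleep(0)  # placeholder for Nova Act opening the encounter form
    return "ehr-session-ready"

async def intake() -> tuple[list[str], str]:
    # Run both steps concurrently; total latency is max() of the two,
    # not their sum, which is where the perceived-latency win comes from.
    cases, session = await asyncio.gather(
        retrieve_similar_cases(), preload_ehr_session()
    )
    return cases, session

cases, session = asyncio.run(intake())
```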
Accomplishments that we're proud of
- First known use of Nova Multimodal Embeddings for cross-modal clinical search — querying a patient database with an ECG image as the input, not text keywords
- Full four-model Nova integration in a single coherent system: Sonic + 2 Lite + Act + Multimodal Embeddings
- Auditable AI reasoning — the extended thinking trace is stored per encounter as a compliance artifact, answering the "why did the AI say that?" question that blocks clinical AI adoption
- Sub-60-second full encounter loop from voice description to completed EHR entry
- Polyglot support via Nova 2 Sonic — the same system works in English, Urdu/Hindi, Spanish, and Portuguese without model switching, making it viable for global deployment
What we learned
- Amazon Nova 2 Lite's built-in web grounding and code interpreter tools are dramatically underutilized — combining them with extended thinking creates a reasoning agent that can cite sources and verify its own outputs
- Strands Agents' `swarm` primitive makes it trivially easy to spawn parallel sub-agents for concurrent tasks — we used this to run EHR pre-population in parallel with the reasoning trace
- Nova Act's "web gym" for testing agents is genuinely one of the most thoughtful developer experiences in the agentic AI ecosystem — prototype in the playground, harden in the IDE, ship to production in one click
- The crossmodal feature in Nova 2 Sonic (switching between text and voice mid-session) unlocks interaction patterns that no prior voice AI system supported — a doctor can type a complex medication name and speak everything else
What's next for MediNova
- Integration with real EHR systems: Epic and OpenMRS (open-source, used across the developing world) are the first targets
- Fine-tuning on clinical data: Nova 2 Lite's customization support on Amazon Bedrock and SageMaker AI means we can fine-tune on de-identified clinical notes for specialty-specific reasoning (cardiology, emergency medicine)
- Fleet deployment: Nova Act's ability to manage fleets of agents means one MediNova instance could handle documentation for an entire hospital ward simultaneously
- Regulatory pathway: Pursuing FDA 510(k) exemption as a clinical decision support tool (not a diagnostic) — the auditable reasoning trace is specifically designed to support this classification
- Community health workers: A simplified voice-only version for community health workers in rural Pakistan and India who have smartphones but no formal medical training — Nova Sonic's Hindi and Urdu support makes this viable today