About MedNova

What Inspired This

I've had a few conversations with doctors over the years and one thing that comes up consistently isn't about diagnosis or treatment. It's about the sheer administrative weight of the job. Charting. Looking things up. Cross-referencing. Writing the same information in three different places in three different formats.

The handoff note in particular stuck with me. At the end of a shift, a physician has to hand over their patients to the incoming team. They're tired. They're doing this quickly. And they're writing a summary largely from memory, pulling from whatever they can still hold in their head after twelve hours. The EHR has all the data. The physician still has to do the work of synthesizing it manually.

That felt like a gap worth trying to close. Not with prediction or automation, but with something simpler: a system that could answer a specific question about a specific patient using data that already exists, without making anything up.

That's what we tried to build.


How We Built It

We started with the MIMIC-IV Clinical Database Demo, a de-identified set of real ICU records from Beth Israel Deaconess Medical Center covering 100 patients. MIMIC is the standard public benchmark for clinical NLP and informatics research. It has structured tables for admissions, diagnoses, lab results, vitals, and medications, plus unstructured discharge summaries and radiology reports. That combination of structured and unstructured data made it the right substrate for what we wanted to test.

The first thing we built was the ETL pipeline. Raw MIMIC CSVs go in, a clean PostgreSQL schema comes out: six tables with proper indexes and anomaly flags on vitals. The anomaly detection is rule-based: a lookup table of clinical thresholds (heart rate, blood pressure, O2 saturation, temperature, respiratory rate) and a boolean flag on each row. Simple, but it means the downstream query layer doesn't have to recompute it every time.

Vitals are flagged as anomalous when:
  Heart Rate        <  50  or  > 120 bpm
  BP Systolic       <  90  or  > 160 mmHg
  Respiratory Rate  <  10  or  >  28 /min
  O2 Saturation     <  90%
  Temperature       <  96  or  > 101.5 °F
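The flagging logic itself is tiny. A minimal sketch, assuming hypothetical names for the lookup table and flag function (the real ETL code may differ):

```python
# Clinical thresholds from the table above: (low, high); None means no bound.
VITAL_THRESHOLDS = {
    "heart_rate":       (50, 120),     # bpm
    "bp_systolic":      (90, 160),     # mmHg
    "respiratory_rate": (10, 28),      # breaths/min
    "o2_saturation":    (90, None),    # %, lower bound only
    "temperature":      (96.0, 101.5), # °F
}

def is_anomalous(vital_name: str, value: float) -> bool:
    """Return True if the value falls outside the clinical thresholds."""
    low, high = VITAL_THRESHOLDS[vital_name]
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False
```

Because the boolean lands on the row at ETL time, the query layer can filter on it directly instead of re-evaluating thresholds on every request.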

The backend is FastAPI. The AI layer runs on Amazon Bedrock, specifically Amazon Nova Lite for the reasoning model and Amazon Titan Multimodal Embeddings for embedding clinical notes into OpenSearch Serverless for vector search.

The core of the system is an orchestrator that implements a ReAct loop: reason, act, observe, repeat. The LLM decides what tool to call, the tool runs against the database, the result comes back as a typed object, and the LLM decides what to do next. We capped it at 12 steps, which in practice is never reached. Most queries resolve in 2 or 3.
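The loop shape is simple to sketch. A minimal version, with illustrative message and decision formats that are not the actual orchestrator's types:

```python
MAX_STEPS = 12  # hard cap; most queries resolve in 2-3 steps

def react_loop(query, llm, tools):
    """Reason-act-observe: the LLM picks a tool, the tool runs against
    the database, the typed result is appended to the history, and the
    LLM decides the next step or emits a final answer."""
    history = [{"role": "user", "content": query}]
    for _ in range(MAX_STEPS):
        decision = llm(history)                   # reason
        if decision["type"] == "final":
            return decision["answer"]
        tool = tools[decision["tool"]]            # act
        observation = tool(**decision["args"])
        history.append({"role": "tool", "content": observation})  # observe
    return "Step limit reached without a final answer."
```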

The piece we spent the most time on was the streaming architecture. Clinical queries often produce multiple outputs: a chart, a comparison table, a set of alerts, and then a text summary. We wanted these to arrive in the chat stream in order, not all at once at the end. That required a metadata event layer running alongside the token stream over Server-Sent Events. The frontend receives two interleaved event types: metadata (for charts, comparisons, alert payloads) and token (for the text). Getting this to feel fluid took longer than we expected.
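In SSE terms, the two event types are just differently labeled frames on one stream. A sketch of the generator side, with illustrative names (the real event schema may differ):

```python
import json

def sse_events(chunks):
    """Interleave metadata events (charts, comparisons, alerts) with
    token events on a single Server-Sent Events stream, in the order
    the orchestrator produces them."""
    for chunk in chunks:
        if chunk["kind"] == "metadata":
            yield f"event: metadata\ndata: {json.dumps(chunk['payload'])}\n\n"
        else:
            yield f"event: token\ndata: {json.dumps({'text': chunk['payload']})}\n\n"
```

The frontend subscribes to both event names and renders each frame as it arrives, which is what lets a chart appear mid-stream rather than after the text finishes.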

The frontend is React with Recharts for native chart rendering. We made a deliberate decision early on to not use matplotlib or any server-side image generation. Every visualization is a ChartSpec JSON object emitted by the backend and rendered in the browser. Faster, no file I/O, and the charts are interactive.
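To make the contract concrete, here is a hypothetical ChartSpec payload; the field names are illustrative, not the actual schema:

```python
# A hypothetical ChartSpec object as the backend might emit it.
# Recharts on the frontend maps this JSON directly to chart components.
chart_spec = {
    "type": "line",
    "title": "Heart rate over encounter",
    "x": {"field": "hours_since_admission", "label": "Hours"},
    "y": {"field": "heart_rate", "label": "bpm"},
    "series": [
        {"label": "Patient A", "points": [[0, 88], [4, 131], [8, 95]]},
    ],
}
```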


What We Learned

The most important thing we learned was about the failure mode we were most afraid of, and how to prevent it architecturally.

LLMs are extraordinarily good at sounding correct. In a general-purpose context that's mostly fine. The cost of a plausible-sounding but wrong answer about, say, the capital of a country is low. In a clinical context the cost structure is completely different. A hallucinated creatinine value or a fabricated drug interaction warning can cause real harm.

The solution we landed on was to make it structurally impossible for the LLM to generate clinical numbers. Every patient-specific numeric value in the system flows through a tool call that returns a typed SQLQueryResult object. The LLM builds its response from that object. If the object is empty, the response says so. There is no path by which the model can substitute a plausible-sounding value from its training data.
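The shape of that contract can be sketched as follows; SQLQueryResult is the name from our system, but the fields and the render helper here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class SQLQueryResult:
    """Typed container for tool output; values stay attached to their source query."""
    sql: str
    columns: list = field(default_factory=list)
    rows: list = field(default_factory=list)

    @property
    def is_empty(self) -> bool:
        return not self.rows

def render_claim(result: SQLQueryResult, column: str) -> str:
    """The response layer only ever reads values out of the result object.
    An empty result produces an explicit 'no data' statement, never a guess."""
    if result.is_empty:
        return "No data found for this query."
    idx = result.columns.index(column)
    return str(result.rows[0][idx])
```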

The math behind why this matters is worth stating clearly. If we model a response as a sequence of $n$ clinical claims $c_1, c_2, \ldots, c_n$, and each claim has a hallucination probability $p_h$ when generated by the LLM directly, then the probability that at least one claim in the response is hallucinated is:

$$P(\text{at least one error}) = 1 - (1 - p_h)^n$$

For $p_h = 0.05$ (a generous assumption) and $n = 10$ claims in a single response, that's approximately a 40% chance of at least one fabricated value reaching the physician. Grounding every claim to a database row drives $p_h$ toward zero for structured data. The residual risk lives in the LLM's interpretation and synthesis of correct values, which is a much narrower and more manageable problem.
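The arithmetic is quick to verify:

```python
def p_at_least_one_error(p_h: float, n: int) -> float:
    """P(at least one hallucinated claim) = 1 - (1 - p_h)^n."""
    return 1 - (1 - p_h) ** n

# With p_h = 0.05 and n = 10 claims per response:
# p_at_least_one_error(0.05, 10) ≈ 0.401, i.e. roughly a 40% chance.
```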

The second thing we learned was about alert fatigue as a design problem, not a data problem. We built a proactive monitor that scans for critical lab values and drug interactions after every response. First version: it surfaced everything it found, inline, in the text. It was overwhelming. We refactored it to render alerts as structured cards separate from the prose, capped at five per response, severity-ordered. The information was the same. The usability was completely different.
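The cap-and-order step is a few lines. A sketch with illustrative severity labels (the real monitor's taxonomy may differ):

```python
# Illustrative severity ranking; lower rank = shown first.
SEVERITY_RANK = {"critical": 0, "high": 1, "moderate": 2, "low": 3}
MAX_ALERTS = 5  # cap per response to avoid alert fatigue

def select_alerts(alerts):
    """Severity-order the monitor's findings and cap at five per response."""
    ranked = sorted(alerts, key=lambda a: SEVERITY_RANK[a["severity"]])
    return ranked[:MAX_ALERTS]
```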

The third thing: intent classification without an LLM call is fast enough and good enough for routing. We spent time debating whether to use the LLM to classify the intent of each message before entering the main loop. The latency cost wasn't worth it. A 200-line heuristic function that checks for keywords and patterns ("compare", "chart", "all patients", specific patient IDs) correctly routes probably 85-90% of queries. The remaining cases get routed to a general path that works fine. Saving 400-600ms on every message is worth a lot in a tool people are using interactively.
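A compressed sketch of that kind of router, with illustrative intent names and patterns rather than the actual 200-line function:

```python
import re

def classify_intent(message: str) -> str:
    """Keyword/pattern routing; anything unmatched falls through to the
    general path, which handles it correctly, just less directly."""
    text = message.lower()
    if "compare" in text or " vs " in text:
        return "comparison"
    if any(kw in text for kw in ("chart", "plot", "trend")):
        return "visualization"
    if "all patients" in text:
        return "cohort"
    if re.search(r"\bpatient\s+\d+\b", text):
        return "single_patient"
    return "general"
```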


The Challenges

The drug interaction problem was harder than expected. We used the OpenFDA label database, a free, well-maintained public API. The problem is that MIMIC drug names are often abbreviated, brand names, or institutional shorthand. OpenFDA indexes by generic name. "D5W" doesn't match "dextrose". "Vanc" doesn't match "vancomycin". Getting useful interaction data required normalization through RxNorm, which itself is another API call with its own matching issues. We got it working for common drugs, but the coverage is incomplete and we documented that honestly.
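The first normalization pass is a local alias table that resolves institutional shorthand before anything hits RxNorm. A sketch, using the two examples above (a real table is much larger, and unresolved names would still need the RxNorm lookup):

```python
# Shorthand -> generic name. Only the examples from the text; illustrative.
DRUG_ALIASES = {
    "d5w": "dextrose",
    "vanc": "vancomycin",
}

def normalize_drug_name(raw: str) -> str:
    """First pass: lowercase, strip, resolve known aliases.
    Names not in the table fall through unchanged to the RxNorm step."""
    name = raw.strip().lower()
    return DRUG_ALIASES.get(name, name)
```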

The MIMIC de-identification creates subtle gaps. Dates in MIMIC are shifted to protect patient identity. The shifts are consistent within a patient but not across patients, so you can compute relative time (days since admission, trend over an encounter) but not absolute time. This is fine for most queries but it means any feature that tries to correlate across patients by calendar time doesn't work the way you'd expect.
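Concretely, the safe operation is a delta within one patient's timeline. A minimal sketch:

```python
from datetime import datetime

def days_since_admission(admit_time: datetime, chart_time: datetime) -> float:
    """Relative time is safe in MIMIC: the date shift is consistent within
    a patient, so deltas are real even though the absolute dates are not.
    Comparing chart_time across two different patients is NOT meaningful."""
    return (chart_time - admit_time).total_seconds() / 86400
```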

Multi-provider LLM consistency is a real problem we underestimated. The system supports six providers switchable at runtime. Late in the build we discovered that the same query to the same patient data could produce meaningfully different responses depending on which model was active. Not different values (those come from the database) but different clinical interpretations and emphasis. For a demo, that's fine. For a real deployment, you'd want to standardize on a single model that you've evaluated on clinical reasoning tasks and not let users switch freely.

The comparison tool's SQL is PostgreSQL-specific. The ANY(:ids) syntax and ILIKE operator don't exist in SQLite. We have a SQLite fallback in the config for development without a Postgres instance, but the comparison features silently break on it. We only caught this late and didn't have time to fix it cleanly.
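If we fixed it, the rewrite would look something like this: expand the ID list into plain placeholders and replace ILIKE with LOWER/LIKE, both of which SQLite understands. A sketch with an assumed table and column, not our actual query:

```python
# PostgreSQL-specific version, roughly as the comparison tool writes it:
PG_SQL = ("SELECT * FROM diagnoses "
          "WHERE subject_id = ANY(:ids) AND long_title ILIKE :pat")

def portable_sql(ids):
    """Dialect-portable rewrite: expand the ID list into numbered
    placeholders and emulate ILIKE with LOWER(...) LIKE LOWER(...)."""
    placeholders = ", ".join(f":id{i}" for i in range(len(ids)))
    params = {f"id{i}": v for i, v in enumerate(ids)}
    sql = (f"SELECT * FROM diagnoses WHERE subject_id IN ({placeholders}) "
           "AND LOWER(long_title) LIKE LOWER(:pat)")
    return sql, params
```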

Getting the streaming order right under load was tricky. Under normal conditions the metadata events (chart, comparison) arrive before the text tokens and the UI renders them in order. Under load, or with a slow provider, the token stream can start before the metadata event is processed. The frontend handles this with a message queue, but there were several hours of debugging edge cases where a chart would render in the wrong position in the conversation.


The Honest Part

We built something that works on a demo dataset. A hundred de-identified ICU patients from one hospital in Boston. That's a long way from a validated clinical tool.

The parts we're genuinely proud of: the grounding architecture, the streaming pipeline, the way the comparison tool builds a structured diff rather than just running two separate queries. These are engineering decisions we'd make the same way again.

The parts we'd do differently: we'd spend more time on the failure path, meaning what happens when the system can't answer a question, and how it communicates that clearly enough that a physician knows what to do next. We'd invest in the drug normalization layer earlier. And we'd think harder about whether multi-provider support is actually a feature or a liability in a clinical context.

Regulatory compliance, clinical validation, EHR integration, institutional trust: none of these are things a hackathon can address. We tried not to pretend otherwise. The README has a table of everything that's deferred or incomplete. The blog post says clearly that this isn't ready for clinical use. That honesty felt important to maintain throughout.


MIMIC-IV data used under the PhysioNet Credentialed Health Data License. MedNova is a research prototype and is not intended for clinical use.
