Inspiration

Same clinic visit, same patient — the audio and the labs belong together. Shipping two separate products would throw away the link. So we didn't.

Second seed: we capped ourselves at small, laptop-runnable models. A trillion-parameter model gives you the illusion of a great system. A small one that works tells you you've built a real one.

What it does

One playground, three modes:

  • Scribe — audio → clinician report (JSON + PDF).
  • Lab — PDF → patient-friendly explanation with severity tiers and next steps.
  • Combined — both inputs, both outputs, cross-referenced. The clinician's lab section comes from the real PDF. The patient's narrative is grounded in what the doctor actually said.

Every stage is inspectable — transcript, intermediate JSON, validation tool calls, sampler knobs per stage. For clinical AI, trust comes from looking inside, not from a glossy PDF.

Works natively in EN / FR / AR / VN. Auto-detected, generated in-language, not translated after the fact.

How we built it

The spine is Ingestion → Extraction → Generation (+ Validation). The stance we converged on: LLM-advisory, Python-authoritative — the model proposes; Python owns shape, gating, and severity. Most of the tuning work was deciding exactly where that split sits.
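
A minimal sketch of that stance, with an illustrative schema rather than our real one: the model's JSON is only a proposal, and Python decides what parses, what gates, and what gets flagged.

```python
from pydantic import BaseModel, ValidationError

class LabRow(BaseModel):          # Python owns the shape
    test: str
    value: float
    unit: str
    ref_low: float | None = None
    ref_high: float | None = None

def accept(proposed_rows: list[dict]) -> list[LabRow]:
    rows = []
    for raw in proposed_rows:     # whatever the LLM proposed
        try:
            rows.append(LabRow(**raw))
        except ValidationError:
            continue              # malformed proposals never reach the report
    return rows

def flag(row: LabRow) -> str:     # Python owns gating and severity
    if row.ref_low is not None and row.value < row.ref_low:
        return "low"
    if row.ref_high is not None and row.value > row.ref_high:
        return "high"
    return "normal"
```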

Things worth calling out (most get a short sketch in code right after the list):

  • Qwen for every LLM call. qwen3.5:9b for text, qwen2.5vl:7b for scanned-PDF OCR. Same OpenAI-compatible code runs against Ollama locally or DashScope in the cloud. (Whisper + pyannote do the audio preprocessing.)
  • Event-sourced runs, no runs table. Each run is a directory with an append-only events.jsonl; state is a pure fold. Free interrupt detection, smart resume, replay.
  • DSPy mid-hackathon rescue. Our 17-field generator kept collapsing into degenerate reasoning loops. DSPy's field-by-field adapter forced local targets per field and unstuck the hang — without us rewriting a single prompt.
  • Mirostat for extraction stability. Top-p/top-k couldn't catch alternating-sentence loops on long outputs. Mirostat (tau=4.0) targets constant perplexity. One knob, one class of failures gone.
  • ReAct validator with 5 tools — OpenFDA, LOINC, ICD-10-CM (NLM), MedlinePlus, RAG. Every tool call renders as a badge in the UI: the auditable trail from "the LLM said X" to "the API confirmed X."
  • Severity in two stages. Python picks which rows to surface (all abnormals + flagship tests). A dedicated LLM call assigns tier and writes the explanation. The model never chooses what's shown, only how to describe it.
  • Matrix-gated iteration. Every change had to clear all three modes end-to-end plus a dimension review using Elfie's own rubric. No vibes, no proxy metrics.
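
The client switch from the Qwen bullet, roughly. The env-var name and the DashScope model id are placeholders; the endpoints are the standard Ollama and DashScope OpenAI-compatible ones.

```python
import os
from openai import OpenAI

# Same client code either way; only the endpoint and model tag change.
if os.getenv("EIR_BACKEND", "local") == "local":          # env-var name is illustrative
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama's OpenAI-compatible API
    model = "qwen3.5:9b"                                   # our local text model tag
else:
    client = OpenAI(
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",        # DashScope compatible mode
        api_key=os.environ["DASHSCOPE_API_KEY"],
    )
    model = "qwen2.5-7b-instruct"                          # illustrative DashScope model id

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Extract the lab rows from this report as JSON: ..."}],
)
print(resp.choices[0].message.content)
```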
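
The run store is just a fold. Event names here are illustrative, but the shape is the real idea: no status column, only an append-only log.

```python
import json
from pathlib import Path

def fold_state(run_dir: Path) -> dict:
    """Rebuild run state by folding over the append-only log. No status column to drift."""
    state = {"stages": {}, "done": False}
    events_path = run_dir / "events.jsonl"
    if not events_path.exists():
        return state
    for line in events_path.read_text().splitlines():
        event = json.loads(line)
        if event["type"] == "stage_started":
            state["stages"][event["stage"]] = "running"
        elif event["type"] == "stage_finished":
            state["stages"][event["stage"]] = "done"
        elif event["type"] == "run_completed":
            state["done"] = True
    return state

# Resume: fold, then execute only stages not marked "done".
# Interrupt detection: a stage left in "running" with no finish event means the run died there.
```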
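
The DSPy rescue, trimmed to three fields (the real signature has 17). The point is structural: DSPy's default chat adapter gives every output field its own labelled section, so the model gets a local target per field instead of one sprawling blob.

```python
import dspy

dspy.configure(lm=dspy.LM("ollama_chat/qwen3.5:9b", api_base="http://localhost:11434", api_key=""))

class ClinicianReport(dspy.Signature):
    """Fill every report field from the validated extraction."""
    extraction_json: str = dspy.InputField(desc="validated lab extraction")
    chief_findings: str = dspy.OutputField()
    abnormal_labs: str = dspy.OutputField()
    recommendations: str = dspy.OutputField()
    # ...the remaining fields are declared the same way

generate = dspy.Predict(ClinicianReport)
report = generate(extraction_json='{"rows": []}')
print(report.chief_findings)
```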
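
The Mirostat knob, set through Ollama's sampler options. tau is the 4.0 above; which Mirostat version and what eta you run are the remaining knobs, shown here with plausible values rather than a prescription.

```python
import ollama

prompt = "Extract every lab row from the following report text: ..."  # long extraction prompt
response = ollama.chat(
    model="qwen3.5:9b",       # our local text model tag
    messages=[{"role": "user", "content": prompt}],
    options={
        "mirostat": 2,        # enable Mirostat sampling (2.0 shown here)
        "mirostat_tau": 4.0,  # target surprise; lower = more focused
        "mirostat_eta": 0.1,  # Ollama's default learning rate
    },
)
print(response["message"]["content"])
```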
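
The validator wiring, with stubbed tools (the real ones call OpenFDA, NLM, MedlinePlus, and the RAG index):

```python
import dspy  # assumes an LM configured as in the sketch above

def lookup_icd10(term: str) -> str:
    """Return candidate ICD-10-CM codes for an English condition name (real version calls NLM)."""
    return "E11.65: Type 2 diabetes mellitus with hyperglycemia"             # stub

def lookup_medlineplus(topic: str) -> str:
    """Return a consumer-health snippet for the topic (real version calls MedlinePlus)."""
    return "HbA1c reflects average blood sugar over roughly three months."   # stub

validator = dspy.ReAct("claim -> verdict, evidence", tools=[lookup_icd10, lookup_medlineplus])
result = validator(claim="HbA1c 9.1% with code E11.65 indicates poorly controlled type 2 diabetes")
print(result.verdict)
```

Every call the agent makes to one of these tools is what the UI turns into a badge.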
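
And stage one of the severity split, which is plain Python (the flagship set here is illustrative):

```python
FLAGSHIP = {"HbA1c", "LDL Cholesterol", "eGFR", "TSH"}   # illustrative flagship panel

def rows_to_surface(rows: list[dict]) -> list[dict]:
    # Stage 1, deterministic: every abnormal row plus the flagship tests, nothing else.
    return [r for r in rows if r["flag"] != "normal" or r["test"] in FLAGSHIP]

# Stage 2 is a dedicated LLM call per surfaced row: it assigns the tier and writes the
# explanation, but it cannot add rows to or remove rows from this list.
```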

Challenges we ran into

  • Evaluation flip-flop. A change that helped one example silently regressed another. Fixed by scoring every change against a frozen matrix of runs.
  • Degenerate generation loops. The trigger for the switch to DSPy.
  • Multilingual ICD-10-CM. NLM only speaks English, but the extractor was handing us FR/AR/VN strings. A three-layer fix (prompt → regex gate → translate-before-lookup) cut validation cost 6–16× (sketched after this list).
  • Generation rewriting extraction flags. The report LLM's prose reasoning quietly overrode deterministic severity flags. Fix: re-apply severity heuristics after prose, not before (also sketched after this list).
  • The lesson we kept relearning: if a rule matters for correctness, stating it in the prompt is necessary but never sufficient. Enforce it in code.
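
The three-layer ICD-10-CM fix, sketched. The regex is a deliberately crude gate and translate_to_english stands in for whatever translation call you have on hand; the lookup is NLM's Clinical Tables search, which only understands English terms.

```python
import re
import requests

ENGLISH_TERM = re.compile(r"^[A-Za-z0-9 ,\-()/.']+$")   # layer 2: crude "is this English?" gate

def icd10_lookup(term_en: str) -> list[tuple[str, str]]:
    # Layer 3 target: NLM Clinical Tables ICD-10-CM search (English-only).
    r = requests.get(
        "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search",
        params={"terms": term_en, "maxList": 5},
        timeout=10,
    )
    data = r.json()
    return [(code, name) for code, name in data[3]]      # last element holds [code, name] pairs

def validate_condition(term: str, translate_to_english) -> list[tuple[str, str]]:
    # Layer 1 is the prompt ("name conditions in English"); this is the code backstop.
    if not ENGLISH_TERM.match(term):
        term = translate_to_english(term)   # one cheap translation call instead of N failed lookups
    return icd10_lookup(term)
```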
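
And the backstop for the flag-rewriting bug, with hypothetical field names: severity is copied back from extraction after the prose is generated, so the generator can describe flags but never change them.

```python
def finalize_report(report: dict, extraction: dict) -> dict:
    # Runs AFTER prose generation: the generator may have rephrased or dropped flags
    # while reasoning in prose, so the deterministic extraction values win.
    by_test = {row["test"]: row for row in extraction["rows"]}
    for row in report.get("lab_rows", []):
        source = by_test.get(row["test"])
        if source:
            row["flag"] = source["flag"]
            row["severity"] = source["severity"]
    return report
```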

Accomplishments that we're proud of

  • Both challenges, one unified pipeline.
  • Mirostat stabilization — one config vector, one whole class of failures eliminated.
  • Multilingual ICD-10-CM pathway: 6–16× cheaper, still correct across four languages.
  • Event-sourced run store with resume/replay and no status-column drift.
  • A UI that shows its work instead of hiding it.

What we learned

  • Structure beats wording. Swapping to DSPy's field-by-field format fixed problems that better prompts couldn't.
  • Trust = inspectability. For clinical AI, surfacing tool calls and intermediate JSON matters more than polish.
  • Small models force honesty. You can't paper over bad pipeline design with more parameters.
  • Keep a rejection log. Our tuning/ notes saved us from re-deriving the same failures twice.

What's next for Eir

  • Clinician-in-the-loop edits fed back as labelled corrections.
  • Grounded patient chat on top of the summary.
  • Broader language coverage with proper per-language rubrics.
  • Distillation to an edge-deployable Qwen variant.

Built With

  • dspy
  • fastapi
  • pdfplumber
  • pyannote
  • qwen
  • react
  • tailwind
  • whisper