Lab Analyzer — Trustworthy Lab Reports for Real People

A patient just got their lab report back. They are alone with twelve abbreviations, six units, three reference ranges, and one creeping question: should I be worried? Lab Analyzer answers that question — and, just as importantly, refuses to answer when it cannot.


Inspiration

We did not want to build another lab-summary chatbot.

The starting point was simpler and harsher: patients get lab reports full of abbreviations, units, ranges, and edge cases, then get asked to make sense of them alone. That is a bad UX problem, but it is also a trust problem. A system like this either earns confidence row by row, or it becomes one more polished interface that says plausible things with no real spine behind them.

So we made the core boring, traceable, and hard to fool. The pipeline is split into strict stages with hard boundaries. The model layer paraphrases. It does not decide. Refusal is a feature, not an error.


What it does

Lab Analyzer takes a lab report and turns it into:

  • A patient summary that says what was found, how serious it is, what to do next, and — crucially — what was not assessed
  • A clinician handoff with full provenance for every claim
  • A conversational assistant (Chat with Elfie, powered by Qwen) that answers follow-up questions grounded only in the structured findings, with memory across past uploads

Two lanes carry the work:

  • Trusted PDF. Input: machine-generated reports. Discipline: strict, every row anchored to source page + row hash.
  • Image Beta. Input: screenshots, scans, camera captures. Discipline: cautious, must pass the same gates as PDF before promotion.

Unsupported rows stay visible. Ambiguous mappings are not quietly promoted. The model never invents values, severity, or actions. If the acceptance contract fails, the pipeline stops — and the UI says so out loud.


How Qwen earns its keep

This is a Qwen Build Day project, and Qwen lives in exactly the right place — at the edge of the system, never in the trust path.

Where Qwen runs

  1. Patient-facing explanation — paraphrasing structured findings into plain language at the end of the pipeline, after severity and next-step assignment
  2. Chat with Elfie — the in-app conversational assistant (qwen-plus via the DashScope OpenAI-compatible endpoint) that answers patient follow-ups and cross-report questions

Where Qwen does not run

  • Rule firing
  • Severity assignment
  • Next-step policy
  • Analyte mapping decisions
  • Anything that touches the clinical claim path

How we constrain it

The Chat with Elfie system prompt is engineered to make hallucination structurally hard, not just discouraged:

  • Every reply MUST be a short bold title plus 2–4 bullets, ≤ 80 words, ≤ 18 words per bullet
  • Every specific claim must be grounded in a structured summary the client builds from the user's stored artifacts
  • "Not in your stored reports" is required when a detail is missing — invention is forbidden
  • Out-of-scope asks (diagnosis, medication, symptoms) trigger a templated polite redirect, not a freeform answer
  • temperature: 0.15, max_tokens: 260, top_p: 0.8 for tight format adherence
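For flavor, this is roughly what that call looks like against the DashScope OpenAI-compatible endpoint. A minimal Python sketch, not the shipped client (the real one lives in the TypeScript frontend); the prompt text, helper name, and environment variable are illustrative:

    # Illustrative sketch of the constrained Elfie call.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],  # assumed env var name
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )

    FORMAT_CONTRACT = (
        "Reply with a short bold title plus 2-4 bullets, at most 80 words "
        "total and 18 words per bullet. Ground every claim in the structured "
        "summary below. If a detail is missing, say 'Not in your stored "
        "reports'. Politely redirect diagnosis, medication, or symptom asks."
    )

    def ask_elfie(structured_summary: str, question: str) -> str:
        resp = client.chat.completions.create(
            model="qwen-plus",
            temperature=0.15,
            max_tokens=260,
            top_p=0.8,
            messages=[
                {"role": "system", "content": f"{FORMAT_CONTRACT}\n\n{structured_summary}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content or ""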

The result is an assistant that feels free-form to the patient, but is mechanically bounded by the structured packet the deterministic core produced. Qwen amplifies trust instead of leaking it.


The patient experience

The frontend is a Vite + React + TypeScript app, built mobile-first, with intent everywhere:

  • One-tap chat surface. No FAQ scroll, no friction. Tap "Chat with Elfie" → the chat fills the viewport with a sticky composer, suggestion chips, and an animated avatar.
  • Structured answers, not walls of text. Assistant replies render as a bold title + tight bullets via an in-component markdown renderer. Title, bullets, and bold values are typographically distinct so a glance tells you the shape of the answer.
  • A real "Elfie is thinking…" state. Pulsing label + three staggered bouncing dots + a 220ms bubble entrance animation. Honors prefers-reduced-motion.
  • Memory across uploads. Every artifact (live or fixture) is persisted to localStorage and folded into the system prompt as a compact, LLM-friendly summary so the assistant can answer cross-report questions like "how did my HDL change?"
  • Severity overview that respects hierarchy. Symbol-led subsections, top-right severity glyph, no nested gray boxes — the page reads in one downward sweep.
  • A featured navigation card for chat. Dark navy → magenta gradient with an animated avatar, online dot, decorative sparkles, and an explicit "Open chat →" pill — it reads as a major feature entry point, not another marketing banner.

And every surface is resilient to backend failure. The API client transparently swaps in fixture data when the backend is unreachable, returns 5xx, returns 404, returns non-JSON, or reports a terminal job failure. The processing screen has a synthetic progress ticker with ease-out cubic pacing so the loading ring stays alive even when the backend doesn't emit fine-grained substeps. A patient demoing this on a flaky network sees a working product, not a stack trace.
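The fallback decision itself is small. Sketched here in Python to match the other snippets in this writeup (the real client is TypeScript; the checks paraphrase the list above):

    # When any of these trip, the client silently swaps in fixture data.
    import json

    def should_use_fixtures(status: int | None, body: str | None,
                            terminal_failure: bool) -> bool:
        if status is None:                    # network unreachable
            return True
        if status >= 500 or status == 404:    # backend error or missing route
            return True
        try:
            json.loads(body or "")
        except json.JSONDecodeError:          # non-JSON body
            return True
        return terminal_failure               # backend reported a dead job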


How we built it

Backend. FastAPI + async SQLAlchemy + Alembic. The pipeline orchestrator (backend/app/workers/pipeline.py) walks an explicit stage graph:

preflight → lane_selection → extraction → extraction_qa →
observation_build → analyte_mapping → ucum_conversion →
panel_reconstruction → rule_evaluation → severity_assignment →
nextstep_assignment → patient_artifact → clinician_artifact →
lineage_persist

Every stage is a contract. Cross a stage boundary, satisfy the contract or stop.
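A minimal sketch of the pattern (names are illustrative; the real orchestrator in backend/app/workers/pipeline.py carries far more state):

    from dataclasses import dataclass
    from typing import Callable

    class ContractViolation(Exception):
        """Raised when a stage's output fails its acceptance contract."""

    @dataclass
    class Stage:
        name: str
        run: Callable[[dict], dict]        # transforms the job state
        contract: Callable[[dict], bool]   # acceptance check on the result

    def run_pipeline(stages: list[Stage], state: dict) -> dict:
        for stage in stages:
            state = stage.run(state)
            if not stage.contract(state):
                # No partial promotion: the job stops here and the UI says so.
                raise ContractViolation(f"{stage.name}: contract not satisfied")
        return state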

Normalization is where most of the real work lives. We do not treat test names as free text and hope the model figures it out. We resolve analytes through a bounded candidate set, keep the candidate trace, enforce abstention when confidence or context is weak, and only accept a mapping when the contract is satisfied. Units go through a UCUM-aware engine into analyte-scoped canonical conversions using Decimal arithmetic — because floating-point shortcuts have no place in medical interpretation.
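In code, "analyte-scoped canonical conversion with Decimal" looks roughly like this (the factor table is illustrative, not the shipped UCUM engine):

    from decimal import Decimal

    # Canonical target + factor per (analyte, source unit); e.g. glucose
    # mg/dL -> mmol/L divides by 18.016 (molar-mass scaling).
    CANONICAL = {
        ("glucose", "mg/dL"): ("mmol/L", Decimal("1") / Decimal("18.016")),
        ("glucose", "mmol/L"): ("mmol/L", Decimal("1")),
    }

    def to_canonical(analyte: str, value: str, unit: str) -> tuple[Decimal, str]:
        target_unit, factor = CANONICAL[(analyte, unit)]
        # Decimal(str) keeps the printed value exact; no float round-trip.
        return Decimal(value) * factor, target_unit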

Empirical ML is on a short leash. It gets one job: analyte candidate ranking and abstention calibration. It does not own rule firing, severity, next-step assignment, contradiction handling, or report assembly. The clinical claim path is deterministic by design.
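The leash, sketched with made-up thresholds:

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        loinc_code: str
        score: float

    def map_analyte(candidates: list[Candidate],
                    accept_at: float = 0.92,
                    margin: float = 0.10) -> str | None:
        """Return a LOINC code, or None to abstain (row stays 'not assessed')."""
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        if not ranked or ranked[0].score < accept_at:
            return None   # low confidence: abstain
        if len(ranked) > 1 and ranked[0].score - ranked[1].score < margin:
            return None   # near-tie: never quietly promote an ambiguous mapping
        return ranked[0].loinc_code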

The image lane is deliberately separate. It classifies image input, runs OCR + layout assist, produces an extraction preview, and is only allowed to promote into the trusted lane if it passes the same row, coverage, and false-support gates as PDFs. Otherwise it stops and asks for a better input.

Provenance is first-class. Every output ships with: parser backend, row assembly version, terminology release, mapping thresholds, unit engine version, rule pack version, severity policy version, next-step policy version, template version, and build commit. That payload is what makes reprocessing, benchmarking, and failure analysis real instead of performative.
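The payload shape, roughly (field names follow the list above; types are illustrative):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Provenance:
        parser_backend: str            # e.g. "pymupdf-1.27.x"
        row_assembly_version: str
        terminology_release: str       # LOINC release used for mapping
        mapping_thresholds: dict
        unit_engine_version: str
        rule_pack_version: str
        severity_policy_version: str
        nextstep_policy_version: str
        template_version: str
        build_commit: str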


Built with

  • Backend: Python 3.11, FastAPI, async SQLAlchemy, Alembic, Pydantic, Decimal arithmetic, UCUM unit engine
  • Frontend: Vite, React 18, TypeScript, mobile-first responsive design, custom CSS animation system
  • AI / LLM: Qwen (qwen-plus) via the DashScope OpenAI-compatible endpoint
  • Data layer: PostgreSQL, Supabase Auth
  • Pipeline: custom staged orchestrator with lineage persistence
  • Resilience: transparent mock-fallback layer with HTTP-status, body-shape, and terminal-failure detection
  • Tooling: Docker Compose, ESLint, TypeScript strict mode, Playwright

Challenges we ran into

Refusing to let explanation compensate for weak extraction. Lab reports are messy in very specific ways. Reading order drifts. Tables collapse. The same analyte label can mean different things in context. Units look almost right until they are not. We had to get comfortable saying "we cannot assess this" long before the UI felt satisfying. That was the right call. It is not an easy one when you are trying to demo a product.

Keeping the deterministic core intact while handling real-world input. It is tempting to let a vision model smooth over ambiguity. We did the opposite. The image lane exists, but it is explicitly beta and is not allowed to smuggle uncertain extraction into trusted severity or action classes.

Constraining Qwen without making it feel constrained. Early prompts gave us prose paragraphs, hedged language, and the occasional invented threshold. The current prompt is a strict format contract — bold title, 2–4 bullets, ≤ 18 words per bullet, grounded-only — paired with temperature: 0.15. Patients still feel like they are chatting with someone helpful. The system mechanically cannot drift.

Two audiences, one pipeline. Experts care about provenance. Patients care about whether they should worry. We split the artifact layer in two — patient view and clinician-share view, both rendered from the same structured packet — so neither audience gets a watered-down version.
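A sketch of that split, with an illustrative packet shape:

    # Both views read the same structured packet; neither re-derives claims.
    def render_patient(packet: dict) -> str:
        lines = [f"{f['analyte']}: {f['severity']} - {f['next_step']}"
                 for f in packet["findings"]]
        lines += [f"{row['label']}: not assessed" for row in packet["not_assessed"]]
        return "\n".join(lines)

    def render_clinician(packet: dict) -> str:
        return "\n".join(
            f"{f['analyte']} ({f['loinc']}) p.{f['source_page']} "
            f"row {f['row_hash'][:8]}: {f['value']} {f['unit']}"
            for f in packet["findings"]
        )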

Production-grade resilience for a hackathon demo. We hardened the API client to fall back to fixture data on network errors, 5xx, 404, non-JSON bodies, and terminal job failures. The processing screen synthesizes its own progress ticker so the UI never freezes when the backend is silent. The judges are not going to see a broken loading spinner.


Accomplishments we're proud of

  • A real trust boundary. The pipeline isn't a flowchart in a slide deck — it's an enforced stage graph with contracts at every edge.
  • Deterministic clinical claims, generative explanation. The split is architecturally enforced. Qwen literally cannot reach into severity assignment.
  • A chat assistant that respects its scope. Bullet-formatted, grounded, scope-aware, with cross-report memory — and engineered to refuse, not hallucinate.
  • An interface a patient can actually use. Mobile-first, animated, intuitive — the Chat with Elfie button reads as a feature, not a banner.
  • Honesty about edges. Partial support, unsupported rows, "cannot assess" — first-class outputs, not error states. Most demos hide these. Ours leans in.
  • Resilience as a feature. Backend down, slow, or returning garbage? The user sees a working product. The mock layer is silent and seamless.

What we learned

"AI for lab understanding" is a misleading label. The real problem is systems engineering under clinical ambiguity. Once we accepted that, the design got better. Extraction became empirical. Mapping became auditable hybrid software. Severity and next-step policy stayed deterministic. Explanation moved to the edge. That single architectural decision cleaned up almost everything else.

Hyper-optimization is useful only after the trust path is stable. The most valuable code path is the one where parser truth, normalization truth, policy truth, and rendering truth stay aligned. We benchmarked the boundaries before tuning anything inside them.

Customer reliability is different from model accuracy. A patient does not care that your mapping benchmark improved by two points if they still cannot tell what was flagged, what was not assessed, and what to do next. That pushed us toward a stronger UX contract and away from generic chat. The right answer was not more AI. It was a cleaner artifact.


What's next

The next step is not to widen claims. It is to widen proof.

  • Harden the trusted PDF lane and deepen the benchmark pack
  • Validate the patient artifact on real comprehension tasks
  • Better image handling in the beta lane
  • Comparable-history deltas where assay compatibility is real
  • Stronger validated language packs (English + Vietnamese first, more on demand)
  • Tighter integration with Elfie's Health Report and sharing flows
  • Stream Qwen responses token-by-token into the chat for an even tighter feedback loop
  • Extend the conversational layer with structured-data tool calls so Qwen can ask the deterministic core for facts on demand instead of relying on the static system-prompt context

Lab Analyzer matters if it can do one thing exceptionally well: take a supported lab report, turn it into a traceable summary a patient can actually use, and make its limits visible instead of hiding them.

That is the bar we built for. And we built it on Qwen.

Built With

  • alembic
  • deterministic rule/severity/next-step engines
  • doctr
  • fastapi
  • loinc-terminology-artifacts
  • object-storage
  • postgres-backed jobs + single worker
  • postgresql-16
  • python
  • Qwen-only models (qwen3-coder-plus, qwen3-coder-next, qwen-plus, qwen-turbo, qwen-vl-max, qwen-mt)
  • react
  • sqlalchemy
  • surya
  • trusted PDF lane + image beta lane (pdfplumber in the contract; the pipeline also records a PyMuPDF 1.27.x trusted parser backend)
  • typescript
  • ucum-aware normalization
  • vite