Test for Medical Stepwise Predictions

[Gallery: Architectural Design for Recursive LLM Candidate Selection (image); Visualize, Diagnosis Traversal, Feedback Injection, Unit test-like Benchmarking, Traversal Report (GIFs)]

The Problem

Every day, healthcare runs on decisions that have to be specific, with traceable steps.

But most AI systems that extract diagnoses still behave like final-answer machines. You give them clinical text and they return a list of conditions, while the most important part stays hidden: how they got there. That black box is a problem in a domain where small differences in specificity can change the clinical story, where comorbidities matter, and where consistency across cases is essential for trust.

Diagnosis extraction is not a single-step task. It is a long-horizon process. Models have to move from broad context to precise conditions through a sequence of narrowing decisions. When that process is compressed into one output, it becomes hard to evaluate what the model truly understands, hard to pinpoint failure modes, and hard to improve the system in a targeted way.

Solution

TMSP (Test for Medical Stepwise Predictions) is a framework that evaluates and develops medical language models through stepwise, constrained traversal of a clinical diagnosis pathway.

TMSP uses ICD-10 as its clinical diagnosis scaffold. The hierarchical ontology and its relationships impose real constraints on how a model can move along a diagnosis pathway: they force progression from general to specific while encoding the lateral links and contributory conditions that matter in practice.

Instead of asking a model for final codes or final diagnoses directly, TMSP guides it through a sequence of candidate-selection steps. At each step the model:

Selects relevant candidates from the next level of the hierarchy
Provides structured reasoning for those selections
Traverses both hierarchical and lateral relationships such as codeFirst, codeAlso, useAdditionalCode, and sevenChrDef

The result is not just an output. It is a traceable decision tree of what the model believed, when it believed it, and why.
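The candidate-selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TMSP's actual orchestrator: the `CHILDREN` map is a toy slice of the ICD-10-CM hierarchy, `traverse`, `Step`, and `keyword_selector` are hypothetical names, and the keyword matcher stands in for the LLM call. Lateral relationships are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable

# Toy slice of the hierarchy (assumption: the real index is far larger).
CHILDREN = {
    "ROOT": ["E08-E13"],
    "E08-E13": ["E10", "E11"],
    "E11": ["E11.2", "E11.9"],
    "E11.2": ["E11.21", "E11.22"],
}

@dataclass
class Step:
    parent: str            # node whose children were offered
    candidates: list[str]  # options shown to the selector
    selected: list[str]    # options the selector kept
    reasoning: str         # selector's stated justification

# A selector takes (note, parent, candidates) and returns (selected, reasoning).
Selector = Callable[[str, str, list[str]], tuple[list[str], str]]

def traverse(note: str, select: Selector, root: str = "ROOT"):
    """Walk the hierarchy breadth-first, recording every decision."""
    trace, final, frontier = [], [], [root]
    while frontier:
        parent = frontier.pop(0)
        candidates = CHILDREN.get(parent, [])
        if not candidates:          # leaf reached: keep it as a final code
            final.append(parent)
            continue
        selected, reasoning = select(note, parent, candidates)
        # Constrain the selector: anything outside the offered set is dropped.
        selected = [c for c in selected if c in candidates]
        trace.append(Step(parent, candidates, selected, reasoning))
        frontier.extend(selected)
    return trace, final

# Hypothetical stand-in for the model: picks candidates by keyword.
HINTS = {
    "E08-E13": "diabetes", "E10": "type 1", "E11": "type 2",
    "E11.2": "kidney", "E11.9": "without complications",
    "E11.21": "nephropathy", "E11.22": "chronic kidney disease",
}

def keyword_selector(note, parent, candidates):
    text = note.lower()
    chosen = [c for c in candidates if c in HINTS and HINTS[c] in text]
    return chosen, f"keyword match among children of {parent}"

trace, final = traverse(
    "Type 2 diabetes with diabetic chronic kidney disease.", keyword_selector
)
for step in trace:
    print(step.parent, "->", step.selected, "|", step.reasoning)
print("final codes:", final)
```

In this sketch a branch simply ends when the selector keeps nothing; a real system would need a rule for whether the parent then counts as a final, less specific code.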

Why It Matters

TMSP turns diagnosis extraction into something you can inspect, benchmark, and improve.

This is not just black-box evaluation. It is a framework for developing medical language models that are transparent, testable, and tunable, where trust comes from illuminating the path rather than from taking a leap in the dark.

With TMSP you can:

See exactly where a model diverges from an expected trajectory
Detect undershooting and overshooting of specificity
Catch missed contributory conditions through lateral relationships
Measure consistency by running the same case multiple times
Inject targeted feedback at a specific decision point to nudge the model towards better decisions
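As one way to make the consistency item above concrete, the final code sets from repeated runs of the same case can be compared pairwise. The sketch below uses mean pairwise Jaccard similarity; that metric is an assumption chosen for illustration, not necessarily what TMSP computes, and `run_consistency` is a hypothetical name.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two code sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty runs agree trivially
    return len(a & b) / len(a | b)

def run_consistency(runs: list[set[str]]) -> float:
    """Mean Jaccard similarity over every pair of runs (1.0 = identical)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run has nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical runs of the same clinical note.
runs = [{"E11.22", "I10"}, {"E11.22", "I10"}, {"E11.22"}]
score = run_consistency(runs)
print(round(score, 3))
```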

What I Built

TMSP ships with an interactive web app and a programmatic API. It was built with Gemini-3-Pro, and you can use any of the Google Gemini models in a secure Vertex AI environment.

The web frontend includes:

Visualize: Enter ICD-10 codes and view the minimal connected graph linking them, including lateral relationships.
Traverse: Stream the stepwise traversal on a clinical note in real time, with reasoning captured at each decision.
Benchmark: Compare traversal results against expected codes, with metrics for exact match, undershoot, overshoot, missed, and traversal recall.
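The benchmark categories named above could be computed roughly as follows. The definitions here are assumptions made for the sketch, since the page does not spell them out: an expected code counts as an undershoot when only one of its ancestors was predicted, as an overshoot when only one of its descendants was predicted, and traversal recall is taken to be the share of expected codes that the traversal visited at any point. Ancestry is approximated by string prefix, which only holds within a single ICD-10-CM category.

```python
def is_ancestor(a: str, b: str) -> bool:
    """Prefix stand-in for hierarchy: 'E11.2' is an ancestor of 'E11.22'."""
    return a != b and b.startswith(a)

def classify(expected: set[str], predicted: set[str]) -> dict[str, set[str]]:
    """Assign each expected code to exactly one outcome bucket."""
    exact = expected & predicted
    rest = expected - exact
    # Model stopped too early: it predicted a less specific ancestor.
    undershoot = {e for e in rest if any(is_ancestor(p, e) for p in predicted)}
    rest -= undershoot
    # Model went too deep: it predicted a more specific descendant.
    overshoot = {e for e in rest if any(is_ancestor(e, p) for p in predicted)}
    missed = rest - overshoot
    return {"exact": exact, "undershoot": undershoot,
            "overshoot": overshoot, "missed": missed}

def traversal_recall(expected: set[str], visited: set[str]) -> float:
    """Share of expected codes that appeared anywhere in the traversal."""
    return len(expected & visited) / len(expected) if expected else 1.0

expected = {"E11.22", "I10", "N18.3", "E78.5"}
predicted = {"E11.2", "I10", "N18.30"}
visited = {"E11", "E11.2", "I10", "N18", "N18.3", "N18.30"}

result = classify(expected, predicted)
recall = traversal_recall(expected, visited)
for name, codes in result.items():
    print(name, sorted(codes))
print("traversal recall:", recall)
```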

Under the hood, TMSP uses:

A traversal orchestrator to manage decision batches across the hierarchy
A candidate selector with structured outputs and multi-provider support
A graph index for ICD-10 hierarchy and relationship traversal
Cacheable cross-run persistence and within-run consistency checks
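To illustrate what a graph index and the minimal connected graph from the Visualize view might involve, here is a small sketch over a parent map. The `PARENT` table is a toy fragment, `path_to_root` and `minimal_subtree` are hypothetical names, and lateral edges are omitted; the idea shown is only that the smallest subtree linking a set of codes is the union of their upward paths, cut off at their lowest common ancestor.

```python
# Toy child -> parent map (assumption: every code shares the same root).
PARENT = {
    "E11.22": "E11.2", "E11.21": "E11.2",
    "E11.2": "E11", "E11.9": "E11",
    "E11": "E08-E13", "E10": "E08-E13",
    "E08-E13": "ROOT",
}

def path_to_root(code: str) -> list[str]:
    """Return [code, parent, grandparent, ..., root]."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def minimal_subtree(codes: list[str]) -> set[str]:
    """Nodes of the smallest subtree connecting all given codes."""
    paths = [path_to_root(c) for c in codes]
    # Ancestors shared by every code (including the codes themselves).
    common = set(paths[0]).intersection(*paths[1:])
    # The lowest common ancestor is the first shared node on any upward path.
    lca = next(node for node in paths[0] if node in common)
    nodes: set[str] = set()
    for p in paths:
        nodes.update(p[: p.index(lca) + 1])  # keep each path up to the LCA
    return nodes

subtree = minimal_subtree(["E11.22", "E11.9"])
print(sorted(subtree))
```

The sketch assumes a non-empty input and a single shared root; codes from disconnected trees would need explicit handling.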

It also supports a zero-shot mode for comparison, so you can directly evaluate how much the stepwise constraints change outcomes compared with single-pass generation.

Tech Stack

Backend: FastAPI with server-sent events for streaming traversals
Frontend: React for visualization and interaction, AG-UI for UI event streaming
Orchestration: Burr-based state machine for traversal control
Data: ICD-10-CM index with hierarchical and lateral relationships
Caching: SQLite persistence plus in-memory caching for repeated batches
Providers: OpenAI, Anthropic, Vertex AI, Cerebras, SambaNova, with structured output support where available
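For readers unfamiliar with server-sent events, the sketch below shows the wire format a streaming traversal endpoint would emit. It uses only the standard library; the event names (`step`, `done`) and the payload fields are assumptions for illustration, not TMSP's actual schema.

```python
import json
from typing import Iterable, Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one server-sent event: named event, JSON payload, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_steps(steps: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE chunk per traversal step, then a closing 'done' event."""
    count = 0
    for step in steps:
        yield sse_event("step", {"index": count, **step})
        count += 1
    yield sse_event("done", {"steps": count})

# Hypothetical traversal steps, as a traversal loop might produce them.
chunks = list(stream_steps([
    {"parent": "E11", "selected": ["E11.2"]},
    {"parent": "E11.2", "selected": ["E11.22"]},
]))
print("".join(chunks), end="")
```

In a FastAPI app, a generator like `stream_steps` can be returned through `StreamingResponse` with `media_type="text/event-stream"`, so the browser receives each decision as soon as it is made.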

What’s Next

TMSP is built to be more than a benchmark.

Planned additions include:

Arena: Head-to-head benchmarking across models with per-step accuracy and reasoning comparisons
Dataset generation: Export traversal traces into fine-tuning and RL-friendly formats
Diagnosis querying: Integrate external clinical knowledge to propose alternative diagnoses and surface missed opportunities

Built With

ag-ui
burr
python
typescript
vertexai

Updates

Patrick Damaso MD started this project — Feb 09, 2026 06:26 PM EST
