The Problem
Every day, healthcare runs on decisions that have to be specific, with traceable steps.
But most AI systems that extract diagnoses still behave like final-answer machines. You give them clinical text and they return a list of conditions, while the most important part stays hidden: how they got there. That black box is a problem in a domain where small differences in specificity can change the clinical story, where comorbidities matter, and where consistency across cases is essential for trust.
Diagnosis extraction is not a single-step task. It is a long-horizon process. Models have to move from broad context to precise conditions through a sequence of narrowing decisions. When that process is compressed into one output, it becomes hard to evaluate what the model truly understands, hard to pinpoint failure modes, and hard to improve the system in a targeted way.
Solution
TMSP is a framework that evaluates and develops medical language models through stepwise, constrained traversal of a clinical diagnosis pathway.
TMSP uses ICD-10 as its clinical diagnosis scaffold. The hierarchical ontology and its relationships impose real constraints on how a model can move along a diagnosis pathway: they force progression from general to specific, and they encode the lateral links and contributory conditions that matter in practice.
Instead of asking a model for final codes or final diagnoses directly, TMSP guides it through a sequence of candidate-selection steps. At each step the model:
- Selects relevant candidates from the next level of the hierarchy
- Provides structured reasoning for those selections
- Traverses both hierarchical and lateral relationships, such as codeFirst, codeAlso, useAdditionalCode, and sevenChrDef
The result is not just an output. It is a traceable decision tree of what the model believed, when it believed it, and why.
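To make the stepwise idea concrete, here is a minimal sketch of one traversal over a toy hierarchy. Everything in it is an assumption made for illustration: the `Step` record, the `traverse` function, the selector signature, and the tiny `CHILDREN` and `LATERAL` tables are hypothetical and are not TMSP's actual API or data. The keyword-based selector stands in for a model call.

```python
from dataclasses import dataclass
from typing import Callable

# Toy fragment for illustration only; the real ICD-10-CM index is far larger.
CHILDREN: dict[str, list[str]] = {
    "ROOT": ["E11", "I10"],
    "E11": ["E11.2", "E11.9"],
    "E11.2": ["E11.21", "E11.22"],
    "N18": ["N18.4", "N18.5"],
}
# Lateral links keyed by relationship name (simplified to one target node).
LATERAL: dict[str, dict[str, list[str]]] = {
    "E11.22": {"useAdditionalCode": ["N18"]},
}


@dataclass
class Step:
    node: str              # where the decision was made
    via: str               # "children" or the lateral relationship followed
    candidates: list[str]  # options shown to the model
    selected: list[str]    # options the model kept
    reasoning: str         # the model's stated justification


# Hypothetical selector: (note, node, candidates) -> (selected, reasoning)
Selector = Callable[[str, str, list[str]], tuple[list[str], str]]


def traverse(note: str, select: Selector, start: str = "ROOT") -> tuple[list[str], list[Step]]:
    """Walk the hierarchy breadth-first, recording one Step per decision."""
    final: list[str] = []
    trace: list[Step] = []
    queue: list[tuple[str, str]] = [(start, "children")]
    seen: set[str] = set()
    while queue:
        node, via = queue.pop(0)
        if node in seen:
            continue
        seen.add(node)
        candidates = CHILDREN.get(node, [])
        selected: list[str] = []
        if candidates:
            selected, reasoning = select(note, node, candidates)
            trace.append(Step(node, via, candidates, selected, reasoning))
        if selected:
            # Descend one level into every candidate the selector kept.
            queue.extend((child, "children") for child in selected)
            continue
        if node != start:
            # Nothing more specific was chosen, so this node is a final code.
            final.append(node)
            # Follow lateral links from the final code.
            for relation, targets in LATERAL.get(node, {}).items():
                queue.extend((target, relation) for target in targets)
    return final, trace


def keyword_selector(note: str, node: str, candidates: list[str]) -> tuple[list[str], str]:
    """Stand-in for a model call: keeps candidates whose hint appears in the note."""
    hints = {
        "E11": "diabetes",
        "E11.2": "kidney",
        "E11.22": "chronic kidney disease",
        "N18.4": "stage 4",
    }
    text = note.lower()
    chosen = [c for c in candidates if c in hints and hints[c] in text]
    return chosen, f"keyword match at {node}: {chosen}"
```

Calling `traverse("Type 2 diabetes with chronic kidney disease, stage 4.", keyword_selector)` returns the final codes together with the list of `Step` records, so each narrowing decision and each lateral hop can be inspected on its own.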
Why It Matters
TMSP turns diagnosis extraction into something you can inspect, benchmark, and improve.
TMSP is more than a black-box evaluation. It is a framework for developing medical language models that are transparent, testable, and tunable, where trust is earned by illuminating the path rather than by taking a leap in the dark.
With TMSP you can:
- See exactly where a model diverges from an expected trajectory
- Detect undershooting and overshooting of specificity
- Catch missed contributory conditions through lateral relationships
- Measure consistency by running the same case multiple times
- Inject targeted feedback at a specific decision point to nudge the model towards better decisions
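Two of the checks above can be sketched in a few lines. This is an illustration under stated assumptions, not TMSP's implementation: the text does not say how consistency is scored, so mean pairwise Jaccard overlap is used here as one plausible choice, and the function names `first_divergence` and `run_consistency` are hypothetical.

```python
from itertools import combinations
from typing import Optional


def first_divergence(expected: list[str], actual: list[str]) -> Optional[int]:
    """Return the index of the first step where two paths differ, or None if identical."""
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return i
    if len(expected) != len(actual):
        # One path stopped early or kept going past the other.
        return min(len(expected), len(actual))
    return None


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two code sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_consistency(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard overlap of final code sets across repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score of 1.0 means every run of the same case produced the same codes. Lower values flag cases worth inspecting step by step, and `first_divergence` points to the decision where a targeted nudge would go.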
What I Built
TMSP ships with an interactive web app and a programmatic API. It was built with Gemini-3-Pro, and it lets you use all the Google Gemini models in a secure Vertex AI environment.
The web frontend includes:
- Visualize: Enter ICD-10 codes and view the minimal connected graph linking them, including lateral relationships.
- Traverse: Stream the stepwise traversal on a clinical note in real time, with reasoning captured at each decision.
- Benchmark: Compare traversal results against expected codes, with metrics for exact match, undershoot, overshoot, missed, and traversal recall.
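The Benchmark outcomes can be illustrated with a small classifier. The definitions used here are assumptions, since the text names the metrics without defining them: an undershoot is a prediction that stops at an ancestor of the expected code, an overshoot is a prediction that goes deeper than the expected code, and ancestry is approximated by prefix matching on the code string rather than by a real graph lookup. Traversal recall is left out because its definition is not given. The function names are hypothetical.

```python
def _norm(code: str) -> str:
    """Normalize a code for comparison, e.g. 'e11.22' -> 'E1122'."""
    return code.replace(".", "").upper()


def is_ancestor(a: str, b: str) -> bool:
    """True if code a is a strictly less specific prefix of code b."""
    a, b = _norm(a), _norm(b)
    return a != b and b.startswith(a)


def classify(expected: set[str], predicted: set[str]) -> dict[str, list[str]]:
    """Bucket each expected code by how the predictions relate to it."""
    result: dict[str, list[str]] = {
        "exact": [], "undershoot": [], "overshoot": [], "missed": [],
    }
    predicted_norm = {_norm(p) for p in predicted}
    for exp in sorted(expected):
        if _norm(exp) in predicted_norm:
            result["exact"].append(exp)
        elif any(is_ancestor(p, exp) for p in predicted):
            # The model stopped too early on the right branch.
            result["undershoot"].append(exp)
        elif any(is_ancestor(exp, p) for p in predicted):
            # The model went deeper than the expected code.
            result["overshoot"].append(exp)
        else:
            result["missed"].append(exp)
    return result
```

For example, with expected codes `{"E11.22", "N18.4", "I10"}` and predictions `{"E11.2", "N18.4"}`, `N18.4` is an exact match, `E11.22` is an undershoot, and `I10` is missed.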
Under the hood, TMSP uses:
- A traversal orchestrator to manage decision batches across the hierarchy
- A candidate selector with structured outputs and multi-provider support
- A graph index for ICD-10 hierarchy and relationship traversal
- Cacheable cross-run persistence and within-run consistency checks
It also supports a zero-shot mode for comparison, so you can directly evaluate how much the stepwise constraints change outcomes compared with single-pass generation.
Tech Stack
- Backend: FastAPI with server-sent events for streaming traversals
- Frontend: React for visualization and interaction, AG-UI for UI event streaming
- Orchestration: Burr-based state machine for traversal control
- Data: ICD-10-CM index with hierarchical and lateral relationships
- Caching: SQLite persistence plus in-memory caching for repeated batches
- Providers: OpenAI, Anthropic, Vertex AI, Cerebras, SambaNova, with structured output support where available
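The caching layer in the stack above could look roughly like the following. This is a sketch under assumptions, not the project's code: the `BatchCache` class, the table layout, and the choice to key a batch by a hash of the note, node, and candidate list are all hypothetical. It uses only the Python standard library.

```python
import hashlib
import json
import sqlite3
from typing import Any, Optional


class BatchCache:
    """Two-tier cache: an in-memory dict in front of a SQLite table."""

    def __init__(self, path: str = ":memory:") -> None:
        self._mem: dict[str, Any] = {}
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS batches "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(note: str, node: str, candidates: list[str]) -> str:
        """Hash the inputs of one decision batch into a stable key."""
        payload = json.dumps(
            {"note": note, "node": node, "candidates": sorted(candidates)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        # Fast path: already seen during this process.
        if key in self._mem:
            return self._mem[key]
        row = self._db.execute(
            "SELECT value FROM batches WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        value = json.loads(row[0])
        self._mem[key] = value  # promote to memory for repeated batches
        return value

    def put(self, key: str, value: Any) -> None:
        self._mem[key] = value
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO batches (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )
```

Passing a file path instead of `":memory:"` would make results persist across runs, while the dict keeps repeated batches within a run from touching the database.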
What’s Next
TMSP is built to be more than a benchmark.
Planned additions include:
- Arena: Head-to-head benchmarking across models with per-step accuracy and reasoning comparisons
- Dataset generation: Export traversal traces into fine-tuning and RL-friendly formats
- Diagnosis querying: Integrate external clinical knowledge to propose alternative diagnoses and surface missed opportunities
Built With
- ag-ui
- burr
- python
- typescript
- vertexai