Inspiration
Large language models are increasingly being used in healthcare applications, but they often struggle with longitudinal reasoning, especially when interpreting lab values across multiple timepoints. During experimentation with clinical-style summaries, we observed that language models could incorrectly infer trends, overstate conclusions, or hallucinate clinical interpretations when analyzing sequential patient reports.
This project was inspired by the idea that clinical AI systems should not rely entirely on probabilistic language generation for temporal reasoning. Instead, deterministic trend analysis and structured reasoning should guide the explanation layer.
We chose thyroid lab analysis as a focused use case because thyroid markers such as TSH, T3, and T4 naturally evolve over time and require careful interpretation of trends rather than isolated values.
What it does
The system analyzes longitudinal thyroid lab reports and generates evidence-aware clinical summaries using a hybrid reasoning approach. The application:
- extracts and organizes thyroid lab values chronologically
- detects improving, worsening, stable, and fluctuating patterns
- identifies meaningful longitudinal changes
- assigns risk/concern levels
- generates cautious, structured clinical explanations
- reduces hallucinated reasoning by separating deterministic analysis from language generation The system is integrated into Prompt Opinion through MCP (Model Context Protocol).
How we built it
Patient Reports ↓ Deterministic Pattern Engine ↓ Evidence Layer ↓ Controlled LLM Explanation ↓ Structured Clinical Output
Core Components
- Pattern Detection Engine We implemented deterministic logic in Python to analyze:
- directional trends
- threshold crossings
- fluctuating behavior
- improving/worsening tendencies This prevents the language model from independently inferring numerical trends.
- Evidence Layer An evidence-aware layer was added to provide cautious interpretation guidance and reduce unsupported conclusions.
- Controlled Prompting The LLM was constrained through structured prompting rules: -avoid hallucinated diagnoses
- avoid speculative causes
- avoid treatment recommendations
- focus only on observable data patterns
- explicitly acknowledge uncertainty
- MCP Integration The system was deployed on Render and exposed as an MCP server integrated with Prompt Opinion.
Challenges we ran into
One of the biggest challenges was realizing that LLMs alone were unreliable for longitudinal numerical reasoning. Early versions of the system produced inconsistent interpretations, especially for fluctuating lab patterns.
Another challenge was distinguishing between:
- true directional progression
- fluctuating instability
- isolated abnormal values We also encountered integration and deployment challenges while exposing the analysis engine through MCP and testing model behavior within Prompt Opinion.
Additionally, we had to carefully control prompt behavior to reduce:
- overconfident wording
- hallucinated symptoms
- unsupported diagnoses
- speculative clinical recommendations
Accomplishments that we're proud of
- Built a working MCP-integrated clinical AI tool deployed on Render and connected to Prompt Opinion.
- Designed a hybrid reasoning system that separates deterministic numerical analysis from language-model explanation.
- Successfully reduced hallucinated trend interpretation by preventing the LLM from independently inferring longitudinal patterns.
- Implemented structured detection for improving, worsening, stable, and fluctuating thyroid trends across multiple timepoints.
- Added an evidence-aware interpretation layer focused on cautious clinical language and uncertainty handling.
- Created a system that avoids unsupported diagnoses, speculative causes, and treatment recommendations.
- Successfully tested the system on multiple synthetic patient scenarios with distinct longitudinal patterns. Learned how to integrate MCP tooling, deployment workflows, and controlled prompt engineering into a healthcare-oriented AI application.
What we learned
This project reinforced that reliable clinical AI systems require hybrid architectures rather than relying entirely on generative models. We learned:
- deterministic reasoning improves reliability
- longitudinal analysis is significantly harder than single-report summarization
- prompt constraints are critical in healthcare-oriented AI systems
- separating reasoning from explanation reduces hallucination risk We also gained hands-on experience integrating MCP-based tooling with Prompt Opinion.
What's next for Evidence-Aware Thyroid Trend Analyzer
Future directions include:
- support for additional laboratory markers
- broader longitudinal patient analysis
- retrieval-backed evidence grounding
- structured confidence scoring
- integration with richer clinical datasets The long-term goal is to explore safer approaches for clinical AI systems that combine deterministic reasoning with language-model explainability.
Log in or sign up for Devpost to join the conversation.