LEA: Building a Clinical Reasoning Safety Assistant Solo
Why I Built This
This project started from a problem I find deeply compelling: some of the most dangerous failures in medicine are not caused by lack of knowledge, but by failures in reasoning.
A clinician can know the right diagnoses, the right workup, and the right treatment pathways, and still get pulled off course by cognitive bias. Anchoring bias, availability bias, confirmation bias, and sunk cost fallacy are all examples of reasoning failures that can quietly turn into real patient harm.
I wanted to build something that could act like a reasoning debugger for clinical notes: a system that reads a note, identifies when the thinking may be drifting, points to where it happens, and generates safer, more reflective feedback.
But this project is also personal.
I chose to do this hackathon alone.
Not because I think solo work is better. Honestly, I think strong teams are usually optimal. I love collaborating, and I know good teams can go further than any one person can on their own. But I have never done a hackathon before, and for a long time I carried a lot of imposter syndrome around even trying one. I kept feeling like I needed to be more prepared, more experienced, or more “qualified” before I had earned the right to build something ambitious in this environment.
So this time I decided to do the opposite.
I wanted to prove to myself that I could sit down, start from scratch, and build something hard. I wanted to see whether I could stay with a difficult technical problem long enough to turn it into a real system. More than anything, I wanted to prove to myself that I belonged in rooms like this.
That became just as important to me as the actual project.
What I Built
I built LEA, a Clinical Reasoning Safety Assistant.
LEA takes a clinical note and produces a structured reasoning review. It:
- detects likely cognitive bias
- identifies the riskiest reasoning in the note
- explains why that reasoning may be unsafe
- proposes a safer rewrite
- supports follow-up questions and human correction
From a machine learning perspective, LEA became a multi-stage hybrid system rather than a single model.
At a high level, the pipeline looks like this:
- A bias classifier predicts the likely reasoning pattern
- A sentence localizer identifies the riskiest sentence
- A generator produces structured feedback and a safer rewrite
- A feedback and evaluation loop logs usage and corrections for future improvement
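As a rough illustration of how these four stages could chain together, here is a minimal sketch. It is not LEA's actual code: the `Review` dataclass, the `run_pipeline` function, and the toy stage functions are all hypothetical names standing in for the trained models and the logging layer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    bias_label: str       # output of the bias classifier
    risky_sentence: str   # output of the sentence localizer
    feedback: str         # output of the generator


def run_pipeline(
    note: str,
    classify: Callable[[str], str],
    localize: Callable[[str, str], str],
    generate: Callable[[str, str, str], str],
    log: List[Review],
) -> Review:
    """Run the three model stages in order, then record the result."""
    label = classify(note)
    sentence = localize(note, label)
    feedback = generate(note, label, sentence)
    review = Review(label, sentence, feedback)
    log.append(review)  # stand-in for the feedback and evaluation loop
    return review


# Toy stand-ins (assumptions) so the sketch runs without any trained models.
def toy_classify(note: str) -> str:
    return "Anchoring Bias"


def toy_localize(note: str, label: str) -> str:
    return note.split(". ")[0]


def toy_generate(note: str, label: str, sentence: str) -> str:
    return f"Possible {label}: reconsider '{sentence}'"


usage_log: List[Review] = []
review = run_pipeline(
    "Chest pain is likely reflux. No ECG ordered.",
    toy_classify, toy_localize, toy_generate, usage_log,
)
```

Passing each stage in as a plain callable means a trained model and a heuristic fallback can be swapped without touching the rest of the pipeline.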
So the final product is not just an ML model. It is a full system with:
- training pipelines
- inference logic
- a backend API
- a frontend UI
- analytics and logging
- a candidate-model evaluation workflow
How I Built It
Data and Training Strategy
The system is trained on two fundamentally different types of data.
The first is a bias-classification corpus. I started from a small clinical vignette dataset, but it quickly became clear that it did not give me the class coverage or balance I needed for the production taxonomy I wanted LEA to use.
So I created my own curated supplemental examples to fill in missing categories and target failure modes I was seeing in evaluation. That ended up being a very important design decision. Generic medical NLP data was not enough; the classifier needed bias-specific supervision.
The second is a medical error correction corpus, drawn from MEDEC and related reviewed datasets. These contained:
- full clinical notes
- annotated error sentences
- corrected sentences
- corrected notes
I used these for two different tasks:
- sentence-level localization
- structured feedback generation
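To make the two uses concrete, here is a small sketch of how one annotated record could yield training examples for both tasks. The function names, the dictionary keys, and the toy note are assumptions for illustration; they are not the MEDEC schema or LEA's real preprocessing.

```python
from typing import Dict, List, Tuple


def localization_pairs(
    note_sentences: List[str], error_index: int
) -> List[Tuple[str, str, int]]:
    """Build (note, sentence, label) pairs: label 1 marks the annotated error sentence."""
    note = " ".join(note_sentences)
    return [
        (note, sentence, int(i == error_index))
        for i, sentence in enumerate(note_sentences)
    ]


def generation_example(
    note_sentences: List[str], error_index: int, corrected_sentence: str
) -> Dict[str, str]:
    """Build one input/target record for the feedback generator."""
    return {
        "note": " ".join(note_sentences),
        "error_sentence": note_sentences[error_index],
        "target": corrected_sentence,
    }


# Invented toy record, not taken from any dataset.
sentences = [
    "Patient reports crushing chest pain.",
    "Symptoms attributed to reflux.",
    "Discharged with antacids.",
]
pairs = localization_pairs(sentences, error_index=1)
example = generation_example(
    sentences, error_index=1,
    corrected_sentence="ECG and troponin ordered before attributing symptoms to reflux.",
)
```

Each note produces one positive pair and several negatives, so a localizer trained this way would likely need some handling of class imbalance.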
The Models
The first model is a fine-tuned Bio_ClinicalBERT classifier that predicts one of five labels:
- Anchoring Bias
- Availability Bias
- Confirmation Bias
- Sunk Cost Fallacy
- No Bias Detected
The second model is another Bio_ClinicalBERT model trained as a note-sentence pair classifier for localization.
The third model is a fine-tuned Qwen2.5-7B-Instruct model trained with LoRA to generate:
- rationale
- missing danger
- safer rewrite
- next-step guidance
Mathematically, the classifier begins with a standard softmax over its logits z(x), one logit per label:
$$ p(y \mid x) = \mathrm{softmax}(z(x)) $$
But one of the key lessons of this project was that the raw neural output was not enough. So the deployed system became a hybrid of learned models and system-level logic.
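The sketch below shows the softmax step and one possible kind of system-level adjustment layered on top of it: per-class logit offsets applied before normalization. The offsets, the example logits, and the `calibrated_probs` function are hypothetical; the writeup does not specify what form LEA's adjustment takes.

```python
import math
from typing import Dict, List

LABELS = [
    "Anchoring Bias",
    "Availability Bias",
    "Confirmation Bias",
    "Sunk Cost Fallacy",
    "No Bias Detected",
]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_probs(logits: List[float], offsets: Dict[str, float]) -> Dict[str, float]:
    """Add a per-class offset (assumed, hand-tuned) to each logit, then apply softmax."""
    adjusted = [z + offsets.get(label, 0.0) for z, label in zip(logits, LABELS)]
    return dict(zip(LABELS, softmax(adjusted)))


# Made-up logits where two classes are nearly tied.
logits = [2.0, 2.1, 0.5, 0.1, 1.0]
raw = calibrated_probs(logits, offsets={})
adjusted = calibrated_probs(logits, offsets={"Availability Bias": -0.3})

raw_prediction = max(raw, key=raw.get)
adjusted_prediction = max(adjusted, key=adjusted.get)
```

With these invented numbers, a small penalty on one over-predicted class is enough to flip a near-tie, which is the kind of structured correction a thin layer over the raw output can provide.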
What I Learned While Building the ML
The ML was not good at first.
Early on, the model made a lot of mistakes. It overpredicted some classes, confused others repeatedly, and sometimes gave outputs that were technically plausible but clearly wrong in context. In particular, I saw consistent confusion between categories like anchoring and availability bias, and I realized very quickly that I was not going to get a strong system just by training once and hoping.
That changed how I approached the project.
Instead of thinking, “train model, done,” I started thinking in terms of:
- dataset design
- calibration
- structured evaluation
- fallback logic
- postprocessing
- human correction loops
I built supplemental labeled examples by hand. I added a calibration layer on top of the classifier when I saw systematic confusion patterns. I created a candidate model workflow so I could train a new version, compare it to the baseline, and only switch if it actually improved. In one case, a retrained model improved some classes but made others worse, so I chose not to deploy it.
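A candidate-versus-baseline gate like the one described here can be sketched in a few lines. The promotion rule below (better macro average and no per-class regression) is one plausible policy, not necessarily the one LEA uses, and the scores are invented numbers rather than real results.

```python
from typing import Dict


def should_promote(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_regression: float = 0.0,
) -> bool:
    """Promote only if the macro average improves and no class drops by more than max_regression."""
    base_macro = sum(baseline.values()) / len(baseline)
    cand_macro = sum(candidate[c] for c in baseline) / len(baseline)
    no_regression = all(
        candidate[c] >= baseline[c] - max_regression for c in baseline
    )
    return cand_macro > base_macro and no_regression


# Hypothetical per-class F1 scores, for illustration only.
baseline_f1 = {"Anchoring Bias": 0.70, "Availability Bias": 0.60}
mixed_candidate = {"Anchoring Bias": 0.80, "Availability Bias": 0.55}
clean_candidate = {"Anchoring Bias": 0.75, "Availability Bias": 0.62}

promote_mixed = should_promote(baseline_f1, mixed_candidate)
promote_clean = should_promote(baseline_f1, clean_candidate)
```

The first candidate has a higher average but is rejected because one class got worse, which mirrors the "improved some classes but made others worse" case above.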
That felt like a really important moment for me. It made the project feel less like “I trained a model” and more like “I built an ML system.”
I also learned that strong ML systems are often hybrid systems. In LEA, the final product was not just:
- a classifier
- or an LLM
- or a chatbot
It became a layered pipeline with:
- trained models
- heuristic fallbacks
- output validation
- structured follow-up routing
- analytics and feedback
That was one of the biggest conceptual shifts in the whole build.
What Helped Me Get There
I was building alone, but I was definitely not building in a vacuum.
A lot of the conceptual ideas I explored came from working through problems with Claude, Gemini, and ChatGPT. Those tools were especially helpful when I needed to think through architectural tradeoffs, training strategy, evaluation ideas, or ways to frame the system more clearly.
And when I was deep in development and getting stuck in the code itself, Codex helped me a lot. It was especially useful for unblocking implementation issues, helping me reason through debugging steps, and moving faster when I was too tired to hold the whole codebase in my head at once.
That support mattered. A lot.
I still built the system, made the design decisions, evaluated the models, and decided what stayed or got thrown out. But using these tools thoughtfully became part of how I was able to keep pushing forward when the project became much more complex than I originally expected.
Challenges I Faced
This project was full of challenges.
1. Model performance
The first major challenge was that the initial classifier simply was not accurate enough. The errors were not random; they were structured. That meant I had to dig into the problem instead of just throwing more training at it.
2. Data mismatch
Not all medical datasets are useful for reasoning-bias detection. I had to learn quickly that adjacent medical NLP data is not the same thing as task-relevant supervision.
3. Deployment constraints
I also had to make real deployment decisions. For example, even though I trained a sentence localizer, the live system currently shows 2/3 models active because I chose to deploy a heuristic fallback for that middle stage instead of forcing a fragile artifact into production. That was a conscious choice: stability over pretending every experimental component was production-ready.
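For a sense of what a heuristic fallback for the localization stage might look like, here is a minimal sketch that scores each sentence by counting premature-closure cue phrases. The cue list and the scoring rule are assumptions made up for this example; the writeup does not describe the deployed heuristic.

```python
import re
from typing import List

# Hypothetical cue phrases that may signal overconfident or dismissive reasoning.
CUE_PATTERN = re.compile(
    r"\b(?:likely|probably|clearly|no need|must be)\b", re.IGNORECASE
)


def split_sentences(note: str) -> List[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def riskiest_sentence(note: str) -> str:
    """Return the sentence with the most cue phrases (first one wins ties)."""
    sentences = split_sentences(note)
    if not sentences:
        return ""
    return max(sentences, key=lambda s: len(CUE_PATTERN.findall(s)))


note = (
    "Patient has chest pain. "
    "This is probably reflux, no need for an ECG. "
    "Discharged home."
)
flagged = riskiest_sentence(note)
```

A rule like this is crude, but it is deterministic and cannot fail to load, which is the stability tradeoff described above.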
4. Generator reliability
The raw LLM outputs were not reliable enough on their own, so I had to add postprocessing and guardrails to keep the output from drifting into low-quality or benchmark-like nonsense.
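One common form of guardrail is to validate the generator's output against the expected structure before showing it. The sketch below assumes the generator is prompted to return JSON with four string fields matching the sections listed earlier; the field names and the `validate_generation` function are hypothetical, not LEA's actual output format.

```python
import json
from typing import Dict, Optional

# Assumed field names for the four generated sections.
REQUIRED_FIELDS = ("rationale", "missing_danger", "safer_rewrite", "next_step")


def validate_generation(raw: str) -> Optional[Dict[str, str]]:
    """Return the cleaned fields, or None if the output is malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    cleaned: Dict[str, str] = {}
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            return None  # missing, empty, or non-string section
        cleaned[field] = value.strip()
    return cleaned


good = json.dumps({
    "rationale": "Reflux was assumed before cardiac causes were excluded.",
    "missing_danger": "Acute coronary syndrome.",
    "safer_rewrite": "Chest pain of unclear cause; cardiac workup pending.",
    "next_step": "Obtain an ECG.",
})
bad = json.dumps({"rationale": "Looks fine."})

good_result = validate_generation(good)
bad_result = validate_generation(bad)
```

When validation returns `None`, the caller can retry the generation or fall back to a fixed template instead of surfacing a broken response.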
5. Building alone under time pressure
And, honestly, the biggest challenge was just the experience of building this entire thing alone for my first hackathon.
The Human Side of This
At the time I’m writing this, it’s about 2 AM on Sunday, with only about 8 hours left in the hackathon.
I have been awake for basically the entire event.
I took one 20-minute nap, and that was it.
At this point I feel completely exhausted, a little delirious, and all I really want to do is sleep. But I also feel something else very strongly: I loved making this project.
This was one of the hardest and most satisfying things I’ve ever built.
I had moments where I felt lost, moments where the model looked terrible, moments where the deployment broke, moments where I wondered whether I had overreached, and moments where I genuinely did not know if I could pull it together.
But I kept going.
And no matter what happens with the outcome of the hackathon, I am proud of myself.
I’m proud that I started. I’m proud that I stayed with it. I’m proud that I built something real. And I’m proud that I finally let myself do the kind of ambitious technical project I had been too intimidated to try before.
Final Reflection
LEA began as an idea about clinical reasoning, but it became something bigger for me.
Technically, it became a hybrid ML system that combines:
- discriminative models
- generative models
- calibration logic
- fallback design
- structured evaluation
- human feedback
Personally, it became proof that I could do this.
That I could take an idea that mattered to me, sit down alone, and push it all the way into a functioning product under pressure.
And whether or not it wins anything, that means a lot to me.