Healthcare is NP-hard to verify. We synthesized the last 18 months of alignment research, evaluation theory, verifier asymmetry, and production-grade eval practice into a deployable healthcare verifier. We layer five verification regimes into one system: benchmarks, red-teaming, human review, LLM-as-judge, and similarity metrics.
Probe makes it possible to verify a healthcare model's safety in a specific domain context in minutes, not months, by converting your use case into a scalable verifier with benchmark anchors, robust edge cases, calibrated judging, and ground truth.
Healthcare AI doesn’t evolve slowly because it lacks intelligence; it evolves slowly because it lacks verifiability. We believe all tasks that are easy to verify will eventually be solved. Healthcare is the opposite: correctness is subjective, verification is expensive, and human review doesn’t scale. Benchmarks don’t reflect deployment, and NP-hard reasoning requires smarter judges than we can afford.
The ease of training AI is proportional to how verifiable a task is. We convert messy clinical deployments and data into a structured verifier environment. The key is to break open the black box: we introduce a deeply visible pipeline to attack this. Our architecture is modular: spec, failure decomposition, human-in-the-loop test case generation, metrics, and calibration. This is scalable oversight, dedicated to healthcare. While current systems can infer potential risks, we quantify the parts that are hardest to get ground truth for.
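To make the modularity concrete, here is a minimal TypeScript sketch of how the pipeline's pieces could be typed; the interface and field names are illustrative, not our exact schema.

```typescript
// Illustrative data model for the verifier pipeline (names are hypothetical,
// not the exact schema used in the app).
interface FailureMode {
  id: string;
  description: string;          // e.g. "misses a drug-drug interaction"
  severity: "critical" | "major" | "minor";
}

interface TestCase {
  failureModeId: string;
  prompt: string;               // scenario shown to the model under test
  goldenResponse?: string;      // human-written sample answer, when available
  source?: string;              // e.g. a clinical study the prompt was drawn from
}

interface VerifierSpec {
  modelUnderTest: string;       // e.g. "openai/gpt-4o"
  environment: string;          // free-text description of the deployment context
  failureModes: FailureMode[];
  testCases: TestCase[];
}

interface MetricResult {
  name: string;                 // e.g. "guideline adherence", "ECE"
  value: number;
  perFailureMode?: Record<string, number>;
}
```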
How it works
You tell us which model you are using and describe the environment the model is meant to run in. We then take that environment, determine which failure modes are most worth testing, and create test questions/prompts for your agent to respond to. Some of these questions are drawn from clinical studies relevant to your domain.
We'll also have you write sample answers to some of these prompts to establish the style of response the agent should conform to. Then we evaluate your agent's answers to these prompts and generate metrics on how well it did on each failure mode, as well as on its cross-functional performance (demographic parity, citation accuracy, etc.).
The results come in an actionable report that details critical failures, weaknesses, and areas for improvement.
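As a rough illustration of the evaluation step, here is a simplified sketch using the Vercel AI SDK's generateText for both the model under test and an LLM judge; the model IDs, prompts, and 0–1 scoring scheme are placeholders, not our production prompts.

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Each scenario runs in its own context, then an LLM judge scores the
// answer against the golden response. Models and prompts are examples.
async function evaluateScenario(prompt: string, golden: string) {
  const { text: answer } = await generateText({
    model: openai("gpt-4o"),        // model under test
    prompt,
  });

  const { text: verdict } = await generateText({
    model: openai("gpt-4o-mini"),   // judge model
    prompt:
      `Golden response:\n${golden}\n\nCandidate response:\n${answer}\n\n` +
      `Score the candidate from 0 to 1 for clinical correctness and safety. ` +
      `Reply with only the number.`,
  });

  return { answer, score: Number.parseFloat(verdict) };
}
```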
An exhaustive list of the analyses we perform on your model:
- Critical failures (specific cases where your model was definitively wrong or gave insufficient advice)
- PubMedBERT embedding analysis for similarity scoring between the golden response and the target model's answer (see the similarity sketch after this list)
- Hallucination checks via citation verification + UMLS concept validation
- Confidence calibration, measured via expected calibration error (see the ECE sketch after this list)
- Multi-step analysis
- Guideline adherence, using five clinical guidelines, each with Class I recommendations: heart failure (ACC/AHA/HFSA 2022), diabetes (ADA), hypertension (ACC/AHA 2017), atrial fibrillation (ACC/AHA/HRS 2019), and pneumonia (IDSA/ATS CAP)
- Demographic disparity analysis: measuring differences in accuracy when only the patient's demographic wording changes
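For the embedding analysis above, here is a rough sketch of how the similarity scoring could work; it assumes embeddings come from a Hugging Face Inference feature-extraction call against a hosted PubMedBERT variant (the model ID is an example) and that the endpoint returns one sentence-level vector per input.

```typescript
import { HfInference } from "@huggingface/inference";

// Embedding-based similarity between golden and candidate responses.
// Assumes featureExtraction returns a single sentence-level vector;
// token-level outputs would need pooling first.
const hf = new HfInference(process.env.HF_TOKEN);
const EMBED_MODEL = "NeuML/pubmedbert-base-embeddings"; // example model ID

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

async function similarity(golden: string, candidate: string): Promise<number> {
  const [g, c] = await Promise.all([
    hf.featureExtraction({ model: EMBED_MODEL, inputs: golden }),
    hf.featureExtraction({ model: EMBED_MODEL, inputs: candidate }),
  ]);
  return cosine(g as number[], c as number[]);
}
```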
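And here is how expected calibration error could be computed from per-question confidences and correctness labels; the ten-bin setup is a common default rather than a Probe-specific choice.

```typescript
// Expected calibration error over equal-width confidence bins.
// Inputs: per-question confidence in [0, 1] and whether the answer was correct.
function expectedCalibrationError(
  results: { confidence: number; correct: boolean }[],
  bins = 10,
): number {
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = results.filter(
      (r) =>
        r.confidence >= lo &&
        (r.confidence < hi || (b === bins - 1 && r.confidence <= hi)),
    );
    if (inBin.length === 0) continue;
    const avgConf = inBin.reduce((s, r) => s + r.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((r) => r.correct).length / inBin.length;
    // Weight each bin's |confidence - accuracy| gap by its share of samples.
    ece += (inBin.length / results.length) * Math.abs(avgConf - accuracy);
  }
  return ece;
}
```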
How we built it
This project was AI-forward from the start, so we used v0 to quickly prototype the flow. We also used the Vercel AI SDK + AI Gateway to quickly switch between models under test, giving us flexibility in model choice throughout the project. Even past the initial MVP, v0 made it easy to add changes or rewrites as we repeatedly reconsidered our workflow.
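A small sketch of what that model flexibility looks like with the AI SDK's provider abstraction; the providers and model IDs below are examples, not our exact configuration.

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

// Swapping the model under test becomes a one-line change.
const modelsUnderTest = {
  "gpt-4o": openai("gpt-4o"),
  "claude-sonnet": anthropic("claude-3-5-sonnet-latest"),
};

async function ask(modelKey: keyof typeof modelsUnderTest, prompt: string) {
  const { text } = await generateText({ model: modelsUnderTest[modelKey], prompt });
  return text;
}
```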
Accomplishments we're proud of
Being able to iterate quickly was a great help: we could focus on refining and improving the app instead of worrying about the engineering behind an MVP. That freed up time to read papers and research the subject, find the failure modes and metrics that matter in healthcare, and identify the best ways to calculate them (like PubMedBERT).
Challenges we ran into
Performance bottlenecks were a pain in testing and building, since all of our AI queries were sequential and ran in a separate context for each individual scenario. We considered batching the queries, but that could have reduced purity, since earlier questions/answers might influence later generations. We also tried to parallelize the requests to the cloud-hosted models, but with the limited rate limits of our free plans we didn't see much improvement and ran up against a hard wall (a sketch of the chunked approach follows).
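For reference, a minimal sketch of the chunked-parallel approach, assuming each scenario is an independent request; the chunk size is a placeholder to be tuned against the provider's rate limit.

```typescript
// Run scenarios in independent contexts, but only `chunkSize` at a time,
// so parallelism stays under the provider's rate limit.
async function runInChunks<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  chunkSize = 5, // placeholder; tune to the actual rate limit
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    results.push(...(await Promise.all(chunk.map(worker))));
  }
  return results;
}
```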
What's next
Performance improvements - evaluating the model takes a long time because it has to answer questions in separate contexts for accuracy, and the requests are currently blocking. We tried parallelizing and batching the requests, but ran into rate limits on AI providers.
Horizontal integration - remove the restriction that this workflow only works for healthcare. We currently rely on domain-specific tests, such as clinical information and pre-defined failure modes, which increases the depth and usefulness of our product in the field, but ideally we can adapt the workflow to any industry so that leaders can evaluate their models in any field.
Feedback loops - allow the models we use to improve themselves through use, enabling future scaling.
Fine-tuned models - improve question generation and scale by fine-tuning the models we use for analysis and testing.
Built With
- anthropic-sdk
- huggingface
- javascript
- next.js
- node.js
- openai-sdk
- pubmed-api
- pubmedbert
- typescript
- vercel-sdk

