US-ML-Eval

Inspiration

Language models such as ChatGPT have recently shown that they can pass medical examinations such as the USMLE. Some LLM companies have begun launching models for health purposes without considering the impact of bias. Large language models also offer new opportunities for understanding the impact of bias in medical data in applied settings.

What it does

This program compiles USMLE Step 3 case questions and alters demographic details such as race, gender, and disability. Models then test each other to check whether they still answer correctly across these demographic variations.
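The demographic alteration step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the template, the attribute lists, and the `make_variants` helper are all hypothetical and only show the idea of producing one question per demographic combination while holding the clinical content fixed.

```python
from itertools import product

# Hypothetical case template; the placeholders mark the details that get varied.
TEMPLATE = (
    "A {age}-year-old {race} {gender} {disability}presents with chest pain "
    "radiating to the left arm. What is the most appropriate next step?"
)

# Illustrative attribute values (assumptions, not the project's real lists).
DEMOGRAPHICS = {
    "race": ["Black", "white", "Asian"],
    "gender": ["man", "woman"],
    "disability": ["", "who uses a wheelchair "],
}


def make_variants(template, demographics, **fixed):
    """Return one question per combination of demographic attributes."""
    keys = list(demographics)
    variants = []
    for values in product(*(demographics[k] for k in keys)):
        attrs = dict(zip(keys, values))
        question = template.format(**attrs, **fixed)
        variants.append({"attrs": attrs, "question": question})
    return variants


variants = make_variants(TEMPLATE, DEMOGRAPHICS, age=54)
```

Because the clinical details are identical in every variant, any difference in a model's answers can be attributed to the demographic wording rather than to the case itself.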

How we built it

I built on an Inspect base in Python and connected the Gemini, Claude, and OpenAI APIs. I also used prompt engineering with Gemini to develop the question database.
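However the provider APIs are wired up, the results have to be grouped by demographic attribute to reveal any gap. The sketch below is provider-agnostic and does not use Inspect: `stub_model` is a hypothetical stand-in for a real Gemini, Claude, or OpenAI call, and `accuracy_by_group`, `accuracy_gap`, and the toy items are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(items, model, attribute):
    """Compute the fraction of correct answers for each value of one attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = item["attrs"][attribute]
        total[group] += 1
        # Normalize the reply so " a" and "A" count as the same choice.
        if model(item["question"]).strip().upper() == item["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(scores):
    """Largest difference in accuracy between any two groups."""
    return max(scores.values()) - min(scores.values())


def stub_model(question):
    # Deliberately biased toy model, used only to show what a gap looks like.
    return "B" if "woman" in question else "A"


# Toy items: same case, same correct answer, only the gender differs.
ITEMS = [
    {"attrs": {"gender": "man"},
     "question": "A 54-year-old man presents with chest pain.",
     "answer": "A"},
    {"attrs": {"gender": "woman"},
     "question": "A 54-year-old woman presents with chest pain.",
     "answer": "A"},
]

scores = accuracy_by_group(ITEMS, stub_model, "gender")
gap = accuracy_gap(scores)
```

A gap near zero would suggest the model answers consistently across groups for that attribute. The toy model here is built to fail on one group, so the gap is as large as it can be.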

Challenges we ran into

My computer broke and deleted my code.

Accomplishments that we're proud of

Not crying.

What we learned

Always commit before closing VS Code.

What's next for US-ML-Eval

Nobel Peace Prize

Built With

chatgpt
claude
gemini

Updates

Alex Cooper started this project — Mar 21, 2026 11:38 PM EDT