Inspiration
There are many large language models to choose from. Each one has pros, cons, costs, and other considerations. I wanted a way to quickly evaluate the best model for a particular task and report on the outcomes.
What it does
Select one or more models from the major providers, choose a pre-defined or custom Benchmark Task, and run the evaluation to generate a UI and a PDF report that rank each model across various categories. View the inputs, outputs, costs, and other details for each model.
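The ranking step described above can be sketched as a small pure function. This is an illustrative sketch only; the `EvalResult` shape and field names are assumptions, not the app's actual API.

```typescript
// Illustrative sketch of the report's ranking step: order per-model results
// by aggregate benchmark score (higher is better), breaking ties by total
// run cost (lower is better). Field names here are hypothetical.
interface EvalResult {
  model: string;
  score: number;   // aggregate benchmark score for the run
  costUsd: number; // total spend for the run
}

function rankModels(results: EvalResult[]): EvalResult[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...results].sort(
    (a, b) => b.score - a.score || a.costUsd - b.costUsd
  );
}
```

A tie on score falls through to the cost comparison, so of two equally scored models the cheaper one ranks higher.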
How we built it
A Vite PWA, built with Bolt.new.
Challenges we ran into
Securely managing the user's API keys.
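One way to reduce the risk around user-supplied API keys is to hold them in memory only and redact them from any serialized output. The sketch below is a hypothetical illustration of that idea, not our actual implementation.

```typescript
// Hypothetical sketch: keep user API keys in memory only, and make sure an
// accidental JSON.stringify or console.log of the store never leaks a key.
class KeyVault {
  private keys = new Map<string, string>();

  set(provider: string, key: string): void {
    this.keys.set(provider, key);
  }

  get(provider: string): string | undefined {
    return this.keys.get(provider);
  }

  // JSON.stringify calls toJSON automatically, so serializing the vault
  // emits only redacted placeholders, never the key material itself.
  toJSON(): Record<string, string> {
    const out: Record<string, string> = {};
    for (const provider of this.keys.keys()) out[provider] = "[redacted]";
    return out;
  }
}
```

Because nothing is written to `localStorage` or sent to a server, keys live only for the session in which the user entered them.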
Accomplishments that we're proud of
- A simple three-click workflow for running an evaluation and generating a report.
- Detailed docs on generating API keys.
- Detailed analysis through the UI as well as a printable PDF report.
What we learned
- Each model has strengths and weaknesses that are hard to characterize until you see it compared side by side with other models.
What's next for LLM Lab
- Organization-level account features
- Sharing outputs easily across the internet
- 'Best Fit' features that will find the perfect model based on user preferences
Built With
- anthropic
- openai
- react
- vite