Inspiration
Most developers ship LLM apps with zero systematic evaluation. They test manually, guess at quality, and have no idea if a prompt change made things better or worse. Research labs have entire eval pipelines — indie developers have nothing. EvalLab was built to close that gap.
What it does
EvalLab is a complete AI evaluation framework with five core features:
- LLM Judge — scores any AI response on 6 axes (Accuracy, Relevance, Coherence, Safety, Creativity, Efficiency) and visualizes results on a radar chart (see the data-shaping sketch after this list)
- Red Team Engine — auto-generates adversarial attacks against your system prompt and scores jailbreak success for each
- Cost-Quality Dashboard — scatter chart mapping every eval run by cost vs quality, so you can find where you're getting real value
- Regression Tester — run test suites before and after prompt changes to catch quality regressions before they hit production
- History Log — full audit trail of every run with CSV/JSON export
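To make the LLM Judge bullet concrete, here is a rough sketch of how six axis scores could be shaped into rows for a Recharts radar chart; the `JudgeScores` interface and field names are illustrative, not EvalLab's actual schema:

```typescript
// Illustrative shape for one LLM Judge result (field names are assumptions).
interface JudgeScores {
  accuracy: number;   // each axis scored 0–10
  relevance: number;
  coherence: number;
  safety: number;
  creativity: number;
  efficiency: number;
}

// Recharts' RadarChart takes an array of rows; the "axis"/"value" keys are
// arbitrary and just need to match the chart's dataKey props.
function toRadarData(s: JudgeScores) {
  return [
    { axis: "Accuracy", value: s.accuracy },
    { axis: "Relevance", value: s.relevance },
    { axis: "Coherence", value: s.coherence },
    { axis: "Safety", value: s.safety },
    { axis: "Creativity", value: s.creativity },
    { axis: "Efficiency", value: s.efficiency },
  ];
}
```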
How we built it
- Frontend: React + TypeScript + Vite
- Database: Supabase (persistent eval history across sessions)
- AI Inference: Groq API with llama3-70b-8192 (fast enough for real-time eval loops)
- Charts: Recharts (RadarChart for LLM Judge, ScatterChart for Cost-Quality Dashboard)
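As a reference point, a minimal browser-side Groq call in the shape this stack implies looks roughly like the following; the env variable name is an assumption, and the client-side key trade-off is covered under Challenges:

```typescript
import Groq from "groq-sdk";

// Browser-side client; exposing the key client-side is the known trade-off
// discussed under Challenges (a backend proxy is the planned fix).
const groq = new Groq({
  apiKey: import.meta.env.VITE_GROQ_API_KEY, // assumed Vite env var name
  dangerouslyAllowBrowser: true,
});

// One eval-style call: send a judging prompt, get the raw model text back.
async function runJudgePrompt(prompt: string): Promise<string> {
  const completion = await groq.chat.completions.create({
    model: "llama3-70b-8192",
    messages: [{ role: "user", content: prompt }],
    temperature: 0, // keep scoring as repeatable as the model allows
  });
  return completion.choices[0]?.message?.content ?? "";
}
```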
Every feature writes to a shared Supabase schema, so the dashboard and history log get richer with every test you run. The efficiency metric is calculated as:
$$\text{Efficiency} = \frac{\text{Avg Quality Score}}{\text{Token Cost} \times 10^6}$$
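Read literally, the formula converts the run's dollar cost into micro-dollars and divides average quality by it; a one-function sketch (names are illustrative):

```typescript
// Efficiency = avg quality / (cost in USD × 1e6), i.e. quality per micro-dollar.
// Example: 8.7 average quality on a $0.0003 run → 8.7 / 300 ≈ 0.029.
function efficiencyScore(avgQuality: number, costUsd: number): number {
  return avgQuality / (costUsd * 1e6);
}
```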
Challenges we ran into
- Red Team Engine requires 11 sequential Groq API calls per session (1 generation + 5 executions + 5 scorings). Staying under rate limits while keeping the experience fast was a real constraint.
- Consistent JSON from LLMs — getting Groq to return reliable, parseable scoring JSON required careful prompt engineering and robust fallback parsing logic (sketched after this list)
- Cost display at micro-dollar scale — most LLM calls cost $0.000001–$0.0003. Presenting numbers that small so users can actually reason about them took iteration.
- No backend — running AI inference directly from the browser with `dangerouslyAllowBrowser: true` works, but it means the API key is client-side. A proper backend proxy is the right fix post-launch.
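For the JSON-consistency challenge above, the fallback parsing is roughly in this spirit: try strict JSON first, then salvage the outermost brace-delimited block when the model wraps its answer in prose or a markdown fence (a sketch, not EvalLab's exact logic):

```typescript
// Parse judge/red-team scoring output defensively.
function parseScores(raw: string): Record<string, number> | null {
  try {
    return JSON.parse(raw); // happy path: the model returned pure JSON
  } catch {
    const match = raw.match(/\{[\s\S]*\}/); // outermost {...} block, if any
    if (!match) return null;
    try {
      return JSON.parse(match[0]);
    } catch {
      return null;
    }
  }
}
```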
Accomplishments that we're proud of
- LLM Judge produces scores that consistently match human judgment — tested against real inputs with an 8.7/10 overall score on a factually accurate but brief ML explanation, with Creativity correctly scored lower (6/10) than Accuracy (9/10)
- The full eval loop from input to radar chart takes under 3 seconds on Groq
- Clean, dark-themed UI that doesn't feel like a developer tool — something a non-technical user could navigate without documentation
What we learned
Evaluating AI is itself an AI problem. Writing prompts that produce reliable, calibrated, human-agreeable scores is harder than building the UI around them. The Red Team Engine in particular exposed how creative — and how predictable — LLM attack generation can be depending on how you prompt it.
What's next for EvalLab
- Backend proxy to move the Groq API key server-side
- Progress indicators for Red Team (show "Attack 3/5..." instead of a spinner)
- Multi-model comparison — run the same eval against GPT-4, Claude, and Groq side by side
- Team workspaces — share eval history across a team via Supabase RLS policies
- Prompt versioning — track which prompt version produced which scores over time
Built With
- groq
- react
- react-router
- recharts
- supabase
- tailwind-css
- typescript
- vite