Inspiration
RL on LLMs is expensive, and sparse credit assignment makes it even more painful. But we kept coming back to one thing: when you train with an LLM-as-judge, you're already paying to grade thousands of rollouts. That money is gone either way. So why throw away the judge's reasoning after you read off the score? If you just write it down, the grading you already paid for becomes a free, labeled training set. That idea is gavel.
What it does
Every time a frontier judge grades a rollout against a rubric, gavel logs its full reasoning trace alongside the score. The grading you had to do anyway turns into a distillation dataset for free. We then SFT a small model on those traces, and out comes a grader that does the same job for a fraction of the cost and latency.
The nice part is that the judge is always an OpenAI-compatible endpoint, so the distilled grader is a drop-in replacement. Point the same client at the small model instead of the teacher and nothing else changes. Better yet, you can reuse it the next time anyone trains on that task.
The whole thing runs as a three-step pipeline, GRPO then SFT then Audit:
- GRPO: a strong LLM judges rollouts against a 0 to 9 rubric, and its trace and score get logged for free into the distillation set.
- SFT: distill those traces into a small, cheap grader.
- Audit: confirm the cheap grader actually reproduces the judge, and that it tracks real ground truth and not just the teacher's quirks.
How we built it
- Trainer: TRL / verl GRPO
- Model: Qwen3-1.7B with LoRA
- Judge: Qwen2.5-7B served with vLLM
- Compute: 8x H100
The core lives in a few files: grpo/grader.py is the rubric grader that logs the traces, grpo/data.py builds the verl parquet datasets, and grpo/launch.sh brings up the judge server and kicks off training. We ran our experiments on poly_easy and GSM8K.
Challenges we ran into
We wanted to use verl from the start, but it wouldn't let us do async rollouts, so we pivoted to TRL. TRL came with its own surprise: a garbage-collection bug where gradient checkpointing fought with KV cache hits, which meant we couldn't use the trainers as-is. We ended up writing a wrapper class around TRL so gradient checkpointing and KV caching could coexist.
Accomplishments that we're proud of
We took the sunk cost of LLM-as-judge grading and turned it into something reusable: a distilled grader that swaps in as a drop-in OpenAI-compatible endpoint and keeps the frontier judge's grading quality at a fraction of the cost and latency.
What we learned
https://x.com/allanzhangML/status/2068226931540279352
What's next for gavel
- A marketplace of verified autograders that people can reuse across tasks
- Going beyond
poly_easyand GSM8K to more datasets and benchmarks - A publication and a fully finished, open-source library
Built With
- python
- trl
Log in or sign up for Devpost to join the conversation.