gavel

Inspiration

RL on LLMs is expensive, and sparse credit assignment makes it even more painful. But we kept coming back to one thing: when you train with an LLM-as-judge, you're already paying to grade thousands of rollouts. That money is gone either way. So why throw away the judge's reasoning after you read off the score? If you just write it down, the grading you already paid for becomes a free, labeled training set. That idea is gavel.

What it does

Every time a frontier judge grades a rollout against a rubric, gavel logs its full reasoning trace alongside the score. The grading you had to do anyway turns into a distillation dataset for free. We then SFT a small model on those traces, and out comes a grader that does the same job for a fraction of the cost and latency.

The nice part is that the judge is always an OpenAI-compatible endpoint, so the distilled grader is a drop-in replacement. Point the same client at the small model instead of the teacher and nothing else changes. Better yet, you can reuse it the next time anyone trains on that task.

The whole thing runs as a three-step pipeline, GRPO then SFT then Audit:

GRPO: a strong LLM judges rollouts against a 0 to 9 rubric, and its trace and score get logged for free into the distillation set.
SFT: distill those traces into a small, cheap grader.
Audit: confirm the cheap grader actually reproduces the judge, and that it tracks real ground truth and not just the teacher's quirks.

How we built it

Trainer: TRL / verl GRPO
Model: Qwen3-1.7B with LoRA
Judge: Qwen2.5-7B served with vLLM
Compute: 8x H100

The core lives in a few files: grpo/grader.py is the rubric grader that logs the traces, grpo/data.py builds the verl parquet datasets, and grpo/launch.sh brings up the judge server and kicks off training. We ran our experiments on poly_easy and GSM8K.

Challenges we ran into

We wanted to use verl from the start, but it wouldn't let us do async rollouts, so we pivoted to TRL. TRL came with its own surprise: a garbage-collection bug where gradient checkpointing fought with KV cache hits, which meant we couldn't use the trainers as-is. We ended up writing a wrapper class around TRL so gradient checkpointing and KV caching could coexist.

Accomplishments that we're proud of

We took the sunk cost of LLM-as-judge grading and turned it into something reusable: a distilled grader that swaps in as a drop-in OpenAI-compatible endpoint and keeps the frontier judge's grading quality at a fraction of the cost and latency.