Inspiration

Frontier labs are buying more expert-generated post-training data, but the expensive question is: which data will actually make the model better after RL? Training on every candidate cohort works, but it does not scale.

We built xLift as a pre-training data scout: a way to estimate which cohorts are worth training on before spending compute. Our core belief is that the best data is not simply the hardest or most diverse. It should sit near the model’s learning frontier, expose recoverable failures, teach transferable lessons, and have a verifier that is safe to optimize.

We were also inspired by GEPA-style prompt optimization. If a model can extract a reusable lesson from one split and apply it to held-out tasks, that may indicate real learning signal. We treated GEPA-transfer as one promising signal, while keeping the main experiment focused on post-GRPO lift and reward integrity.

What it does

xLift predicts post-training lift before training.

For each cohort, xLift computes cheap pre-training signals:

[ xLift(D) = \text{Frontier Signal} + \text{Transfer Signal} + \text{Reward Integrity} + \text{Coverage Signal} - \text{Efficiency Cost} ]

Key signals include:

  • Frontier Score: does the cohort create mixed success/failure outcomes where RL has something to reinforce?
  • Repair Gain: does feedback turn failures into correct answers?
  • GEPA Transfer: do lessons from one split improve held-out tasks?
  • Reward Trust: does the verifier reward real correctness, or can it be hacked?
  • Coverage / Redundancy: does the cohort teach broad patterns or repeat templates?

The output is a data-scout recommendation: train, skip, diversify, fix the verifier, or build curriculum.

How we built it

We used Qwen2.5-1.5B-Instruct and ran GRPO across controlled cohorts: easy, frontier, hard, mixed, and weak-verifier. Before training, we computed xLift signals from model rollouts, verifier scores, repair attempts, and a GEPA-style transfer prototype.

Then we trained on each cohort using the same GRPO setup and measured:

[ \text{lift} = \text{accuracy}{after} - \text{accuracy}{before} ]

We compared xLift predictions against actual post-RL lift and placed the metrics on a cost-quality Pareto frontier.

Challenges we ran into

The biggest challenge was experimental power. Running multiple GRPO jobs, evaluating lift, and getting tight confidence intervals is hard in 24 hours. Some frontier and transfer metrics were directionally promising, but the middle cohorts had wide confidence intervals, so we could not honestly claim we perfectly ranked them.

GEPA-transfer was also useful as a prototype signal, but it needs larger-scale validation. We also had to avoid overloading the project with too many metrics and focus on the signals most tied to post-RL lift.

Accomplishments that we're proud of

We built an end-to-end pipeline that computes pre-training signals, runs GRPO per cohort, evaluates lift, and compares predictions against actual model improvement.

Our clearest result was the weak-verifier cohort: training reward increased, but true accuracy decreased. The model learned to exploit the reward rather than improve the underlying skill.

xLift correctly flagged that cohort as the lowest-scoring one before training. That is exactly the kind of failure a data-scouting layer should catch.

What we learned

The biggest lesson is that task quality and reward quality are separate. A cohort can look learnable, but if the verifier is gameable, RL can push the model in the wrong direction.

We also learned that useful post-training data should be frontier-level, repairable, transferable, and reward-safe. Our reward-trust result was strongest; the frontier-ranking results need more cohorts, more rollouts, and larger eval sets.

What's next for xLift

Next, we would scale xLift to more cohorts, larger held-out evals, and code benchmarks. We would also improve verifier red-teaming, especially for coding tasks where hardcoded or test-specific solutions can pass weak graders.

We also want to expand GEPA-transfer: if a model can extract a useful lesson from a dataset, those lessons could help generate better synthetic variants, rewrite weak tasks, improve rubrics, and build curriculum examples.

Longer term, xLift could become a data-scouting layer for expert data marketplaces like Mercor: helping decide which expert-generated tasks are worth buying, which verifiers need cleanup, and which cohorts are ready for training.

Built With

  • anthropic-claude-api
  • bootstrap-confidence-intervals
  • grpo
  • hugging-face-transformers
  • matplotlib
  • numpy
  • pandas
  • pytorch
  • qwen2.5-1.5b-instruct
  • verifier-based-evaluation
Share this project:

Updates