TAP — Predicting Which Data Will Improve Your Model, Before You Train

TAP (Trajectory Advantage Predictor) is a cheap, self-improving model that learns f(GRPO signals) → predicted post-RL lift. We present a new model architecture that predicts whether a given RL dataset will improve an LLM without ever training on it. Our TAP architecture beats the current SOTA (influence function methods) in terms of both cost and performance.

Inference-Time Compute Hackathon (Applied AI Track). Built on Qwen2.5-Math-1.5B and Qwen3-1.7B with GRPO/RLVR across math, code, science, and general knowledge.

For an interactive exploration of our training results, visit: this link. The presentation slides can be found here.


Vision & Problem Fit

Frontier labs are buying huge amounts of RL data from domain experts. Most of it, however, is noisy and does not measurably improve model performance. We therefore wanted to find out whether high-signal data can be filtered out in advance, without running the full RL train + eval loop. That would avoid burning GPUs and hours on data that may not help at all.

That makes data selection the real bottleneck of RLVR, and it's exactly the hackathon's question: what task- and dataset-level metrics correlate with a model's post-RL performance gain, why, and where do they sit on the cost–quality Pareto frontier?

Our research introduces a new training architecture, Trajectory Advantage Predictor (TAP), which is able to predict training results without actually training the model. Instead of training to find out, we predict the lift from signals we can read off a handful of rollouts the policy already produces. The unit of prediction is a cohort (a small dataset), the target is the measured post-RL lift, and the predictor is cheap enough to run on every candidate batch. It directly answers "will this data make the model better, and by how much?" before you pay for it.


Technical Execution

We built the whole pipeline end-to-end, and the core of it works as follows:

  1. Post-train (GRPO) a battery of cohorts across subjects and measure the true lift on a held-out probe.
  2. Extract 23 cheap features (derived via testing and from existing literature) from the same rollouts. A forward pass is significantly cheaper than a full backward pass, which lets us extract features that encode model internals at low cost.
  3. Train a small gradient-boosted model f(features) → lift (ridge benchmarked as a baseline).
  4. Validate with leave-one-cohort-out, leave-one-domain-out, and decision/selection metrics.
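Steps 3 and 4 can be sketched with the ridge baseline and leave-one-cohort-out validation. This is a minimal standard-library illustration, not the project's code: the two-feature rows, the regularization strength `lam`, and every function name are assumptions, and the toy numbers at the bottom are synthetic.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ridge(X, y, lam=1.0):
    """Ridge regression on centered features, so the intercept is not penalized."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(d)]
    mean_y = sum(y) / n
    Xc = [[row[j] - mean_x[j] for j in range(d)] for row in X]
    yc = [v - mean_y for v in y]
    # Normal equations: (Xc^T Xc + lam * I) w = Xc^T yc
    A = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(d)]
    w = solve(A, rhs)
    bias = mean_y - sum(w[j] * mean_x[j] for j in range(d))
    return w, bias


def predict(model, row):
    w, bias = model
    return bias + sum(wj * xj for wj, xj in zip(w, row))


def leave_one_cohort_out(X, y, lam=1.0):
    """Predict each cohort's lift from a model fit on all the other cohorts."""
    preds = []
    for i in range(len(X)):
        model = fit_ridge(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        preds.append(predict(model, X[i]))
    return preds


# Synthetic toy data: one row of [adv_std, frac_nondegenerate] per cohort.
toy_features = [[0.10, 0.2], [0.25, 0.5], [0.40, 0.9], [0.05, 0.1], [0.30, 0.6]]
toy_lift = [0.01, 0.03, 0.06, 0.00, 0.04]
held_out_predictions = leave_one_cohort_out(toy_features, toy_lift, lam=0.1)
```

Every prediction in `held_out_predictions` comes from a model that never saw that cohort's label, which is what makes a predicted-vs-actual comparison a held-out one. Leave-one-domain-out follows the same pattern with whole subjects removed instead of single cohorts.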

What actually works (measured):

  • Real multi-subject study: Qwen3-1.7B across 4 subjects (science, code, math, general knowledge) — 320 cohorts, 960 labels.
  • Predicted vs. actual lift: pooled Spearman ρ ≈ 0.78, Pearson 0.72 (leave-one-out); shown two datasets, the model picks the better one ~78% of the time.
  • Selection (the decision that matters): pick the top 10% of datasets by our score → +68% more improvement than random, capturing ~70% of a perfect, noise-free oracle — and it beats picking by the noisy measured lift itself.
  • Near the measurement ceiling: we're at ~97% of what 3-seed labels physically permit (ceiling ≈ 0.80); the gap is label noise, not the predictor.
  • A working closed loop: an online predictor re-scores fresh cohorts against the current model each step and trains the winner. Over 8 cumulative GRPO steps, NLL changes by −0.539 nats with online selection (improves) vs. +0.331 with random selection (worsens), a +0.869 nats advantage. The predictor learns online: it cold-starts as the advantage-spread heuristic, and ridge engages by step 4.
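The ranking and selection metrics in the list above can be computed along these lines. This is a sketch under assumptions: the write-up does not spell out its exact definitions, so "picks the better one" is read here as pairwise ordering accuracy and "+68% more improvement than random" as the relative gain of the top-scored fraction over the mean lift of all cohorts. All function names are illustrative.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def pairwise_accuracy(pred, actual):
    """Fraction of cohort pairs (with distinct actual lift) that pred orders correctly."""
    correct = total = 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if actual[i] == actual[j]:
                continue
            total += 1
            if (pred[i] - pred[j]) * (actual[i] - actual[j]) > 0:
                correct += 1
    return correct / total


def top_fraction_gain(pred, actual, fraction=0.10):
    """Relative gain of the top-scored fraction over a random pick (the overall mean).

    Assumes the mean actual lift is positive; otherwise the ratio is not meaningful.
    """
    k = max(1, round(len(pred) * fraction))
    top = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    selected = sum(actual[i] for i in top) / k
    baseline = sum(actual) / len(actual)
    return selected / baseline - 1.0
```

For held-out numbers, `pred` should come from leave-one-out predictions rather than in-sample fits.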

Novelty & Insight

The key insight that allowed us to beat SOTA is that the most expensive part of a train-eval loop is the backward pass. By performing cheap forward passes, we can extract features that encode what the model has learned, and train a smaller model to learn an internal representation of the RL process itself.

Furthermore, we also discovered:

  • The cheapest signal is the best one. The dominant predictor of post-RL lift is advantage spread / disagreement (adv_std, frac_nondegenerate), which encodes how much GRPO learning signal a cohort carries.
  • One universal law across subjects. TAP generalizes across subjects and should therefore scale well: the relationship between the signal and post-RL lift holds across math, code, science, and general knowledge with nearly the same slope. Because the signal needs no per-subject calibration, the predictor transfers to an unseen subject zero-shot (gap ≈ 0) and still selects data +118% better than random in held-out domains.
  • The cheap proxy out-selects the expensive ground truth. Ranking cohorts by our signal beats ranking by their own measured 2–3-seed lift — because the feature carries less noise per observation than the outcome it predicts.
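The two named signals can be read directly off rollout rewards. The sketch below assumes definitions the write-up does not state explicitly: adv_std is taken as the per-prompt standard deviation of mean-centered rewards (the unnormalized GRPO group advantage), averaged over the cohort, and frac_nondegenerate as the share of prompts whose rollouts do not all receive the same reward. The function name and input layout are illustrative.

```python
from statistics import mean, pstdev


def cohort_signals(reward_groups):
    """reward_groups holds one list of rollout rewards per prompt in the cohort."""
    # Centering rewards on the group mean does not change their spread, so the
    # std of the (unnormalized) advantages equals the std of the raw rewards.
    spreads = [pstdev(group) for group in reward_groups]
    # A group where every rollout gets the same reward has zero advantage
    # everywhere and contributes no GRPO gradient.
    nondegenerate = [len(set(group)) > 1 for group in reward_groups]
    return {
        "adv_std": mean(spreads),
        "frac_nondegenerate": sum(nondegenerate) / len(reward_groups),
    }


# Three prompts with 8 rollouts each and binary verifier rewards (made-up values).
example = cohort_signals([
    [1, 1, 1, 1, 1, 1, 1, 1],  # always solved: degenerate
    [0, 0, 0, 0, 0, 0, 0, 0],  # never solved: degenerate
    [1, 0, 1, 0, 1, 0, 1, 0],  # mixed outcomes: carries learning signal
])
```

Under these definitions, a cohort of prompts the policy always or never solves scores zero on both signals, which is the intuition behind using disagreement as a proxy for learnability.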

Relative to prior work (TuneAhead for SFT, datamodels, influence functions, RHO-loss/learnability, online difficulty filtering), ours is the RL version: it predicts post-RL lift from in-the-loop GRPO signals, transfers across subjects, and is inference-only — versus influence functions, which need training-scale gradient/Hessian machinery for a comparable signal. Same predictive power, a fraction of the cost — a strict Pareto win on cost.


Impact & Trajectory

If this works at full scale, it changes how RL post-training spends compute:

  • Save GPUs: stop training on data that won't help; route compute to high-lift cohorts only.
  • Buy smarter: score a vendor's dataset before paying to train on it — directly useful for the data marketplaces driving frontier post-training.
  • Self-sharpening RL loop: a drop-in scorer for any RLVR stack that gets better the more it's used (every real run is a new training example).
  • Continual RL on real-world data: cheaply flag which of your users' live traces are high-lift and train on only those — improving from real use while preventing loss of plasticity.
  • Composes to any batch size: rank at fine (cohort) granularity where the predictor is calibrated, then pool the top-ranked cohorts into production-scale batches.

Trajectory from here: more seeds push ranking from 0.78 → ~0.89 (10 seeds) toward a ~0.95 ceiling; validate at production batch sizes; extend from MCQ/short-answer to long generative math/code; and add gradient-aware features for the final fidelity.
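One standard way to reason about how extra seeds raise the achievable correlation is the classical attenuation model. The source does not say which model produced its projections, so this is only an illustrative assumption. It treats each seed's measured lift as true lift plus independent noise, uses the Spearman-Brown formula for the reliability of a k-seed average, and takes the correlation ceiling as the square root of that reliability.

```python
def reliability_of_mean(single_seed_reliability, k):
    """Spearman-Brown: reliability of the average of k equally noisy seeds."""
    r = single_seed_reliability
    return k * r / (1 + (k - 1) * r)


def single_seed_reliability(ceiling_at_k, k):
    """Invert Spearman-Brown from an observed correlation ceiling at k seeds."""
    rel_k = ceiling_at_k ** 2
    return rel_k / (k - (k - 1) * rel_k)


def ceiling(single_seed_rel, k):
    """Best correlation a perfect predictor could reach against k-seed labels."""
    return reliability_of_mean(single_seed_rel, k) ** 0.5


# Calibrate on the 3-seed ceiling of about 0.80 quoted earlier, then project.
r1 = single_seed_reliability(0.80, 3)
ceiling_10_seeds = ceiling(r1, 10)
# Assumes the predictor stays at the same 97% of the ceiling it reaches today.
projected_rho_10_seeds = 0.97 * ceiling_10_seeds
```

Under these assumptions the 10-seed ceiling comes out near 0.92 and the projected ranking near 0.90, in the same range as the ~0.89 quoted above. The sketch does not attempt to reproduce the ~0.95 figure.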


Presentation & Demo

What we built, in one line: a predictor that maps cheap GRPO signals to predicted post-RL lift, plus a working RL loop that uses it to steer training.

Does it work? Yes — and we can show exactly to what extent:

  • On held-out cohorts across 4 subjects, predicted vs. actual lift is ρ ≈ 0.78 (a heatmap of all 320 cohorts shows the density hugging the diagonal).
  • Used to select data, it delivers +68% over random (top 10%) and ~70% of a noise-free oracle.
  • In a live loop, it drives held-out loss down (−0.539 nats) while random selection drifts up (+0.331).
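The live loop can be sketched as follows. This is a simplified illustration rather than the prototype itself: it uses a single feature (adv_std) with a closed-form one-dimensional ridge fit, and `OnlineSelector`, `run_loop`, `sample_candidates`, and `train_and_measure` are hypothetical names. With `min_history=3` the selector uses the raw heuristic for the first three steps and the fitted model from the fourth step on.

```python
class OnlineSelector:
    """Scores cohorts by adv_std at cold start, then by a ridge fit on past runs."""

    def __init__(self, min_history=3, lam=1.0):
        self.history = []  # (adv_std, measured_lift) pairs from real training steps
        self.min_history = min_history
        self.lam = lam

    def _fit(self):
        xs = [x for x, _ in self.history]
        ys = [y for _, y in self.history]
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in self.history)
        slope = sxy / (sxx + self.lam)  # lam > 0 keeps this finite
        return slope, mean_y - slope * mean_x

    def score(self, adv_std):
        if len(self.history) < self.min_history:
            return adv_std  # cold start: raw advantage-spread heuristic
        slope, intercept = self._fit()
        return intercept + slope * adv_std

    def pick(self, candidates):
        """candidates maps cohort name -> adv_std measured under the current policy."""
        return max(candidates, key=lambda name: self.score(candidates[name]))

    def update(self, adv_std, measured_lift):
        self.history.append((adv_std, measured_lift))


def run_loop(selector, sample_candidates, train_and_measure, steps=8):
    """Each step: re-score fresh cohorts, train on the winner, learn from the result."""
    chosen = []
    for step in range(steps):
        candidates = sample_candidates(step)  # rollouts from the current policy
        name = selector.pick(candidates)
        lift = train_and_measure(name)        # one GRPO step plus held-out eval
        selector.update(candidates[name], lift)
        chosen.append(name)
    return chosen
```

Because every real training step appends a new (feature, lift) pair, the scorer keeps refining itself as the loop runs. A production version would likely add exploration so the history does not collapse onto one kind of cohort.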

The deck walks the full story: the problem → the idea (f(GRPO signals) → lift) → our process → a concrete worked example (a cohort, its 8 rollouts, and the predicted vs. actual lift) → results → generalization to unseen subjects → the causal clincher → the cost–quality frontier vs. SOTA → the vision of the predictor inside the RL loop → and the cheap-inference / continual-RL trajectory. Appendix includes the full feature list with importances and the live-loop prototype data.

We were deliberately rigorous about honesty throughout: every headline number is held-out and labeled, and every limitation is on the slide.

Conclusion

We introduce a new architecture, TAP, which predicts model improvement without needing to train the model. Because TAP transfers zero-shot, it should scale well to subjects it has never seen. TAP therefore provides a path to reducing compute in RLVR/LoRA settings.


Built with: GRPO/RLVR (verifiers + prime-rl style stack), Qwen2.5-Math-1.5B & Qwen3-1.7B, gradient-boosted trees + ridge, on 8×H100 / Prime Intellect compute.
