Please read our paper: https://drive.google.com/file/d/13dm7vzINmqoUWBBg1IkuJQc_T0ziAzYz/view?usp=sharing
We present S⁴D (Self-Taught Semi-Self Speculative Decoding), to our knowledge the first reinforcement-learned routing policy for hierarchical speculative decoding. Our three key contributions are: (1) a learned per-token gating policy that recasts the choice among verification tiers as a sequential decision problem, replacing the hand-tuned confidence thresholds of prior hierarchical speculative decoding with a tiny, q-free MLP that never invokes the target model to make its decision; (2) reinforcement learning with a quality–cost reward that balances fidelity to the target model against the latency of expensive verifier calls, trained with principled per-token credit assignment (REINFORCE, GRPO, and a per-token-advantage variant); and (3) a self-derived slim verifier obtained by an offline layer-skip search (DIMR). We challenge the assumption that hierarchical speculative decoding must rely on hand-tuned thresholds and show that a learned gate Pareto-dominates the fixed-threshold baseline: it matches lossless accuracy (0.91) with 4× fewer full-model calls and reaches up to 3.5× speedup, establishing a new state of the art in hierarchical speculative decoding.
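
To make the idea of a learned gate trained with a quality–cost reward concrete, here is a minimal toy sketch of REINFORCE with a running baseline. It is not the implementation from the paper. Everything specific in it is an assumption made for illustration: the three tiers (accept the draft, slim verifier, full target), the relative costs in `TIER_COST`, the single "draft uncertainty" feature, the made-up `toy_quality` environment, and the use of a linear softmax gate as a stand-in for the tiny MLP.

```python
import math
import random

# Hypothetical relative latency per tier: accept draft, slim verifier, full target.
TIER_COST = [0.0, 0.3, 1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_probs(weights, features):
    """Linear softmax gate over tiers, using draft-side features only."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)

def reward(quality, tier, cost_weight):
    """Quality-cost reward: fidelity term minus a penalty for the tier's latency."""
    return quality - cost_weight * TIER_COST[tier]

def reinforce_step(weights, features, tier, advantage, lr):
    """In-place REINFORCE update: w[k] += lr * adv * (1[k == tier] - p_k) * features."""
    probs = gate_probs(weights, features)
    for k, row in enumerate(weights):
        coeff = lr * advantage * ((1.0 if k == tier else 0.0) - probs[k])
        for j, f in enumerate(features):
            row[j] += coeff * f

def toy_quality(uncertainty, tier):
    """Made-up environment: cheaper tiers lose fidelity as draft uncertainty grows."""
    return 1.0 - uncertainty * (1.0, 0.5, 0.0)[tier]

def train(steps=20000, cost_weight=0.4, lr=0.05, seed=0):
    rng = random.Random(seed)
    weights = [[0.0, 0.0] for _ in TIER_COST]  # features: [bias, uncertainty]
    baseline = 0.0  # running mean reward, used to reduce gradient variance
    for _ in range(steps):
        u = rng.random()  # hypothetical per-token draft uncertainty in [0, 1)
        features = [1.0, u]
        probs = gate_probs(weights, features)
        tier = rng.choices(range(len(probs)), weights=probs)[0]
        r = reward(toy_quality(u, tier), tier, cost_weight)
        reinforce_step(weights, features, tier, r - baseline, lr)
        baseline += 0.01 * (r - baseline)
    return weights

if __name__ == "__main__":
    w = train()
    for u in (0.05, 0.5, 0.95):
        print(u, [round(p, 3) for p in gate_probs(w, [1.0, u])])
```

In this toy setting the gate only sees a draft-side feature and is never given the target model's output when it decides, which mirrors the "never invokes the target model" property described above. Raising `cost_weight` makes full-model calls more expensive in the reward, so the trained gate should route fewer tokens to the full tier.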
