
New RL Token Selection Method Cuts LLM Training Costs and Boosts Accuracy

DATE: 6/9/2025

Researchers from Alibaba and Tsinghua show that training only on the small set of tokens that actually steer a model’s reasoning sharply improves math performance while cutting wasted computation.


Large language models often produce step-by-step reasoning traces known as Chains of Thought (CoT), where each generated token plays a part in guiding the logic. Researchers have discovered that only a small fraction of these tokens truly steer the model toward different reasoning paths. By pinpointing and training on this critical subset, a team from Alibaba Inc. and Tsinghua University has boosted performance on challenging mathematical benchmarks while slashing computational waste.

CoT sequences rely on thousands of tokens, each chosen according to the model’s internal probability estimates. Standard reinforcement learning setups weigh every token equally when calculating gradients against a reward signal. This uniform treatment can dilute the impact of updates, since most tokens merely extend an existing statement rather than trigger a shift in logic. Tokens that serve as decision points—what this study calls “forking tokens”—are far more influential but get lost in the noise of blanket optimization.
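To make that uniform treatment concrete, here is a minimal PyTorch-style sketch (the function and variable names are illustrative, not from the paper) in which a single sequence-level advantage is spread evenly across every generated token, so a routine connector counts exactly as much as a pivotal decision word:

```python
import torch

def uniform_pg_loss(logprobs, advantage, mask):
    # logprobs: (seq_len,) log-probabilities of the sampled CoT tokens
    # advantage: scalar advantage derived from the sequence-level reward
    # mask: (seq_len,) 1.0 for generated tokens, 0.0 for padding
    per_token = -advantage * logprobs * mask   # every token gets the same weight
    return per_token.sum() / mask.sum()
```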

Existing reinforcement learning algorithms for language models include Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO). PPO uses a clipped objective to keep policy updates from swinging too far. GRPO drops the separate value network and instead estimates advantages from groups of sampled responses. DAPO adds features such as a clip-higher mechanism and an overlength reward-shaping term. All of these approaches still apply gradient updates across every output token, without distinguishing which ones actually matter for correct reasoning.
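For orientation, the sketch below shows the clipped surrogate loss that PPO introduced and that GRPO and DAPO build on. It is a simplified illustration that omits the group-relative baseline, clip-higher, and reward-shaping details, and the function name is ours:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Per-token importance ratio pi_new / pi_old
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The loss averages over *all* output tokens, decision points and filler alike
    return -torch.min(unclipped, clipped).mean()
```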

Shifting focus to entropy, a measure of uncertainty in the model’s next-token distribution, the team examined CoT outputs from Qwen3 models of various sizes. They computed token-level entropy at each generation step using H = −∑ p(x) log p(x). Over half of all tokens registered entropy below 0.01, revealing almost deterministic behavior. Roughly 20% exceeded an entropy threshold of 0.672, indicating points where the model hesitated among several valid continuations. Analysis showed that high-entropy tokens often corresponded to words or operators that introduce new premises, such as “assume,” “since,” or “thus.” In contrast, low-entropy tokens included predictable connectors, suffixes, or code fragments that simply built on what came before.
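That per-step entropy can be computed directly from the model’s logits. The short sketch below assumes a (seq_len, vocab_size) logits tensor and reuses the 0.672 cutoff reported above; the helper names are hypothetical:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    # logits: (seq_len, vocab_size) pre-softmax scores at each generation step
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)    # H = -sum p(x) log p(x), shape (seq_len,)

def forking_token_mask(logits, threshold=0.672):
    # True where the model hesitates among several continuations ("forking tokens")
    return token_entropy(logits) > threshold
```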

Armed with this insight, the researchers built a training pipeline that restricts updates to high-entropy tokens. Policy gradients flow through those critical nodes in the reasoning chain, leaving the rest of the sequence untouched. This selective strategy saves computation and concentrates learning on the moments that drive real decision-making.
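Here is a minimal sketch of such a selective update, assuming per-token entropies and log-probabilities are already in hand; the top-20% cutoff is computed per response here, which may differ from the authors’ exact pipeline:

```python
import torch

def selective_pg_loss(logprobs, entropies, advantage, top_frac=0.20):
    # Keep only the top-`top_frac` highest-entropy tokens in the response
    k = max(1, int(top_frac * entropies.numel()))
    cutoff = torch.topk(entropies, k).values.min()   # k-th largest entropy
    mask = (entropies >= cutoff).float()
    per_token = -advantage * logprobs * mask         # low-entropy tokens get zero gradient
    return per_token.sum() / mask.sum()
```

Because the mask zeroes out roughly 80% of the tokens, the entire gradient signal comes from the forking tokens identified by the entropy analysis.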

Validation came via three versions of Qwen3: the 8-billion-parameter, 14-billion-parameter, and 32-billion-parameter models. When updates targeted only the top 20% most uncertain tokens, Qwen3-32B posted a score of 63.5 on AIME’24 and 56.7 on AIME’25. Both figures set new highs among models with fewer than 600 billion parameters. Stretching the maximum response length from 20,000 to 29,000 tokens lifted the AIME’24 result to 68.1. Training on the remaining 80% of low-entropy tokens produced a sharp decline in capability, confirming that filler tokens add little value.

Smaller models saw similar gains. Qwen3-14B improved by +5.21 on AIME’24 and +4.79 on AIME’25 when guided by the 20% threshold. Qwen3-8B matched or exceeded its full-token-training performance under the new regime. An ablation study checked other thresholds: a 10% cutoff dropped performance by omitting too many essential decision points, while 50% or 100% diluted the effect by reintroducing low-entropy tokens, which eroded diversity in the update signal.

Results indicate that a minority of tokens carry the weight of logical progression. By zooming in on those forks, the model learns to choose more accurate pathways without burning cycles on predictable continuations. Concentrated updates reduce the number of gradient calculations, lightening the training load. At the same time, the retained high-entropy tokens preserve the exploratory behavior needed to tackle novel or complex problems.

Key outcomes include:

• Twenty percent of generated tokens act as decision points that guide reasoning paths.
• Training exclusively on these high-entropy tokens matches or exceeds full-token training results.
• Qwen3-32B reached 63.5 on AIME’24 and 56.7 on AIME’25, outperforming larger models trained under traditional frameworks.
• Extending the response limit from 20,000 to 29,000 tokens pushed the AIME’24 score to 68.1.
• Focusing on the low-entropy 80% causes steep declines in accuracy.
• Keeping the 20% high-entropy threshold balances exploration with effective gradient updates.
• Larger architectures benefit the most, thanks to their capacity for richer exploration.
• This approach scales to models with higher parameter counts, pointing the way to more efficient reasoning-focused training.

By aligning reinforcement learning objectives with the actual decision-making moments in a chain of thought, this method offers a streamlined route to stronger performance. Models learn from uncertainty rather than certainty, sharpening their ability to handle branching logic. The technique lays out a practical blueprint for future work on reasoning-centered architectures, where selective updates can drive smarter, more resource-aware progress.

Keep building