
Post-Training Reinforcement Learning Sharpens Polaris-4B and 7B’s Math and Logic Performance

DATE: 6/28/2025

Next-gen AI tackles mind-bending puzzles with humanlike logic and efficient training. Researchers refine data quality, exploration, and context scope, but what unfolds…


Advanced reasoning systems lead current efforts in machine intelligence by handling challenging tasks in areas such as math problem-solving and symbolic reasoning. These architectures carry out multi-step calculations and logical deductions, often producing answers that resemble human thought patterns. Reinforcement learning refines performance after an initial pretraining phase, but scaling these methods without losing efficiency remains difficult. Rising demand for models that use fewer resources while retaining strong reasoning skills has prompted researchers to explore improvements in data quality, exploration practices, and long-context handling.

One ongoing issue in reinforcement training for large reasoning models involves a mismatch between task difficulty and model ability. Presenting too many easy exercises causes progress to stall, and feeding in problems that are far too complex fails to generate useful learning signals. This gap grows even larger when strategies that succeed with smaller models transfer poorly to more sophisticated ones. An additional challenge comes from rigid approaches to output length and rollout variety, which limit a model’s ability to tackle intricate evaluation sets during both training and serving phases.

Early efforts like DeepScaleR and GRPO showed that reinforcement training can boost the performance of reasoning models with roughly 1.5 billion parameters. Applying those methods to heavier networks such as Qwen3-4B or DeepSeek-R1-Distill-Qwen-7B, though, has yielded only minor improvements or even regressions. The shortfall traces back to static data distributions and narrow sampling strategies: these approaches skip capability-based data filtering and keep sampling temperature and response length fixed throughout training. Models trained under such unchanging conditions struggle to grow in reasoning power as architectural complexity rises.
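To make the stalled-signal problem concrete, consider how a group-relative objective such as GRPO computes its learning signal: each response's reward is normalized against the mean and standard deviation of its own rollout group. The sketch below (illustrative Python, not any released training code) shows that when a problem is trivial or impossible for the current model, every response in the group earns the same reward, the advantages collapse to zero, and no gradient flows.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward
    against the mean and std of its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A well-matched problem: some rollouts succeed, some fail -> useful signal.
print(group_relative_advantages([1, 0, 1, 0, 0, 1, 0, 0]))

# A trivial problem (all correct) or an impossible one (all wrong):
# every advantage is zero, so this group contributes no policy gradient.
print(group_relative_advantages([1, 1, 1, 1, 1, 1, 1, 1]))
print(group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0]))
```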

A team from the University of Hong Kong, ByteDance Seed, and Fudan University introduced Polaris, a post-training framework built to extend reinforcement learning for high-level reasoning. It targets challenges in symbolic logic, arithmetic puzzles, and other multi-step problem sets. The release includes two preview models: Polaris-4B-Preview, fine-tuned from Qwen3-4B, and Polaris-7B-Preview, which starts from DeepSeek-R1-Distill-Qwen-7B. The researchers designed the framework to be model-agnostic: it adjusts question difficulty, boosts exploration with controlled temperature settings, and stretches inference limits through length extrapolation. Each preview variant underwent streamlined fine-tuning, making adaptation and deployment rapid. All components rely on publicly available datasets and open-source pipelines, and both preview versions run efficiently on off-the-shelf consumer GPUs.
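For readers who want to try a preview checkpoint, a minimal inference sketch with Hugging Face transformers could look like the following. The checkpoint path is a placeholder rather than an official model identifier, and the sampling temperature of 1.4 anticipates the 4B schedule described below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: substitute whatever identifier the released
# Polaris-4B-Preview weights are actually published under.
ckpt = "path/to/Polaris-4B-Preview"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two odd integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# High sampling temperature, matching the 4B exploration setting described below.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=1.4)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```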

Polaris incorporates several core innovations in its training workflow. The first step filters out math problems deemed trivial or impossible, yielding a J-shaped distribution of difficulty that tracks the model’s advancing skill. Sampling temperature then changes in stages to preserve output variety: Polaris-4B trains with temperatures of 1.4, then 1.45, and finally 1.5, while Polaris-7B begins at 0.7, advances to 1.0, and concludes at 1.1.

A further feature extends the token context at inference time via YaRN-based extrapolation, growing the window to 96,000 tokens without additional training passes. This ‘train short, test long’ tactic avoids expensive long-sequence training runs. The workflow also includes a Rollout Rescue Mechanism that replaces zero-reward batches with richer examples, plus Intra-Batch Informative Substitution, which recycles lower-yield steps to keep valuable training signals flowing even when each rollout holds only eight sequences.
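A minimal sketch of the first two ideas, difficulty filtering and staged temperature, might look like the following. The temperature values come from the figures reported above; the pass-rate thresholds and function names are illustrative assumptions, not the released Polaris pipeline.

```python
def filter_by_difficulty(problems, pass_rate, low=0.05, high=0.95):
    """Drop problems the current model nearly always solves (trivial)
    or nearly never solves (no learning signal). Thresholds are assumed."""
    return [p for p in problems if low <= pass_rate[p] <= high]

# Staged sampling-temperature schedules reported in the article
# (training stage index -> sampling temperature).
TEMPERATURE_SCHEDULE = {
    "Polaris-4B-Preview": [1.4, 1.45, 1.5],
    "Polaris-7B-Preview": [0.7, 1.0, 1.1],
}

def sampling_temperature(model_name, stage):
    """Return the temperature for a given training stage, clamping to the last stage."""
    schedule = TEMPERATURE_SCHEDULE[model_name]
    return schedule[min(stage, len(schedule) - 1)]
```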
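The ‘train short, test long’ step relies on YaRN-style RoPE extrapolation, which Hugging Face transformers expose through a rope_scaling entry in a model’s config.json. The sketch below shows one way to apply it; the checkpoint path, scaling factor, and base context length are assumptions chosen so that 32,768 × 3 ≈ 98K tokens, roughly the 96K inference window described above, and may differ from the released configs.

```python
import json

# Placeholder local checkpoint directory.
ckpt = "./Polaris-4B-Preview"

with open(f"{ckpt}/config.json") as f:
    cfg = json.load(f)

# YaRN RoPE extrapolation: scale an (assumed) 32K training-time window
# by a factor of 3 to reach roughly the 96K inference window.
cfg["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 3.0,
    "original_max_position_embeddings": 32768,
}
cfg["max_position_embeddings"] = 98304

with open(f"{ckpt}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```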
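Finally, the Rollout Rescue Mechanism can be pictured as a cache of informative rollout groups that stands in whenever a fresh group earns no reward at all. The sketch below is a hedged illustration of that idea rather than the authors’ implementation; the companion Intra-Batch Informative Substitution would recycle lower-yield samples within the batch through an analogous substitution step.

```python
import random

def rescue_rollout_groups(groups, cache):
    """groups: list of (prompt, rollouts, rewards) tuples, eight rollouts each.
    cache: previously seen groups that produced a nonzero reward."""
    rescued = []
    for prompt, rollouts, rewards in groups:
        if any(r > 0 for r in rewards):
            cache.append((prompt, rollouts, rewards))  # remember informative groups
            rescued.append((prompt, rollouts, rewards))
        elif cache:
            rescued.append(random.choice(cache))       # swap in a cached, richer group
        else:
            rescued.append((prompt, rollouts, rewards))
    return rescued
```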

In benchmark evaluations, the Polaris preview models deliver top-tier accuracy across several math tests. In comparative trials, the 4B preview outscored multiple 32-billion-parameter systems on identical problem sets, showcasing the efficiency of controlled training variables. Polaris-4B-Preview achieved 81.2% on AIME24 and 79.4% on AIME25, outpacing the much larger Qwen3-32B with roughly one-eighth of its parameter count. It also scored 44.0% on Minerva Math, 69.1% on OlympiadBench, and 94.8% on AMC23. Polaris-7B-Preview posted 72.6% on AIME24 and 52.6% on AIME25. These figures surpass results from competing systems such as Claude-4-Opus and Grok-3-Beta, establishing Polaris as a lightweight contender that bridges the gap between small open-source models and commercial offerings above 30 billion parameters.

These results confirm that model scale alone does not guarantee top reasoning performance. The Polaris strategy shows how careful management of data difficulty, sampling diversity, and inference context length can help compact systems match the reasoning skills of much larger commercial networks.
