ASTRO Boosts Llama 3’s Math Reasoning by Up to 20% Through Post-Training
A new method has lifted the math skills of Llama-3.1-70B-Instruct by teaching it search-like thinking, self-review, and backtracking. Developed by teams at Meta AI and the University of Washington, the framework, called ASTRO (Autoregressive Search-Taught Reasoner), instills a search mindset in LLMs through a dedicated post-training pipeline. Reported gains:
- MATH 500: 65.8% → 81.8%
- AMC 2023: 37.5% → 64.4%
- AIME 2024: 10.0% → 30.0%
The core of ASTRO is a Monte Carlo Tree Search (MCTS) over solution paths that tracks every reasoning turn, right or wrong. In practice, the MCTS stage sifts through tens of thousands of reasoning branches, tagging each node as correct or incorrect. A technique called procedure cloning then flattens these trees into long chain-of-thought demonstrations that embed failed attempts and the recoveries from them. These sequences are finally rewritten as natural-language CoT examples for supervised fine-tuning.
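A minimal sketch of what that tree-to-CoT linearization might look like; the node structure, traversal order, and backtracking phrase are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One reasoning step in a search tree (illustrative structure)."""
    text: str                        # natural-language content of the step
    correct: bool                    # verdict assigned during tree search
    children: list["Node"] = field(default_factory=list)

def linearize(node: Node, trace: list[str]) -> None:
    """Depth-first walk that keeps failed branches and inserts a
    backtracking phrase before continuing along the correct path."""
    trace.append(node.text)
    failed = [c for c in node.children if not c.correct]
    good = [c for c in node.children if c.correct]
    for child in failed:
        linearize(child, trace)
        trace.append("Wait, this approach doesn't work. Let me backtrack "
                     "and revisit the earlier step.")
    for child in good:
        linearize(child, trace)

root = Node("Set up the equation from the problem statement.", True, [
    Node("Assume x is negative and expand.", False),
    Node("Treat x as non-negative and solve directly.", True),
])
steps: list[str] = []
linearize(root, steps)
print("\n".join(steps))
```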
That process produces a model that does more than walk through steps; it reflects on each move, and when its confidence falls, it may say “I need to revisit that equation setup” before backtracking.
In the first stage, Llama-3.1-70B-Instruct was fine-tuned on 36.1K CoT records drawn from MATH, AMC/AIME, and AoPS-style problems. The SFT model achieved:
- MATH 500: 69.6%
- AMC 2023: 51.9%
- AIME 2024: 16.3%
Those figures compete with or exceed the baseline model and SPOC/Step-KTO variants that lack explicit search guidance. The lift also holds across difficulty levels, from straightforward algebra questions to contest-level challenges. In other words, SFT alone already gave a clear performance boost simply by exposing the model to reasoning traces that include search patterns.
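As a rough illustration, one such search-derived CoT record could be packaged as a plain prompt/response pair for supervised fine-tuning; the field names and prompt wording below are assumptions, not the actual ASTRO dataset schema.

```python
def build_sft_record(problem: str, linearized_cot: str, final_answer: str) -> dict:
    """Pair a problem with its search-derived chain of thought so it can be
    used as a standard prompt/response fine-tuning example."""
    return {
        "prompt": f"Solve the following problem step by step.\n\n{problem}",
        "response": f"{linearized_cot}\n\nFinal answer: {final_answer}",
    }

record = build_sft_record(
    problem="If 3x + 5 = 20, what is x?",
    linearized_cot=(
        "Set up the equation 3x + 5 = 20. Suppose x were negative; then 3x + 5 "
        "would be less than 5, which contradicts the right-hand side. Let me "
        "backtrack. Subtracting 5 from both sides gives 3x = 15, so x = 5."
    ),
    final_answer="5",
)
print(record["response"])
```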
The next phase applies reinforcement learning starting from the SFT checkpoint. A customized Group Relative Policy Optimization (GRPO) loop used simple verifiable rewards (+1 for a correct final answer, -1 for an incorrect one) across 8.7K moderately difficult prompts. CoT outputs grew longer over training, stretching from roughly 1.8K to 6K tokens per sample, which reflects deeper internal exploration.
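The group-relative signal at the heart of GRPO can be sketched in a few lines: each sampled completion's reward is normalized against the other completions drawn for the same prompt. The snippet below is a simplified illustration under that assumption, not Meta's training code.

```python
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Normalize each completion's reward against the mean and standard
    deviation of its own sampling group (simplified GRPO-style advantage)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: eight completions sampled for one prompt, scored +1 when the
# final answer is correct and -1 otherwise.
rewards = [1, -1, -1, 1, -1, -1, -1, 1]
print(group_relative_advantages(rewards).round(3))
```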
The final ASTRO-RL agent posted these scores:
- MATH 500: 81.8%
- AMC 2023: 64.4%
- AIME 2024: 30.0%
These outcomes match or surpass models with higher parameter counts and highlight the value of a search-aware starting point.
Analyses also show a strong link between how often the model backtracks and its accuracy. As training progressed, ASTRO-RL performed more self-checks and deeper exploration. Pearson correlations between backtracking frequency and accuracy rose above 0.8 on every benchmark, suggesting that self-review and step reversal go hand in hand with better results.
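That kind of correlation check is easy to reproduce on one's own evaluation logs; the numbers below are made-up placeholders, not figures from the paper.

```python
import numpy as np

# Hypothetical per-checkpoint statistics: average number of backtracking
# phrases per solution and benchmark accuracy at the same checkpoint.
backtracks_per_solution = np.array([0.4, 0.9, 1.6, 2.3, 3.1, 3.8])
accuracy = np.array([0.66, 0.70, 0.74, 0.77, 0.80, 0.82])

# Pearson correlation between backtracking frequency and accuracy.
r = np.corrcoef(backtracks_per_solution, accuracy)[0, 1]
print(f"Pearson r = {r:.3f}")
```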
In side-by-side trials, models trained on direct CoT solutions without search priors trailed ASTRO on every benchmark. For example, ASTRO-RL outperformed its Direct-RL counterpart by:
- 2% on MATH 500
- 3.9% on AMC 2023
- 2.9% on AIME 2024
ASTRO's outputs can be drawn as directed graphs, where each node marks a reasoning step and edges show moves, reflections, or corrections. Researchers can inspect these diagrams to trace the solution path, spot common pitfalls, and guide further tuning.
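A small sketch of how such a trace could be rendered as a directed graph with networkx; the step labels and edge types are invented for illustration.

```python
import networkx as nx

# Toy reasoning trace: nodes are steps, labeled edges are moves,
# reflections, or corrections (all labels are illustrative).
G = nx.DiGraph()
G.add_edge("Set up equation", "Attempt: assume x < 0", kind="move")
G.add_edge("Attempt: assume x < 0", "Self-check fails", kind="reflection")
G.add_edge("Self-check fails", "Set up equation", kind="backtrack")
G.add_edge("Set up equation", "Attempt: x >= 0", kind="correction")
G.add_edge("Attempt: x >= 0", "Final answer: x = 5", kind="move")

# Inspect the structure; drawing it requires matplotlib (e.g., nx.draw(G)).
for src, dst, data in G.edges(data=True):
    print(f"{src} --{data['kind']}--> {dst}")
```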
ASTRO shows that open models can gain stronger, more human-like reasoning not by adding parameters or extending pretraining, but through guided post-training that mimics search procedures. Teaching a model to plan ahead, question its own logic, and fix its mistakes sets a new standard for fine-tuning large language models. Because ASTRO can use existing open models and off-the-shelf compute, teams beyond large labs could adopt this strategy for applications outside math, such as coding or logic puzzles.