A paper by Alexia Jolicoeur-Martineau of Samsung SAIL Montréal argues that a very small neural network can outperform massive Large Language Models (LLMs) on hard reasoning benchmarks. The Tiny Recursive Model (TRM) uses just 7 million parameters, less than 0.01% of the size of many leading LLMs, and posts new best results on tasks such as the ARC-AGI intelligence test. The work challenges the industry mantra that bigger is better, proposing a far more parameter-efficient route to complex reasoning problems.
LLMs are impressive at producing fluent, human-like prose, but they can be brittle when a multi-step chain of logic is required. These models generate output token-by-token, so an early mistake can propagate and break a long reasoning sequence, yielding an invalid final result. One mitigation has been Chain-of-Thought techniques, where a model “thinks out loud” to break a problem into intermediate steps. That approach can improve correctness, yet it demands heavy computation and large amounts of high-quality reasoning data that are not always available, and it still produces flawed logic on some tasks that require exact, deterministic execution.
TRM traces its roots to the Hierarchical Reasoning Model (HRM). HRM introduced a pair of small neural networks that operate on a problem recursively at different frequencies to refine a solution. The design showed promise but carried conceptual and practical complexity, relying on uncertain biological analogies and on fixed-point mathematical arguments that were not guaranteed to hold in every case. The new TRM refines that line of thought with a much simpler architecture.
Rather than using two distinct networks, TRM runs a single, compact network in a recursive loop that updates both its internal "reasoning" state and its proposed "answer." During a run, the model receives three items: the problem statement, an initial guess at the solution, and a latent reasoning state. It performs several internal refinement steps to update that latent state, then uses the improved state to revise the answer. The full recursion can be repeated up to 16 times, giving the model multiple chances to detect and correct earlier errors in a highly parameter-efficient way.
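To make that loop concrete, here is a minimal PyTorch sketch of the structure the paragraph describes. It is an illustration under assumptions, not the paper's released code: the class name, the residual updates, and the step counts n_inner and n_outer are hypothetical, and the toy MLP stands in for the paper's actual small network.

```python
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    """Illustrative TRM-style recursion; not the paper's exact code.

    One compact network is reused both to refine a latent reasoning
    state z and to revise the current answer y for a question x.
    """

    def __init__(self, dim: int, n_inner: int = 6, n_outer: int = 16):
        super().__init__()
        # A single small network shared by every refinement step.
        self.net = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.n_inner = n_inner  # latent-refinement steps per cycle
        self.n_outer = n_outer  # answer-revision cycles (up to 16)

    def forward(self, x, y, z):
        for _ in range(self.n_outer):
            # Several inner steps update the latent reasoning state z.
            for _ in range(self.n_inner):
                z = z + self.net(torch.cat([x, y, z], dim=-1))
            # The refined latent state is then used to revise the answer y.
            y = y + self.net(torch.cat([x, y, z], dim=-1))
        return y, z
```

Reusing one network across every step is what keeps the parameter count tiny while the effective depth of computation grows with each recursion.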
A surprising finding from the study is that a two-layer TRM generalizes better than a four-layer variant. Reducing depth appears to limit the model’s tendency to memorize idiosyncrasies in small, specialized datasets, which helps with performance on held-out examples. That outcome runs counter to the common instinct to stack more layers for greater capacity, and it points to trade-offs between expressivity and robustness when training on scarce reasoning data.
TRM also removes a major theoretical complication that HRM depended on. HRM’s training rationale required assuming the model’s internal functions converged to a fixed point, which complicated both proofs and implementation. TRM abandons that route and simply back-propagates through its entire recursion during training. In ablation experiments this adjustment produced a dramatic jump in performance: accuracy on the Sudoku-Extreme benchmark rose from 56.5% to 87.4% when training used full back-propagation through recursion instead of the fixed-point assumption.
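The practical difference shows up in how a training step is written. Below is a hypothetical sketch, assuming the model interface above and a mean-squared-error objective chosen purely for simplicity: there is no fixed-point shortcut, just ordinary back-propagation through the whole unrolled recursion.

```python
import torch
import torch.nn.functional as F

# Hypothetical training step; the optimizer, loss, and tensor names
# are illustrative rather than taken from the paper's code.
def train_step(model, optimizer, x, y_init, z_init, target):
    # Run the full recursion with gradients enabled end to end:
    # no fixed-point assumption, no detached one-step approximation.
    y, z = model(x, y_init, z_init)
    loss = F.mse_loss(y, target)
    optimizer.zero_grad()
    loss.backward()  # gradients flow back through every recursive cycle
    optimizer.step()
    return loss.item()
```

The evident cost is memory, since the full recursion stays on the autograd tape, but the reported ablation suggests the accuracy gain justifies it.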
The empirical results are striking across multiple tasks. On Sudoku-Extreme, which provides only 1,000 training examples, TRM reaches 87.4% test accuracy versus HRM's roughly 55%. On Maze-Hard, a long-path search task on 30×30 grids, TRM posts 85.3% against HRM's 74.5%. The gains extend to ARC-AGI, a benchmark designed to probe fluid, human-like intelligence through abstraction and transformation problems: with 7M parameters TRM achieves 44.6% on ARC-AGI-1 and 7.8% on ARC-AGI-2. That performance bests HRM's results from a 27M-parameter model and exceeds the scores of many far larger LLMs; for comparison, Gemini 2.5 Pro scores 4.9% on ARC-AGI-2.
Training efficiency received attention as well. The team simplified an adaptive control mechanism called ACT, the component that decides when a given example has been improved enough that training can move on to another sample, so the model no longer needs a second, expensive forward pass per training step. According to the reported experiments, that optimization cuts compute without a large drop in final generalization performance.
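The paper's exact ACT formulation is not reproduced here; the sketch below only illustrates the general shape of a one-pass halting decision, with a hypothetical halt_head and threshold that are not taken from the paper.

```python
import torch
import torch.nn as nn

dim = 128  # illustrative latent width

# Hypothetical one-pass halting signal: a small head reads the current
# latent state and predicts whether this example is refined enough,
# so no second forward pass is needed just to decide when to stop.
halt_head = nn.Linear(dim, 1)

def should_halt(z: torch.Tensor, threshold: float = 0.5) -> bool:
    p_halt = torch.sigmoid(halt_head(z)).mean()
    return p_halt.item() >= threshold
```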
Taken together, these insights form a clear technical case against the singular focus on scaling parameter counts as the primary path to better reasoning. TRM demonstrates that iterative refinement and explicit self-correction inside a compact module can resolve extremely difficult problems while using a tiny fraction of the parameters and compute typically associated with state-of-the-art LLMs. The approach also highlights a different set of levers for progress: architectural design choices, training-through-recursion, and careful handling of overfitting on small datasets.
The paper leaves open plenty of follow-up questions. Researchers will want to test TRM-style recursion on other reasoning domains, study how the approach interacts with larger foundation models, and probe limits such as how many recursive steps are optimal across tasks. For now, Alexia Jolicoeur-Martineau’s TRM provides a clear example that smaller, specially designed networks can outperform far bigger models on targeted, logic-heavy challenges.

