Microsoft and Tsinghua Introduce Adaptive Reward Reasoning Models That Scale Compute on the Fly to Improve LLM Alignment
In recent years, reinforcement learning (RL) has become central to refining large language models after pretraining. Two main strategies have emerged: reinforcement learning from human feedback (RLHF), which draws on human preference judgments, and reinforcement learning with verifiable rewards (RLVR), which relies on reward signals that a program can check automatically. The latter has shown strong gains on tasks such as math problems, but it requires a pool of prompts with known, checkable answers. That constraint makes RLVR hard to scale to open-ended tasks where correctness cannot be verified mechanically.
Current reward-modeling approaches fall into two categories. Scalar models assign a single numeric score to each query-response pair, while generative models produce a short text critique explaining strengths and weaknesses. Either type can use absolute evaluation, where each response is scored on its own, or comparative ranking, in which two answers face off. Generative judges deliver human-readable feedback, but their training data can bias their verdicts. More importantly, at evaluation time every sample receives an identical compute budget: a one-size-fits-all policy that wastes effort on simple prompts and leaves no headroom for harder ones. Some systems sample multiple outputs or pad reasoning traces to a fixed length, but they still ignore how difficult each input actually is.
A joint effort by teams at Microsoft Research, Tsinghua University, and Peking University produced a framework called Reward Reasoning Models (RRMs). RRMs split reward prediction into two steps: the model first lays out a chain of thought, then assigns a final judgment. This two-phase design treats compute as a flexible resource: when a case looks difficult, the model can extend its reasoning before committing to a reward. Earlier methods forced a static trade-off between depth and speed, whereas RRMs let each example steer its own computational path. And because the model generates its own reasoning, it can learn to self-correct without explicit supervision over the thought process.
RRMs build on the Qwen2 models, which use a Transformer-decoder backbone, and frame reward modeling as a text-completion task: the model writes out its reasoning trace and then appends a final judgment. Each input consists of a user query plus two candidate responses; the model must choose the one it prefers and may not declare a tie. Because the reasoning segment and the final verdict live in a single text stream, RRMs are straightforward to train under standard language-modeling objectives, and the weights carry over from pretraining with no changes beyond the fine-tuning on reward signals.
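As a rough illustration of this text-completion framing, the sketch below builds a pairwise judging prompt, generates the reasoning trace and verdict in one stream, and parses out the preferred response. The prompt wording, the \boxed{} verdict tag, and the Qwen2 checkpoint name are illustrative assumptions, not the exact format released with the paper.

```python
# Minimal sketch of pairwise reward judging as text completion.
# Prompt template, verdict tag, and model ID are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2-7B-Instruct"  # assumed stand-in for an RRM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

JUDGE_TEMPLATE = (
    "You are a reward model. Compare the two responses to the query.\n"
    "Think step by step, then end with '\\boxed{{1}}' or '\\boxed{{2}}' "
    "to name the better response. A tie is not allowed.\n\n"
    "Query:\n{query}\n\nResponse 1:\n{resp_a}\n\nResponse 2:\n{resp_b}\n"
)

def judge_pair(query: str, resp_a: str, resp_b: str,
               max_new_tokens: int = 1024) -> int:
    """Return 0 if the first response is preferred, 1 otherwise."""
    prompt = JUDGE_TEMPLATE.format(query=query, resp_a=resp_a, resp_b=resp_b)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Reasoning trace and final verdict come out as one text stream;
    # a larger max_new_tokens budget simply allows a longer chain of thought.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    # Take the last boxed verdict; default to the first response if none is found.
    return 1 if completion.rfind("\\boxed{2}") > completion.rfind("\\boxed{1}") else 0
```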
For systematic study, the team turned to the RewardBench benchmark suite, which covers five key dimensions: instruction following, helpfulness, factual accuracy, safety, and level of detail. RRMs can also score more than two options by borrowing ideas from competitive games. In one scheme, answers receive Elo-style ratings from round-robin pairwise matches; in another, candidates enter a single-elimination bracket until one remains. Both schemes can be combined with majority voting at inference time: the model repeats each head-to-head comparison several times, and the vote picks the most common outcome, yielding more robust preference judgments when individual comparisons are close.
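A minimal sketch of the Elo-style round-robin scheme with majority-voted comparisons might look as follows. The `judge` callable stands in for any pairwise comparator (such as the `judge_pair` sketch above); the vote count, K-factor, and starting rating are arbitrary illustrative choices rather than values from the paper.

```python
import random
from itertools import combinations
from typing import Callable, List

# (query, resp_a, resp_b) -> 0 if resp_a wins, 1 if resp_b wins
Judge = Callable[[str, str, str], int]

def majority_judge(judge: Judge, query: str, a: str, b: str, votes: int = 5) -> int:
    """Repeat one head-to-head comparison several times and keep the modal verdict."""
    wins_for_b = sum(judge(query, a, b) for _ in range(votes))
    return 1 if wins_for_b * 2 > votes else 0

def round_robin_ratings(judge: Judge, query: str, candidates: List[str],
                        votes: int = 5, k: float = 32.0) -> List[float]:
    """Elo-style ratings from all pairwise matches, each decided by majority vote."""
    ratings = [1000.0] * len(candidates)
    pairs = list(combinations(range(len(candidates)), 2))
    random.shuffle(pairs)  # Elo updates are order-dependent, so shuffle the schedule
    for i, j in pairs:
        winner = j if majority_judge(judge, query, candidates[i], candidates[j], votes) else i
        loser = i if winner == j else j
        # Standard Elo update: the upset margin drives the rating shift.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return ratings
```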
In evaluations on RewardBench and a test set called PandaLM, RRMs held their own against top-tier alternatives. The largest variant, RRM-32B, delivered 98.6 percent accuracy on the reasoning subset. A DirectJudge model trained on the same data trailed by several points. In a reward-guided best-of-N setup, RRMs outpaced every baseline without raising their compute footprint. Adding the majority-vote step provided further gains across all evaluated categories. When faced with dozens of candidate replies, RRMs still pick the top answer more reliably than static critics.
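The single-elimination variant used for this kind of best-of-N selection reduces to a short loop over bracket rounds. Again, `judge` is any pairwise comparator and the repeat count for majority voting is an assumption.

```python
from typing import Callable, List

# (query, resp_a, resp_b) -> 0 if resp_a wins, 1 if resp_b wins
Judge = Callable[[str, str, str], int]

def knockout_best_of_n(judge: Judge, query: str, candidates: List[str],
                       votes: int = 5) -> str:
    """Single-elimination bracket: winners of majority-voted pairings advance.

    Assumes at least one candidate is provided.
    """
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # An odd candidate out receives a bye into the next round.
        if len(pool) % 2 == 1:
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            wins_for_b = sum(judge(query, a, b) for _ in range(votes))
            next_round.append(b if wins_for_b * 2 > votes else a)
        pool = next_round
    return pool[0]
```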
Next, the group fed RRM rewards into post-training loops and measured downstream performance on tasks such as MMLU-Pro and GPQA. Scores climbed steadily across feedback cycles, suggesting that the reward signals improve real-world reasoning rather than just toy benchmarks. The team also ran a scaling sweep across 7-, 14-, and 32-billion-parameter models and found a clear pattern: allowing longer chains of thought consistently lifted reward-judging accuracy.
By inserting an explicit reasoning pass before the final judgment, RRMs address the rigid compute budgets that constrained earlier reward models. The built-in ability to scale both parallel sampling and sequential reasoning makes evaluation more efficient. That adaptive allocation of compute could prove valuable for any task that needs nuanced feedback, from enforcing safety checks to keeping text aligned with style guides. Researchers see this as a step toward more sophisticated alignment engines that tailor their effort to each input’s true complexity.