Large language models often undergo an extra alignment phase that fine-tunes their behavior using reinforcement learning guided by human feedback or verifiable task correctness. This step helps models meet user expectations, making them better suited for instruction-following applications and for math problems whose answers can be checked exactly.
Researchers face a choice between two main training strategies: offline methods that rely on fixed, pre-generated datasets and fully online approaches that generate fresh training samples from the current model at every step. Offline techniques cannot adapt as the policy improves, which limits their end performance, whereas fully online schemes demand substantial computational resources because generation and training must stay in lockstep.
Earlier efforts have leaned on Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) for alignment. DPO runs offline on pairs of preferred and rejected responses and is prized for its simplicity and data efficiency, though it lacks real-time adaptability. GRPO builds on the PPO algorithm, sampling a group of outputs per prompt and scoring each one by its advantage relative to the rest of the group, with no separate value model. This on-policy design boosts responsiveness to changing reward signals but raises compute costs and slows down experimentation cycles.
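The core of GRPO's update is easy to state: reward each sampled response, then standardize those rewards within the group so that no learned value function is needed to provide a baseline. Below is a minimal sketch of that step; the function name and example rewards are illustrative, not taken from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: each response's reward is standardized against
    the group of responses sampled for the same prompt, so no value network
    is required to estimate a baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one math prompt, scored 1.0 if verified correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```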
A joint team from Meta and NYU tested a semi-online strategy that strikes a balance between these extremes. Instead of syncing the model’s generation and training modules at every step or not at all, they introduced a variable synchronization interval. By controlling how often new samples influence training, this setup cuts down on overall run time while preserving adaptability. The modular design lets researchers switch between DPO and GRPO and plug in task-specific reward models as needed.
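A rough picture of such a loop is given below, with the synchronization interval s controlling how stale the rollouts are allowed to become: s = 1 recovers fully online training, while a very large s approaches the offline regime. This is a simplified sketch under assumed interfaces (the callables and toy usage are placeholders), not the authors' implementation.

```python
import copy
import random

def semi_online_train(policy_update, generate_rollouts, initial_policy,
                      prompts, sync_interval, num_steps, batch_size=4):
    """Re-sync the rollout (generation) policy to the training policy only
    every `sync_interval` steps, then keep training on the buffered rollouts."""
    train_policy = initial_policy
    rollout_policy = copy.deepcopy(initial_policy)       # frozen copy used for generation
    buffer = generate_rollouts(rollout_policy, prompts)  # initial batch of rollouts

    for step in range(num_steps):
        if step > 0 and step % sync_interval == 0:
            rollout_policy = copy.deepcopy(train_policy)         # sync generator weights
            buffer = generate_rollouts(rollout_policy, prompts)  # refresh training data
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        train_policy = policy_update(train_policy, batch)        # DPO or GRPO update
    return train_policy

# Toy usage with stand-in components (counters and canned strings, not real models):
semi_online_train(
    policy_update=lambda p, batch: p + 1,
    generate_rollouts=lambda p, qs: [(q, f"answer-from-policy-{p}") for q in qs],
    initial_policy=0,
    prompts=["2+2=?", "3*5=?"],
    sync_interval=10,
    num_steps=30,
)
```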
The group fine-tuned the Llama-3.1-8B-Instruct model on both open-ended prompts and math problems. For non-verifiable tasks, prompts were drawn from the WildChat-1M dataset and scored by the Athene-RM-8B reward model, which assigns scalar judgments. For verifiable math challenges, they used the NuminaMath collection alongside the Math-Verify toolkit to confirm correctness. Experiments ran on 32 NVIDIA H200 GPUs for training and eight GPUs for inference, with comparisons among offline, semi-online (at different sync intervals), and fully online setups.
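The two reward pathways amount to a simple routing rule: math prompts get an exact check against a reference answer, while open-ended prompts get a scalar score from a reward model. The sketch below assumes Math-Verify's parse/verify interface; `reward_model_score` is a hypothetical stand-in for a scorer such as Athene-RM-8B, not an actual API.

```python
from math_verify import parse, verify  # assumed interface of the Math-Verify toolkit

def verifiable_reward(gold_answer: str, model_answer: str) -> float:
    """1.0 if the model's final answer matches the reference, else 0.0."""
    try:
        return 1.0 if verify(parse(gold_answer), parse(model_answer)) else 0.0
    except Exception:
        return 0.0  # treat unparsable outputs as incorrect

def score(sample: dict, reward_model_score) -> float:
    """Route a sample to the right reward: exact verification when a reference
    answer exists (math prompts), scalar reward-model judgment otherwise."""
    if sample.get("gold_answer") is not None:  # verifiable, NuminaMath-style prompt
        return verifiable_reward(sample["gold_answer"], sample["response"])
    return reward_model_score(sample["prompt"], sample["response"])  # open-ended, WildChat-style
```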
Performance gains were clear on Math500: offline DPO hit 53.7% accuracy, while semi-online DPO at a synchronization interval of s = 100 climbed to 58.9%. Online DPO and GRPO recorded 58.7% and 58.1%, respectively. On the NuminaMath benchmark, offline DPO reached 36.4%, and the semi-online variant at s = 10 improved to 39.4%. Open-ended evaluations using AlpacaEval 2.0 and Arena-Hard also favored models trained with a mix of verifiable and non-verifiable rewards. Blending both reward types in one run yielded stronger average scores across benchmarks.
This work shows that strict offline or fully online regimes aren’t mandatory for effective LLM fine-tuning. A flexible synchronization schedule can lower compute demands and boost results across diverse task types without sacrificing adaptability.

