Modelwire

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning


A new arXiv paper exposes critical implementation bugs in widely used LLM training frameworks, bugs that have invalidated recent claims about mixed-policy optimization methods. The DeepSpeed optimizer bug silently drops gradient batches during accumulation, while OpenRLHF's loss weighting error compounds the problem; together they create a false performance gap that favors newer techniques over the standard SFT-then-RL baseline. Once corrected, conventional pipelines regain their edge, suggesting the field may have been chasing improvements that don't actually exist. The finding carries immediate implications for practitioners choosing training strategies and raises questions about reproducibility across downstream tools, including TRL and Llama-Factory.
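To make the first failure mode concrete, here is a minimal gradient-accumulation sketch in PyTorch. The `buggy` branch and all names are hypothetical, invented to illustrate the class of bug described above; this is not DeepSpeed's actual code, and the real defect may take a different form.

```python
import torch

def accumulate_and_step(model, optimizer, micro_batches, accum_steps=4, buggy=False):
    """Sum gradients over `accum_steps` micro-batches, then take one optimizer step.

    Illustrative sketch only: the `buggy` path shows how an accumulation loop can
    silently drop micro-batches without raising any error.
    """
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(micro_batches):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        (loss / accum_steps).backward()  # scale so summed grads equal the window average

        if buggy and (i + 1) % accum_steps != 0:
            # Hypothetical bug: gradients are cleared after every micro-batch, so only
            # the final micro-batch of each window ever contributes to the update.
            optimizer.zero_grad()

        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

In the buggy path, each optimizer step is computed from a fraction of the data the run was supposed to see, and the loss curve still looks plausible, which is why this kind of defect can survive unnoticed through many published comparisons.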

Modelwire context

Explainer

The deeper issue isn't just that two frameworks had bugs; it's that the bugs were directionally consistent, both systematically disadvantaging the SFT-then-RL baseline. Any paper that benchmarked against that baseline using these tools may therefore have been measuring noise rather than signal.
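The second half of that pair, a loss-weighting error, is easiest to see by comparing two common ways of normalizing a sequence loss. The sketch below is illustrative only: the function names are hypothetical and this is not OpenRLHF's implementation. It simply shows that per-token and per-sequence averaging assign different effective weights to examples, so silently applying one scheme in one pipeline and the other elsewhere can bias a head-to-head comparison in a consistent direction.

```python
import torch

def per_token_loss(logits, labels, pad_id=-100):
    # Average over all non-padding tokens in the batch: long sequences weigh more.
    losses = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=pad_id, reduction="none"
    )
    mask = (labels != pad_id).float()
    return (losses * mask).sum() / mask.sum()

def per_sequence_loss(logits, labels, pad_id=-100):
    # Average within each sequence first, then across sequences: every example
    # weighs equally regardless of its length.
    losses = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=pad_id, reduction="none"
    )
    mask = (labels != pad_id).float()
    per_seq = (losses * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()
```

Because the two normalizations also differ in scale, swapping one for the other changes the effective learning rate as well as the example weighting, which is exactly the kind of shift that shows up as a "method gap" on a leaderboard.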

This connects directly to a reproducibility problem that runs through several recent arXiv findings. The 'Override Gap' paper from the same day (covering hypernetwork-based adaptation) also identified a failure mode that was invisible at the level of reported results, where magnitude mismatches between adapter signals and pretrained priors collapsed accuracy in ways aggregate metrics obscured. Both papers share a common warning: evaluation infrastructure and training pipelines can quietly encode assumptions that corrupt conclusions before a researcher ever looks at a leaderboard. That pattern is worth treating as a category, not a coincidence. The materials discovery and affective computing papers from the same batch don't connect meaningfully here.

Watch whether the authors of the highest-cited mixed-policy papers from 2025 issue corrections or re-run ablations against the patched frameworks within the next 90 days. Silence from that group would itself be informative about how seriously the field treats infrastructure-level reproducibility.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: DeepSpeed · OpenRLHF · TRL · Llama-Factory · arXiv


Modelwire Editorial

This synthesis and analysis were prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
