Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Researchers reframe post-training optimization around state distributions rather than loss functions alone, arguing that SFT, RL, and on-policy distillation differ fundamentally in which training states they sample from. Using Qwen 3 0.6B on GSM8K with retention checks on TruthfulQA and MMLU, the work identifies three key phenomena that challenge conventional loss-centric analysis. This perspective shift matters for practitioners tuning alignment pipelines: it suggests that which prompts and prefixes a model learns from during post-training may be as consequential as the objective itself, potentially unlocking more efficient fine-tuning strategies and better transfer across domains.
Modelwire context
Explainer: The paper's practical provocation is that two fine-tuning runs using identical objectives and identical data can produce different models simply because they sample different prefixes and intermediate states during training. That reframes debugging alignment failures as a data-coverage problem, not just a loss-design problem.
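To make the idea of a "state" concrete, here is a minimal toy sketch in Python. It is not the paper's code. It assumes a state is simply a token prefix, and the names `sft_states`, `on_policy_states`, and `toy_policy` are hypothetical, introduced only for illustration. SFT-style training visits prefixes of fixed demonstrations, while RL and on-policy distillation visit prefixes of the model's own rollouts.

```python
import random
from collections import Counter

VOCAB = ["a", "b"]  # toy two-token vocabulary (assumption for illustration)

def sft_states(demos):
    """Count the prefixes (states) visited when training on fixed demonstrations."""
    return Counter(tuple(d[:t]) for d in demos for t in range(len(d)))

def on_policy_states(policy, n_rollouts, length, rng):
    """Count the prefixes (states) visited when the model samples its own rollouts."""
    counts = Counter()
    for _ in range(n_rollouts):
        prefix = ()
        for _ in range(length):
            counts[prefix] += 1  # the loss would be evaluated at this state
            probs = policy(prefix)
            token = rng.choices(VOCAB, weights=[probs[v] for v in VOCAB])[0]
            prefix += (token,)
    return counts

def toy_policy(prefix):
    """Hypothetical stand-in for a model: uniform over the vocabulary."""
    return {"a": 0.5, "b": 0.5}

demos = [["a", "a", "a"], ["a", "a", "b"]]
rng = random.Random(0)

sft = sft_states(demos)
onp = on_policy_states(toy_policy, n_rollouts=200, length=3, rng=rng)

# States the model reaches on its own that the demonstrations never cover.
unseen_by_sft = sorted(set(onp) - set(sft))
print("SFT states:", sorted(sft))
print("On-policy-only states:", unseen_by_sft)
```

In this toy setup the per-token objective could be identical in both cases; what differs is the set of prefixes it is averaged over. Any prefix starting with `b` never appears in the demonstrations, so an SFT-style run receives no training signal there, while an on-policy run does.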
The distillation angle connects directly to our coverage of 'The Distillation Game' from the same day, which examined how student models exploit teacher outputs during on-policy learning. That piece treated distillation primarily as a security surface, but this paper adds a complementary lens: the states a student samples from a teacher determine what behavioral distribution it actually learns, independent of the objective. Together, the two papers suggest that on-policy distillation is doing more work than practitioners typically account for, both in terms of what knowledge transfers and what attack surface opens up. The manifold optimization paper from the same batch is not a natural connection here.
If a follow-up applies this state-distribution framing to a model larger than 0.6B parameters and shows the same three phenomena hold on GPQA or a held-out reasoning benchmark, the framework earns generalization claims. Results that only survive on GSM8K at small scale should be treated as preliminary.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Qwen 3 · GSM8K · TruthfulQA · MMLU · SFT · RL
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.