On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Researchers identify a fundamental tradeoff in on-policy self-distillation, a technique that boosts single-attempt accuracy by having models learn from their own correct outputs. The work reveals that this approach systematically reduces the diversity of generated rollouts, causing performance gains to plateau when sampling multiple completions. The finding matters because it exposes a hidden cost to a popular training method: practitioners optimizing for pass@1 metrics may inadvertently cripple their models' ability to explore solution space, limiting real-world utility in domains requiring multiple attempts or diverse reasoning paths.

Modelwire context

Explainer

The paper doesn't just measure a tradeoff; it identifies self-distillation itself as the mechanism causing diversity collapse. This matters because the technique has been treated as a straightforward win in recent practice, and this work reveals the cost is structural, not incidental.

This connects to the broader question of what we can actually infer about model behavior from limited observations. The RevengeBench work from the same day frames policy reconstruction as an inverse problem, asking whether we can reverse-engineer decision logic from behavior. Here, the inverse problem is different: practitioners see pass@1 gains and assume the model is simply 'better,' but this paper shows the model may have actually narrowed its exploration strategy. Both pieces highlight how surface metrics can mask what's happening inside the system's actual decision-making process.

If teams report that models trained with self-distillation show lower diversity on held-out reasoning tasks (not just sampling metrics) within the next two quarters, that confirms the diversity loss is real and not just an artifact of the evaluation setup. If diversity stays stable on those tasks, the finding may be specific to certain domains or model scales.

Coverage we drew on

RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

Mentionsself-distillation · on-policy learning · pass@k evaluation

Read full story at arXiv cs.LG →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.