Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Entropy collapse during RL fine-tuning of language models degrades rollout diversity and the quality of the learning signal. Researchers propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a parameter-level fix that embeds exploratory behavior directly into model weights rather than relying on external temperature adjustments or entropy penalties. The method reconstructs a self-teacher via high-temperature logit scaling, then distills that teacher's behavior back into the collapsed checkpoint. This addresses a concrete bottleneck in reasoning-focused RL workflows and gives practitioners a lightweight alternative to existing mitigation strategies, which operate outside the model itself.
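To make the two steps concrete, here is a minimal NumPy sketch of a temperature-scaled teacher and a distillation loss. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the exact objective, so the forward KL divergence, the example logits, and the temperature value are all assumed here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy in nats."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum())

# Hypothetical next-token logits from a collapsed checkpoint:
# almost all probability mass sits on the first token.
collapsed_logits = np.array([8.0, 1.0, 0.5, 0.0])
TEACHER_TEMPERATURE = 4.0  # assumed value, chosen only for illustration

student = softmax(collapsed_logits)                                   # policy at T = 1
teacher = softmax(collapsed_logits, temperature=TEACHER_TEMPERATURE)  # "reheated" self-teacher

# Assumed distillation objective: pull the student toward the teacher.
distill_loss = kl_divergence(teacher, student)

print("student entropy:", entropy(student))
print("teacher entropy:", entropy(teacher))
print("distillation loss:", distill_loss)
```

The teacher is the same set of logits read at a higher temperature, so its distribution is flatter and its entropy is higher than the student's. Minimizing the loss moves that flatter distribution into the student itself.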

Modelwire context

Explainer

The key insight is that TS-OPSD embeds exploratory behavior into the weights during training rather than applying temperature scaling at inference time. The fix therefore persists across downstream use cases, whereas external temperature adjustments must be tuned for each deployment.
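A toy sketch can show what "internalizing" a temperature means. Below, the only parameters are the logits themselves, and plain gradient descent on KL(teacher || student) moves them until sampling at temperature 1 reproduces the high-temperature distribution. Real training would update network weights over on-policy rollouts; the function name `internalize_temperature`, the step count, and the learning rate are hypothetical choices for this illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def internalize_temperature(logits, temperature, steps=1000, lr=0.5):
    """Fit new logits whose T=1 softmax matches softmax(logits / temperature)."""
    teacher = softmax(logits, temperature)       # fixed target distribution
    student_logits = np.array(logits, dtype=float)
    for _ in range(steps):
        student = softmax(student_logits)
        # Gradient of KL(teacher || student) with respect to the student logits.
        student_logits -= lr * (student - teacher)
    return student_logits

collapsed_logits = np.array([8.0, 1.0, 0.5, 0.0])  # hypothetical collapsed policy
reheated_logits = internalize_temperature(collapsed_logits, temperature=4.0)

before = softmax(collapsed_logits)   # sampled at T = 1 before distillation
after = softmax(reheated_logits)     # sampled at T = 1 after distillation
print("entropy before:", entropy(before))
print("entropy after: ", entropy(after))
```

After the update, no inference-time temperature knob is needed: the default T = 1 distribution already carries the extra entropy.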

This connects directly to the broader pattern surfaced in recent coverage on LLM reasoning instability. The cross-generational adversarial attack paper (late May) showed that safety and robustness do not improve monotonically from one model generation to the next, and the multi-agent debate decomposition work revealed that apparent reasoning improvements often mask model collapse into spurious agreement. TS-OPSD targets the same underlying problem from the training side: when models lose exploratory diversity during RL, they converge to brittle, low-signal rollouts. Where earlier mitigation strategies relied on external knobs, the self-distillation approach intervenes at the parameter level.

If teams report that TS-OPSD checkpoints maintain rollout diversity on held-out reasoning benchmarks (GPQA, ARC-Challenge) without requiring per-task temperature tuning, that would support the claim that the fix is truly internalized. If adoption remains confined to research settings and practitioners continue relying on inference-time temperature scaling in production, the method hasn't solved the actual deployment friction.
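One way a team might check the diversity claim is to sample several rollouts per problem at the default temperature and compare simple diversity statistics across checkpoints. The sketch below uses two generic measures, the ratio of unique final answers and the entropy of the empirical answer distribution. Both metrics and the sample answers are assumptions chosen for illustration; the summary does not say which diversity measure the paper reports.

```python
import math
from collections import Counter

def unique_answer_ratio(answers):
    """Fraction of distinct final answers among the sampled rollouts."""
    if not answers:
        return 0.0
    return len(set(answers)) / len(answers)

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over final answers."""
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical final answers from four rollouts on one problem, sampled at T = 1.
collapsed_rollouts = ["42", "42", "42", "42"]  # every rollout agrees
reheated_rollouts = ["42", "41", "42", "7"]    # some exploration remains

for name, rollouts in [("collapsed", collapsed_rollouts), ("reheated", reheated_rollouts)]:
    print(name, unique_answer_ratio(rollouts), answer_entropy(rollouts))
```

Diversity alone is not the goal, so a check like this would normally sit alongside accuracy on the same benchmark.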

Coverage we drew on

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Temperature-Scaled On-Policy Self-Distillation · TS-OPSD

Read the full story at arXiv cs.CL (arxiv.org)
