Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Researchers identify a failure mode in self-distillation for math reasoning: privileged context inflates the model's confidence on structural tokens while suppressing the deliberation signals needed for multi-step search. Anti-Self-Distillation inverts the training objective, maximizing divergence between student and teacher to preserve exploratory reasoning patterns. This addresses a gap in which standard distillation succeeds on language tasks but fails on reasoning, suggesting that reasoning requires different training dynamics from pattern matching. The finding could change how teams approach capability scaling in domains that require search and verification.

Modelwire context

Explainer

The paper's most underreported implication is directional: it suggests that the distillation literature for reasoning tasks may have been pushing in the wrong direction, rather than merely optimizing the right objective poorly. Standard distillation loss treats teacher-student agreement as the goal, but for multi-step search problems, that agreement actively destroys the property you're trying to transfer.
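To make the directional point concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the three-token vocabulary, the logits, the choice of KL(teacher || student) as the agreement loss, and the helper names are all illustrative assumptions, as is the setup in which the teacher is the same model conditioned on privileged context. It shows the per-token log-ratio that pointwise mutual information measures, the loss that standard distillation minimizes, and the same loss with its sign flipped.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pointwise_mutual_information(p_with_context, p_without_context):
    """Per-token log-ratio: log p(token | prompt, context) - log p(token | prompt)."""
    return [math.log(a / b) for a, b in zip(p_with_context, p_without_context)]

# Hypothetical vocabulary: one structural token and two deliberation tokens.
vocab = ["therefore", "wait", "hmm"]

# Assumed toy logits: the teacher sees privileged context and is confident,
# the student does not and spreads probability over deliberation tokens.
teacher = softmax([3.0, 0.5, 0.2])
student = softmax([1.0, 0.9, 0.8])

# Standard distillation minimizes this, pulling the student toward the teacher.
distill_loss = kl_divergence(teacher, student)

# Naive sign flip: minimizing this value increases the divergence instead.
anti_distill_loss = -distill_loss

# Positive where the context raises a token's probability, negative where it lowers it.
pmi = pointwise_mutual_information(teacher, student)

for token, score in zip(vocab, pmi):
    print(f"{token:>10}: PMI = {score:+.3f}")
print(f"distillation loss      = {distill_loss:.3f}")
print(f"sign-flipped objective = {anti_distill_loss:.3f}")
```

In this toy setting the structural token gets a positive log-ratio and the deliberation tokens get negative ones, which mirrors the failure mode the summary describes. A bare sign flip like this is unbounded, so a practical objective would need some form of weighting or regularization; how the paper handles that is not covered by the summary.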

This sits in direct tension with the same day's coverage of 'Learning to Foresee,' which argues that on-policy distillation accelerates training through trajectory stabilization. Both papers study distillation dynamics, but they reach structurally different conclusions: 'Learning to Foresee' finds that teacher alignment produces useful gradient structure, while Anti-Self-Distillation finds that alignment collapses the exploratory behavior that reasoning depends on. The reconciliation probably lies in task type (language modeling versus multi-step search), but neither paper addresses the other's domain directly, so the conflict remains unresolved.

If a team applies Anti-Self-Distillation to a non-math reasoning benchmark such as GPQA or ARC-AGI and sees the same divergence-preserving gains, that would suggest the mechanism is general. If the gains are confined to structured math tasks, the effect may be specific to domains where solution paths are enumerable and verifiable.

Coverage we drew on

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.


Read the full story at arXiv cs.CL (arxiv.org)
