OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

Researchers tackle a fundamental problem in on-policy self-distillation for LLMs: teacher models generate biased or template-shifted responses during reasoning tasks, corrupting the token-level supervision that students learn from. OGLS-SD addresses this by using outcome rewards to identify which trajectories succeeded or failed, then steering logits to recalibrate teacher guidance before distillation. This bridges a gap between coarse-grained correctness signals and fine-grained learning, potentially improving how LLMs bootstrap their own reasoning without external data. The work matters for scaling reasoning models efficiently, especially as self-improvement becomes central to frontier model development.
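The summary does not spell out the steering rule, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a binary outcome reward per trajectory and a hypothetical rule that nudges the teacher's logit for the sampled token up on successful trajectories and down on failed ones before computing a KL distillation loss. The function names and the `alpha` strength parameter are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_teacher_logits(teacher_logits, sampled_token, reward, alpha=1.0):
    """Hypothetical steering rule (an assumption, not taken from the paper):
    raise the sampled token's logit by alpha if the trajectory succeeded
    (reward > 0), lower it by alpha if the trajectory failed."""
    sign = 1.0 if reward > 0 else -1.0
    steered = list(teacher_logits)  # copy so the caller's list is untouched
    steered[sampled_token] += alpha * sign
    return steered


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, sampled_token, reward, alpha=1.0):
    """Token-level loss: KL from the steered teacher distribution to the student."""
    target = softmax(steer_teacher_logits(teacher_logits, sampled_token, reward, alpha))
    return kl_divergence(target, softmax(student_logits))
```

In this toy form, setting `alpha=0.0` recovers plain self-distillation, so the steering strength is the only thing separating the two regimes. A real implementation would operate on batched tensors and would likely use a richer steering signal than a single per-token offset.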

Modelwire context

Explainer

The core insight is that this work treats the teacher and student as the same model at different points in training. Errors in teacher outputs therefore do more than add noise: they reinforce the model's own bad habits in a feedback loop that standard distillation pipelines were not designed to handle.

The calibration problem at the center of OGLS-SD connects directly to what we covered in 'ORCE: Order-Aware Alignment of Verbalized Confidence' (story 3), where researchers found that jointly optimizing answer quality and confidence signals causes each to degrade the other. OGLS-SD faces an analogous coupling problem: outcome correctness and token-level supervision are entangled in ways that corrupt training. Both papers converge on the same structural insight: disentangling coarse signals from fine-grained learning objectives produces cleaner results. The ORBIT piece on catastrophic forgetting during fine-tuning (story 5) is also relevant here, since self-distillation loops carry their own version of parameter drift risk when teacher guidance is systematically miscalibrated.

Watch whether OGLS-SD's logit steering approach holds up under ablation on multi-step mathematical reasoning benchmarks like MATH-500 or AIME. The key test is whether the gains collapse when the steering is removed and only the outcome-reward filtering is kept. If they do, that would suggest the steering component is doing real work rather than just benefiting from better data selection.
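One way to frame that ablation is as a 2x2 grid over the two components, sketched below. The variant names, the config keys, and the `attribute_gain` helper are hypothetical and do not come from the paper. The scores passed in would be whatever benchmark accuracy each variant achieves.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four hypothetical variants: each component on or off."""
    variants = []
    for use_filtering, use_steering in product([False, True], repeat=2):
        parts = [
            label
            for label, enabled in (("filter", use_filtering), ("steer", use_steering))
            if enabled
        ]
        variants.append(
            {
                "name": "+".join(parts) or "baseline",
                "outcome_filtering": use_filtering,
                "logit_steering": use_steering,
            }
        )
    return variants


def attribute_gain(scores):
    """Given scores keyed by variant name, report each component's gain
    over the plain self-distillation baseline."""
    base = scores["baseline"]
    return {
        "filter_only_gain": scores["filter"] - base,
        "steer_only_gain": scores["steer"] - base,
        "combined_gain": scores["filter+steer"] - base,
    }
```

If `filter_only_gain` accounts for most of `combined_gain`, data selection is carrying the result. If `combined_gain` is clearly larger than `filter_only_gain`, the steering step is contributing something beyond data selection.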

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OGLS-SD · on-policy self-distillation · outcome-guided logit steering

Read full story at arXiv cs.LG (arxiv.org)

