Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Researchers report a practical principle for finetuning large language models: using the same optimizer for supervised finetuning as for pretraining reduces catastrophic forgetting while maintaining or improving task performance. According to the paper, this approach outperforms both alternative optimizers and parameter-efficient methods such as LoRA. The finding suggests that optimizers act as implicit regularizers that shape model geometry around the pretrained checkpoint, giving practitioners a simple lever for balancing knowledge retention against new-task acquisition without architectural changes.

Modelwire context

Explainer

The paper isolates optimizer selection itself as a regularization mechanism, not just a hyperparameter tuning detail. The claim is that matching pretraining and finetuning optimizers reduces forgetting more effectively than LoRA or other architectural workarounds, suggesting the optimizer's implicit bias toward the pretrained loss landscape is doing the heavy lifting.
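As a rough illustration of what "matching" could look like in a training setup, the sketch below carries the pretraining optimizer family over to finetuning and only shrinks the learning rate. The `OptimizerSpec` record, the `matched_finetune_spec` and `is_consistent` helpers, the `lr_scale` default, and the choice to keep weight decay unchanged are all hypothetical and chosen for illustration; none of them comes from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizerSpec:
    """Hypothetical record of an optimizer choice (not a schema from the paper)."""
    name: str              # optimizer family, e.g. "adamw" or "sgd"
    learning_rate: float
    weight_decay: float = 0.0


def matched_finetune_spec(pretrain: OptimizerSpec, lr_scale: float = 0.1) -> OptimizerSpec:
    """Reuse the pretraining optimizer family for finetuning.

    Only the learning rate is scaled down. Both lr_scale and the decision to
    keep weight decay unchanged are illustrative assumptions.
    """
    if not 0.0 < lr_scale <= 1.0:
        raise ValueError("lr_scale must be in (0, 1]")
    return OptimizerSpec(
        name=pretrain.name,
        learning_rate=pretrain.learning_rate * lr_scale,
        weight_decay=pretrain.weight_decay,
    )


def is_consistent(pretrain: OptimizerSpec, finetune: OptimizerSpec) -> bool:
    """Return True when finetuning uses the same optimizer family as pretraining."""
    return pretrain.name.lower() == finetune.name.lower()
```

In practice the check depends on knowing which optimizer produced the checkpoint, which is not always documented for released models; a helper like `is_consistent` only makes that assumption explicit.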

This connects directly to the MIT scaling work from early May, which identified superposition as the mechanistic driver of predictable scaling. Both papers treat model internals as having discoverable structure rather than as black boxes. Where the MIT work explained why more parameters help, this paper explains how optimizer geometry preserves what those parameters learned. The finding also complements the federated unlearning work (EASE, same week), which tackled how knowledge couples across modalities; here the coupling is temporal (pretraining to finetuning), and the lever is optimizer choice rather than architectural decoupling.

If practitioners report over the next two quarters that optimizer matching outperforms LoRA on proprietary, domain-specific finetuning tasks, the finding will have moved beyond academic validation. Conversely, if the effect disappears when finetuning datasets exceed 10% of pretraining scale, or when different model families are used, that would suggest the mechanism is narrower than claimed.
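Checking either outcome requires measuring retention and task performance side by side. The sketch below shows one simple way to frame that comparison, assuming scores where higher is better; the function names and the dominance rule are illustrative choices, not the paper's evaluation protocol.

```python
from typing import Tuple


def forgetting(retained_before: float, retained_after: float) -> float:
    """Score drop on a held-out general-knowledge benchmark.

    Positive values mean the model lost ground during finetuning.
    """
    return retained_before - retained_after


def task_gain(task_before: float, task_after: float) -> float:
    """Score improvement on the finetuning task (higher is better)."""
    return task_after - task_before


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """Compare two runs given as (forgetting, task_gain) pairs.

    Run a dominates run b if it forgets no more and gains no less,
    and is strictly better on at least one of the two.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better
```

A practitioner comparing a matched-optimizer run against a LoRA run would compute both pairs on the same benchmarks and check whether either run dominates; if neither does, the choice is a trade-off rather than a clear win.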

Coverage we drew on

MIT study explains why scaling language models works so reliably · The Decoder

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Read the full story at arXiv cs.LG (arxiv.org).