OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

OrScale extends Muon, an orthogonalization-based training algorithm, with layer-wise adaptive scaling. Its core change replaces global learning-rate control with per-layer trust ratios calibrated to the parameter updates each layer actually receives, addressing three failure modes that affect simpler Muon-LAMB hybrids. For language model training specifically, OrScale-LM combines shape-aware scaling with one-time calibration, which could reduce the hyperparameter tuning burden and improve convergence stability. For practitioners, optimizer efficiency directly affects training cost and model quality, so even incremental gains in scaling rules can be economically significant at scale.
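To make the idea of a per-layer trust ratio on an orthogonalized update concrete, here is a minimal NumPy sketch. It is not OrScale's actual rule, which the summary does not spell out. It assumes a generic LAMB-style ratio (weight norm divided by update norm), uses an exact SVD in place of the Newton-Schulz iteration Muon uses, and omits momentum. The function names `orthogonalize` and `trust_ratio_step` are hypothetical.

```python
import numpy as np

def orthogonalize(grad):
    """Return the semi-orthogonal polar factor U @ Vt of a 2-D gradient.

    Assumption: exact SVD stands in for Muon's Newton-Schulz iteration.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def trust_ratio_step(weight, grad, lr=0.02, eps=1e-8):
    """One illustrative update for a single layer.

    Assumption: a LAMB-style trust ratio ||W|| / ||update||, computed on the
    orthogonalized update rather than on the raw gradient.
    """
    update = orthogonalize(grad)
    w_norm = np.linalg.norm(weight)   # Frobenius norm of the layer weights
    u_norm = np.linalg.norm(update)   # Frobenius norm of the actual update
    ratio = w_norm / (u_norm + eps) if w_norm > 0 else 1.0
    return weight - lr * ratio * update, ratio

# Usage on one hypothetical 4x3 layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
new_W, ratio = trust_ratio_step(W, G, lr=0.1)
```

With this ratio, the step norm is roughly `lr` times the weight norm regardless of the gradient's magnitude. A full-rank orthogonalized update for an m×n layer has Frobenius norm sqrt(min(m, n)), so the ratio in this sketch also depends on layer shape, which is one plausible reading of why shape-aware scaling matters in a hybrid like this.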
Modelwire context
Explainer
OrScale's actual novelty is narrower than the framing suggests: it's not a new optimizer, but a calibration method layered onto Muon that addresses specific failure modes when combined with LAMB. The paper doesn't claim to outperform existing optimizers across all settings, only to stabilize a particular hybrid approach.
This sits in a broader pattern we've covered around reducing hyperparameter friction in LLM training. MatryoshkaLoRA (May 8) tackled rank selection in fine-tuning; OrScale targets learning-rate tuning in pretraining. Both assume that practitioners will adopt a method only if it reduces manual search burden. The difference: MatryoshkaLoRA learns hierarchies during training itself, while OrScale requires one-time calibration per architecture. Watch whether the field converges on which friction point matters more for adoption velocity.
If OrScale-LM shows stable convergence across three different model scales (1B, 7B, 70B) without retuning the calibration procedure, that validates the claim about reduced hyperparameter burden. If retuning is required per scale or per dataset, the practical advantage over standard LAMB shrinks significantly.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Muon · OrScale · OrScale-LM · LAMB
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.