OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

OrScale extends Muon, an orthogonalization-based training algorithm, with layer-wise adaptive scaling. Its core change replaces global learning-rate control with per-layer trust ratios calibrated to the parameter updates each layer actually receives, which addresses three failure modes of simpler Muon-LAMB hybrids. For language model training specifically, OrScale-LM combines shape-aware scaling with one-time calibration, potentially reducing the hyperparameter tuning burden and improving convergence stability. This matters to practitioners because optimizer efficiency directly affects training cost and model quality, so even incremental gains in scaling rules can be economically significant at scale.
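
To make the idea of a per-layer trust ratio on an orthogonalized update concrete, here is a minimal NumPy sketch of a naive Muon-LAMB style step. It is an illustration only, not OrScale's method, and it omits any calibration. The function names (`orthogonalize`, `trust_ratio`, `hybrid_step`) and the default values are hypothetical, and the orthogonalization uses a plain cubic Newton-Schulz iteration as a stand-in for Muon's own Newton-Schulz variant.

```python
import numpy as np


def orthogonalize(grad, steps=30):
    """Approximate the orthogonal polar factor of `grad`.

    Assumption: plain cubic Newton-Schulz, used here as a simple
    stand-in for the iteration Muon actually uses.
    """
    # Dividing by the Frobenius norm keeps every singular value <= 1,
    # which is inside the convergence region of the cubic iteration.
    x = grad / (np.linalg.norm(grad) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def trust_ratio(weight, update, eps=1e-12):
    """LAMB-style ratio: weight norm divided by update norm, per layer."""
    w_norm = np.linalg.norm(weight)
    u_norm = np.linalg.norm(update)
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0  # fall back to the plain step when a norm vanishes
    return w_norm / (u_norm + eps)


def hybrid_step(weight, grad, lr=0.01):
    """One naive hybrid step for a single layer (hypothetical helper)."""
    update = orthogonalize(grad)
    return weight - lr * trust_ratio(weight, update) * update


# Toy layer: a 3x2 weight matrix and a gradient of the same shape.
w = np.ones((3, 2))
g = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
q = orthogonalize(g)
new_w = hybrid_step(w, g)
```

In this sketch the trust ratio makes each step move the layer by `lr` times its weight norm, regardless of the raw gradient scale. That is the behaviour a per-layer ratio is meant to provide in place of a single global learning rate.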

Modelwire context

Explainer

OrScale's actual novelty is narrower than the framing suggests: it's not a new optimizer, but a calibration method layered onto Muon that addresses specific failure modes when combined with LAMB. The paper doesn't claim to outperform existing optimizers across all settings, only to stabilize a particular hybrid approach.

This sits in a broader pattern we've covered around reducing hyperparameter friction in LLM training. MatryoshkaLoRA (May 8) tackled rank selection in fine-tuning; OrScale targets learning-rate tuning in pretraining. Both assume that practitioners will adopt a method only if it reduces manual search burden. The difference: MatryoshkaLoRA learns hierarchies during training itself, while OrScale requires one-time calibration per architecture. Watch whether the field converges on which friction point matters more for adoption velocity.

If OrScale-LM shows stable convergence across several model scales (for example 1B, 7B, and 70B) without retuning the calibration procedure, that would support the claim of reduced hyperparameter burden. If retuning is required per scale or per dataset, the practical advantage over standard LAMB shrinks significantly.

Coverage we drew on

MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Muon · OrScale · OrScale-LM · LAMB

Read full story at arXiv cs.CL (arxiv.org)

