Research Tools & Code·arXiv cs.LG·May 20

HORST: Composing Optimizer Geometries for Sparse Transformer Training

Transformer sparsification faces a basic tension: standard optimizers struggle to push models toward sparsity while keeping training stable. Adaptive methods naturally favor L-infinity geometry, which supports stability, while sparsity demands an L1 bias. HORST addresses this by composing optimizer steps as non-commutative operators, using hyperbolic mirror maps to inject sparsity pressure without sacrificing convergence. The result is a modular optimizer that is reported to work across vision and language tasks. For practitioners scaling transformers, this targets a real bottleneck in efficient model deployment: the gap between theoretical sparsity and practical training robustness.
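To make the phrase "hyperbolic mirror map" concrete, here is a minimal sketch of one mirror-descent step under a hypentropy-style map, whose gradient is arcsinh(w / beta). This is an assumption chosen for illustration: the summary does not say which hyperbolic map HORST uses, and the function name, `lr`, and `beta` are hypothetical.

```python
import math

def hyperbolic_mirror_step(w, grad, lr=0.1, beta=1e-3):
    """One mirror-descent step under a hypentropy-style map.

    Assumed stand-in for illustration only, not HORST's published update.
    """
    out = []
    for wi, gi in zip(w, grad):
        theta = math.asinh(wi / beta)         # map the weight to the dual space
        theta -= lr * gi                      # plain gradient step in the dual space
        out.append(beta * math.sinh(theta))   # map back to weight space
    return out

# Same gradient on a large weight and a small weight.
w = [1.0, 1e-3]
new_w = hyperbolic_mirror_step(w, [1.0, 1.0])
moved = [abs(a - b) for a, b in zip(new_w, w)]
```

With a small `beta`, the update behaves almost multiplicatively for weights much larger than `beta`, so `moved[0]` is far larger than `moved[1]` and weights that are already small tend to stay small. That is one way a mirror map can encode a sparsity bias without adding an explicit penalty term.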

Modelwire context

Explainer

The non-commutativity point is the part worth sitting with. HORST is not simply blending two objectives; it asserts that the order in which optimizer steps are applied materially changes the geometry of the loss landscape, which is a stronger and less obvious claim than standard multi-objective regularization.
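A toy example shows why order can matter. The sketch below composes two generic operators in both orders: a sign-descent step (steepest descent under an L-infinity norm) and a soft-threshold step (the proximal operator of an L1 penalty). These are stand-ins picked for illustration, not HORST's actual operators, and the gradient is held fixed between steps to keep the comparison simple.

```python
import math

def sign_step(w, grad, lr=0.1):
    """Sign-descent step: steepest descent under an L-infinity norm."""
    return [wi - lr * ((gi > 0) - (gi < 0)) for wi, gi in zip(w, grad)]

def soft_threshold(w, lam=0.05):
    """Proximal step for an L1 penalty: shrink each weight toward zero by lam."""
    return [math.copysign(max(abs(wi) - lam, 0.0), wi) for wi in w]

w = [0.12, -0.03]
grad = [1.0, -1.0]  # hypothetical gradient, held fixed for illustration

# Same two operators, applied in opposite orders.
stable_then_sparse = soft_threshold(sign_step(w, grad))
sparse_then_stable = sign_step(soft_threshold(w), grad)
```

Applying the sign step first leaves the first weight small enough for the threshold to zero it out. Applying the threshold first zeroes the second weight instead, and the sign step then pushes it away from zero again, so the final vector has no exact zeros. The operators do not commute, and only one ordering preserves the sparsity it created.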

The theoretical scaffolding here connects directly to the 'Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction' paper covered the same day, which tightened regret bounds for constrained optimization under adversarial conditions. HORST is essentially asking a related question in a different register: not how to satisfy hard constraints online, but how to compose geometric pressures during gradient descent without losing convergence guarantees. Both papers signal that the optimization theory layer beneath modern training pipelines is receiving serious renewed attention, likely because scaling alone is no longer sufficient to paper over inefficiencies in sparse or constrained regimes.

The practical test is whether HORST's sparsity gains hold when applied to models above the 7B parameter range on standard language benchmarks like MMLU or HellaSwag. If published follow-up results stay confined to vision tasks or sub-1B language models, the convergence story for large-scale deployment remains unverified.

Coverage we drew on

Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HORST · Transformers · Sparse Training

