Modelwire

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

Researchers have closed a long-standing gap in neural network theory: how long the infinite-width approximation actually holds when sequence depth and model width grow together. Modern recurrent models operate in regimes where both scale simultaneously, yet prior signal-propagation theory let width alone approach infinity. This work derives exact finite-width formulas that reveal three distinct scaling regimes, with practical implications for knowing when theoretical predictions break down in real recurrent architectures. The finding matters for practitioners tuning state-space models and RNNs, since it clarifies which depth-width combinations preserve theoretical guarantees and where empirical behavior diverges.

Modelwire context

Explainer

The paper doesn't just confirm that the infinite-width approximation breaks down at finite scale (which was known); it derives exact formulas for where and how sharply it breaks. The key insight is that depth and width must scale together in specific ratios to preserve theoretical predictions, rather than independently as prior work assumed.
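The paper's exact formulas aren't reproduced here, but the basic finite-width effect is easy to see in a toy model. The sketch below uses fresh i.i.d. Gaussian layers rather than the shared recurrent matrix the paper studies (an assumption made for simplicity): in that setting each layer multiplies the signal's squared norm by an independent chi-squared factor, so infinite-width theory predicts no fluctuation while finite-width fluctuations grow with the depth-to-width ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_norm_std(width, depth, trials=5000):
    """Std of the log squared-norm of a signal after `depth` random linear
    layers of size `width` (fresh i.i.d. Gaussian weights, variance 1/width).
    Each layer multiplies the squared norm by an independent chi^2_width/width
    factor, so the log-norm is a sum of `depth` i.i.d. terms, each with
    variance ~ 2/width. Infinite-width theory predicts zero fluctuation; at
    finite width the std grows roughly like sqrt(2 * depth / width)."""
    factors = rng.chisquare(width, size=(trials, depth)) / width
    log_norm = np.log(factors).sum(axis=1)
    return log_norm.std()

# Fluctuations track the depth/width ratio, not width alone.
for width, depth in [(512, 16), (512, 256), (64, 256)]:
    s = log_norm_std(width, depth)
    pred = np.sqrt(2 * depth / width)
    print(f"width={width:4d} depth={depth:4d}  std={s:.3f}  sqrt(2T/n)={pred:.3f}")
```

Running it shows width 512 at depth 16 staying near the infinite-width prediction while width 64 at depth 256 drifts far from it, which is the qualitative point: what matters is the ratio, not width in isolation.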

This connects directly to the MIT scaling laws work from early May, which identified superposition as the mechanistic driver behind why models improve predictably with scale. That story explained the empirical pattern; this paper adds precision about the boundary conditions where those patterns hold. It also echoes the MLP initialization work from the same week, which showed how to replace empirical sampling with closed-form approximations at scale. Together, these three papers form a coherent thread: understanding what happens when you actually scale networks, not just in the limit.

If practitioners applying this framework to modern state-space models (like Mamba or S4) report that the predicted regime boundaries match empirical loss curves within 10% error, the theory has real predictive power. If the boundaries don't match, the gap likely points to architectural details (such as gating or normalization) that the linear-recurrence model doesn't capture.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Linear recurrent models · Signal propagation theory · State-space models · RNNs


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
