Convergent Evolution: How Different Language Models Learn Similar Number Representations

Researchers found that several model families (Transformers, LSTMs, Linear RNNs, and classical embeddings) independently converge on number representations that are periodic in the Fourier domain, but only some of them achieve the geometric separability needed for modular arithmetic. According to the paper, data, architecture, optimizer, and tokenizer choices together determine whether a model learns numerical features it can actually use.

Modelwire context

Explainer

The buried implication here is diagnostic: convergence on periodic representations is apparently necessary but not sufficient for useful numerical reasoning. A model can learn the right shape of number representation and still fail at modular arithmetic if the geometry doesn't separate cleanly, which means probing for Fourier structure alone won't tell you whether a model can actually do the math.
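To make that distinction concrete, here is a minimal sketch on synthetic data. It is not the paper's method or code: the two embedding tables, the range `N`, the period `T`, and the helper names `dominant_period` and `linear_probe_accuracy` are all assumptions made for illustration. Both tables show the same Fourier peak at period 10, but only the one that carries both cosine and sine components lets a linear probe separate the classes of n mod 10.

```python
import numpy as np

N, T = 200, 10  # hypothetical: integers 0..199, period-10 structure
n = np.arange(N)
labels = n % T

# Table A: cosine and sine at period T, so each residue class sits at its own point on a circle.
emb_a = np.stack([np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)], axis=1)
# Table B: cosine only, so residues r and T - r collapse onto the same value.
emb_b = np.cos(2 * np.pi * n / T)[:, None]


def dominant_period(emb):
    """Return the period with the largest Fourier power, summed over embedding dimensions."""
    spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)) ** 2
    power = spectrum.sum(axis=1)
    k = int(np.argmax(power[1:])) + 1  # skip the DC bin
    return len(emb) / k


def linear_probe_accuracy(emb, labels, num_classes):
    """Fit a least-squares linear probe to one-hot labels and report training accuracy."""
    X = np.hstack([emb, np.ones((len(emb), 1))])  # append a bias column
    Y = np.eye(num_classes)[labels]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(((X @ W).argmax(axis=1) == labels).mean())


for name, emb in [("cos+sin", emb_a), ("cos only", emb_b)]:
    print(name, dominant_period(emb), linear_probe_accuracy(emb, labels, T))
```

In this toy setup a Fourier probe cannot tell the two tables apart, while the linear probe can: the cosine-only table cannot distinguish residue r from T - r, so its accuracy falls well short of perfect. Real model embeddings are higher dimensional and noisier, so this only illustrates why the two diagnostics can disagree.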

This connects most directly to the April 16 piece on 'How Embeddings Shape Graph Neural Networks,' which similarly isolated the embedding layer as a variable independent of backbone architecture. Both papers are asking the same underlying question: how much does the representation format determine downstream capability, versus the training regime around it? The convergent evolution framing here adds a wrinkle that the GNN paper didn't address, namely that architectures can arrive at structurally similar representations through entirely different paths, yet still diverge on what those representations support. The optimizer benchmarking piece from April 16 is also quietly relevant, since this paper flags optimizer choice as one of the factors determining whether geometric separability is achieved.

Watch whether any of the four architecture families studied here shows consistent failure on geometric separability across multiple tokenizer configurations. If tokenizer choice reliably breaks separability even when architecture and optimizer are held constant, that shifts the practical intervention point from training to preprocessing.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Transformers · LSTMs · Linear RNNs

Read the full story at arXiv cs.CL (arxiv.org)
