When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures


Researchers tracked how attention-head circuits crystallize during pretraining across three 1B-class models, and report that some attention patterns, such as the absence of BOS-attractor heads in early layers, appear to be fixed by the architecture rather than learned. The mechanistic-interpretability study spans dense transformers and mixture-of-experts architectures. It offers empirical grounding for when and why specific attention patterns emerge, which bears on both model design choices and the interpretability frameworks practitioners use to debug and predict model behavior at scale.

Modelwire context

Explainer

The paper's core finding isn't just that attention patterns emerge during training, but that certain architectural limitations (like missing BOS-attractor heads) appear to be structural constraints baked into model design rather than learned behaviors that could be trained away.
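To make the term concrete: a BOS-attractor (attention-sink) head is one whose queries put most of their attention weight on the first token. The sketch below is one plausible way to flag such heads from attention matrices, not the paper's method. The function names, the nested-list layout, and the 0.5 threshold are all assumptions chosen for illustration, and the matrices are toy values.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # attn[q][k]: weight that query q places on key k


def bos_attention_mass(attn: Matrix) -> float:
    """Mean attention weight on key position 0, averaged over queries 1..T-1.

    Query 0 is skipped: under a causal mask it can only attend to itself,
    so including it would inflate the score for every head.
    """
    if len(attn) < 2:
        raise ValueError("need at least two query positions")
    rows = attn[1:]
    return sum(row[0] for row in rows) / len(rows)


def find_bos_attractor_heads(
    layers: List[List[Matrix]], threshold: float = 0.5  # threshold is an assumption
) -> List[Tuple[int, int, float]]:
    """Return (layer, head, score) for heads whose BOS mass meets the threshold."""
    hits = []
    for layer_idx, heads in enumerate(layers):
        for head_idx, attn in enumerate(heads):
            score = bos_attention_mass(attn)
            if score >= threshold:
                hits.append((layer_idx, head_idx, score))
    return hits


# Toy data (invented): one layer, two heads, sequence length 3, causal rows sum to 1.
sink_head = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]
local_head = [
    [1.0, 0.0, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.2, 0.7],
]
toy_layers = [[sink_head, local_head]]
hits = find_bos_attractor_heads(toy_layers)
```

On the toy input only the first head is flagged. In practice the attention matrices would come from a real model's forward pass, averaged over many sequences, and the threshold would need to be chosen against the score distribution rather than fixed in advance.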

This connects directly to the SAE interpretability work from earlier this month ('How Optimality Structures Sparse Dictionaries'), which tackled the theory of why feature extraction works at all. Where that paper formalized the conditions for interpretable decomposition, this one empirically maps when specific circuits actually form across different architectures. Together they're building a mechanistic grammar: SAEs tell us what features exist, and this work tells us when and why those features crystallize. The finding also echoes the congruence-expressivity bottleneck paper, suggesting that architectural choices constrain what the model can learn, not just how efficiently it learns.

One test would separate removable constraints from fundamental ones: add BOS-attractor capacity to early layers and check whether the attention-head formation timeline changes. A shifted timeline would point to a design choice, while an unchanged one would point to a hard limit. Watch for follow-up work on OLMoE and other MoE variants to see if mixture-of-experts architectures exhibit different crystallization patterns, which would indicate whether sparsity itself reshapes circuit emergence.
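Comparing formation timelines needs an operational definition of when a head has "formed". A minimal sketch, assuming a per-checkpoint score for one head (for example, its mean attention weight on the first token) and a hypothetical threshold of 0.5: take the earliest checkpoint from which the score stays at or above the threshold. The function name, the threshold, and every number below are invented for illustration and are not results from the paper.

```python
from typing import Dict, Optional


def emergence_step(
    scores_by_step: Dict[int, float], threshold: float = 0.5  # threshold is an assumption
) -> Optional[int]:
    """Earliest checkpoint step from which the score stays at or above threshold.

    Returns None if the score at the final checkpoint is below the threshold.
    """
    onset = None
    for step in sorted(scores_by_step):
        if scores_by_step[step] >= threshold:
            if onset is None:
                onset = step
        else:
            onset = None  # a later dip below threshold resets the onset
    return onset


# Invented toy trajectories for one head: training step -> score.
baseline = {0: 0.02, 1000: 0.10, 2000: 0.55, 4000: 0.40, 8000: 0.70, 16000: 0.85}
modified = {0: 0.03, 1000: 0.60, 2000: 0.72, 4000: 0.80, 8000: 0.86, 16000: 0.90}

baseline_onset = emergence_step(baseline)  # 8000: the dip at step 4000 resets it
modified_onset = emergence_step(modified)  # 1000
```

Requiring the score to stay above the threshold keeps a single noisy checkpoint from being counted as formation. A real comparison would also need multiple seeds before reading anything into a gap between the two onsets.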

Coverage we drew on

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Pythia 1B · OLMo 1B · OLMoE · The Pile · DCLM

Read full story at arXiv cs.LG →(arxiv.org)

Modelwire Editorial
