Research·arXiv cs.LG·14h ago

C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders


Sparse autoencoders (SAEs) have become central to LLM interpretability work, but scaling them exposes a critical flaw: a feature can split across multiple latents or get absorbed into catch-all dimensions, which degrades the reliability of mechanistic explanations. The researchers propose Cross-sample Consistency Regularization (C$^{2}$R) to enforce stable feature assignments across different inputs. This matters because interpretability tools are only useful if their outputs are trustworthy. By mitigating splitting and absorption, the work improves the fidelity of feature decomposition at scale and makes SAEs more viable for production interpretability pipelines.

Modelwire context

Explainer

The paper's deeper contribution is less the regularization technique itself than what it reveals: SAE feature decompositions are currently unstable enough across inputs that the same underlying model behavior can map to different latents depending on which samples are run. By default, that makes cross-run comparisons of mechanistic explanations unreliable.
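To make the instability concrete, here is a minimal sketch of one way such inconsistency could be measured. It is not the paper's metric or its regularizer: the function names `active_latents` and `cross_sample_consistency`, the activation threshold, and the toy activation vectors are all hypothetical, and the sketch assumes you already have SAE activations for inputs believed to express the same concept.

```python
from itertools import combinations


def active_latents(activations, threshold=0.0):
    """Return the set of latent indices whose activation exceeds threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def cross_sample_consistency(samples, threshold=0.0):
    """Mean pairwise Jaccard overlap of active latent sets across samples.

    `samples` is a list of SAE activation vectors for inputs that are
    assumed (by this sketch, not by the paper) to share one concept.
    A score of 1.0 means every sample activates the same latents;
    a score of 0.0 means no two samples share an active latent.
    """
    sets = [active_latents(s, threshold) for s in samples]
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        # Two samples with no active latents count as fully consistent.
        scores.append(len(a & b) / len(union) if union else 1.0)
    if not scores:
        raise ValueError("need at least two samples")
    return sum(scores) / len(scores)


# Toy activations over 4 latents, made up purely for illustration.
stable = [[0.9, 0.0, 0.0, 0.0], [0.7, 0.0, 0.0, 0.0]]   # same latent fires
split = [[0.9, 0.0, 0.0, 0.0], [0.0, 0.8, 0.0, 0.0]]    # concept split in two
partial = [[0.9, 0.4, 0.0, 0.0], [0.7, 0.0, 0.0, 0.0]]  # one extra latent

stable_score = cross_sample_consistency(stable)
split_score = cross_sample_consistency(split)
partial_score = cross_sample_consistency(partial)
```

In this toy setup a concept that always lands on the same latent scores highest, while a concept split across two latents scores lowest. A set-overlap score like this ignores activation magnitudes, so it is only a rough proxy for the kind of stability the article describes.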

Interpretability work sits downstream of training and embedding choices that are themselves poorly understood, which is exactly the tension surfaced in the 'Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms' paper from the same day. That work showed that geometry we thought was noise actually carries signal, and this SAE paper is the mirror image: structure we thought was signal (stable feature assignments) turns out to carry noise. Together they suggest the field is still auditing its own measurement tools, and that those tools cannot yet be trusted for production use. The consistency problem described here also has implications for agentic pipelines like those in 'Self-Evolving World Models for LLM Agent Planning,' where interpretability-guided interventions would need stable feature references to be actionable.

If teams running production SAE pipelines (Anthropic's published work is the most visible benchmark here) adopt cross-sample consistency as a standard evaluation metric within the next two quarters, that signals the field has accepted instability as a real problem rather than a theoretical one.

Coverage we drew on

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Sparse Autoencoders · Large Language Models · Cross-sample Consistency Regularization


