Modelwire
Do Sparse Autoencoders Capture Concept Manifolds?

Sparse autoencoders have become central to mechanistic interpretability work, but a fundamental assumption about how they encode concepts may be wrong. This paper challenges the prevailing linear-feature model by showing that concepts often live on continuous manifolds rather than isolated directions. The authors develop a theoretical framework distinguishing two capture modes: global (compact atom clusters spanning entire manifolds) and local (distributed across multiple features). This matters because it reshapes how researchers should design and validate SAEs for real-world interpretability tasks, potentially invalidating conclusions from studies that assumed independence between concept directions.

Modelwire context

Explainer

The paper's sharpest contribution is not the critique of linear-feature assumptions but the two-mode taxonomy. In global capture, a compact cluster of SAE atoms spans an entire concept manifold; in local capture, the manifold is spread across a distributed coalition of features. That distinction gives practitioners a diagnostic vocabulary they did not have before, which is more immediately useful than the theoretical critique alone.
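
To make the geometry concrete, here is a toy sketch in Python with NumPy. It is not the paper's method or experimental setup: the circular "concept", the hand-placed dictionary atoms, and the helper names (`circle_points`, `tiling_atoms`, `top1_error`) are all assumptions made for illustration. It shows only that a concept living on a circle cannot be reconstructed well by one direction, while a few atoms tiling the circle can, each firing on its own arc.

```python
import numpy as np

def circle_points(n, dim=8, seed=0):
    """Sample n points on a unit circle embedded in a random 2-D plane of R^dim."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis for a random plane (a hypothetical "concept subspace").
    basis, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n, 2)
    return coords @ basis.T, basis                               # shape (n, dim)

def tiling_atoms(k, basis):
    """Place k unit-norm dictionary atoms evenly around the circle (hand-built, not learned)."""
    angles = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (k, 2)
    return coords @ basis.T                                      # shape (k, dim)

def top1_error(points, atoms):
    """Encode each point with its single best-aligned atom and a ReLU coefficient.

    Returns the mean squared reconstruction error and the chosen atom index per point.
    """
    scores = points @ atoms.T                        # alignment of each point with each atom
    idx = scores.argmax(axis=1)                      # one active atom per point
    coef = np.maximum(scores[np.arange(len(points)), idx], 0.0)  # ReLU, as in a typical SAE encoder
    recon = coef[:, None] * atoms[idx]
    err = float(np.mean(np.sum((points - recon) ** 2, axis=1)))
    return err, idx

points, basis = circle_points(360)
err_one, _ = top1_error(points, tiling_atoms(1, basis))      # a single direction
err_many, used = top1_error(points, tiling_atoms(8, basis))  # eight atoms tiling the circle

print(f"one atom:    mean squared error = {err_one:.3f}")
print(f"eight atoms: mean squared error = {err_many:.3f}")
print(f"distinct atoms in use: {len(set(used.tolist()))}")
```

In this toy setup the single-atom error is far larger than the eight-atom error, and all eight atoms end up in use, each on a contiguous arc of the circle. How that picture maps onto the paper's global and local modes is not something the sketch settles.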

The geometry-first framing here connects directly to the S2VAE work covered the same day ('Beyond Gaussian Bottlenecks'), which argued that standard latent representations fail to preserve the topological structure of the spaces they encode. Both papers are, at root, making the same complaint about different systems: the math we use to compress representations assumes a flatness that the underlying data doesn't have. Where S2VAE addresses this in vision transformers by swapping in Power Spherical distributions, this paper surfaces the analogous problem inside interpretability tooling. Neither paper proposes a fully validated fix, which is worth noting.

Watch whether any of the major SAE-based interpretability groups (Anthropic's interpretability team being the most visible) publish a replication or rebuttal within the next two quarters. If they adopt the global/local capture framing in follow-on work, the taxonomy is sticking; if they ignore it, the paper may remain a theoretical provocation without practical uptake.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Sparse Autoencoders · Neural Networks · Mechanistic Interpretability

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
