Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Researchers using sparse autoencoders have isolated specific features in GPT-2 Small that correlate with task failure on indirect object identification (IOI), revealing that a single feature labeled 'cryptographic keys' drives a 93% failure rate when prompts mention keys. This work advances mechanistic interpretability by moving beyond aggregate performance metrics to pinpoint which learned representations cause model errors, offering a replicable audit methodology that could inform both debugging and safety analysis of language models at scale.

Modelwire context

Explainer

The paper's core contribution isn't just finding that a 'cryptographic keys' feature breaks indirect object identification. It's demonstrating a replicable audit methodology that moves mechanistic interpretability from post-hoc explanation toward predictive debugging: you can now identify which learned representations will cause failures before deployment, not after observing them in production.
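To make the idea of a feature-level audit concrete, here is a minimal sketch of the bookkeeping involved: for each sparse-autoencoder feature, compare the task failure rate on prompts where the feature is active against the overall baseline. Everything in it is an assumption for illustration: the `PromptRecord` type, the `failure_rate_by_feature` helper, the activation threshold, the feature labels, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PromptRecord:
    # Hypothetical per-prompt record: feature label -> SAE activation strength.
    feature_acts: Dict[str, float]
    # True if the model answered this IOI prompt incorrectly.
    failed: bool


def failure_rate_by_feature(
    records: List[PromptRecord], threshold: float = 0.0
) -> Dict[str, Tuple[float, int]]:
    """Return {feature: (failure rate when active, number of active prompts)}."""
    counts: Dict[str, Tuple[int, int]] = {}
    for rec in records:
        for name, act in rec.feature_acts.items():
            if act > threshold:  # assumed definition of "feature is active"
                fails, total = counts.get(name, (0, 0))
                counts[name] = (fails + int(rec.failed), total + 1)
    return {name: (fails / total, total) for name, (fails, total) in counts.items()}


# Synthetic toy data, invented for this sketch (not results from the paper).
records = [
    PromptRecord({"keys": 2.1, "names": 0.4}, failed=True),
    PromptRecord({"keys": 1.3, "names": 0.0}, failed=True),
    PromptRecord({"keys": 0.0, "names": 0.9}, failed=False),
    PromptRecord({"keys": 0.0, "names": 1.2}, failed=False),
    PromptRecord({"keys": 0.7, "names": 0.5}, failed=False),
]

rates = failure_rate_by_feature(records)
baseline = sum(r.failed for r in records) / len(records)

# Rank features by how far their conditional failure rate exceeds the baseline.
for name, (rate, n) in sorted(rates.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: failure rate {rate:.2f} over {n} prompts (baseline {baseline:.2f})")
```

A gap between a feature's conditional failure rate and the baseline is only a correlational signal; establishing that a feature causes failures would require an intervention such as ablating it, which this sketch does not attempt.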

This work sits alongside the post-training state distribution paper from May 21st in a broader shift toward granular visibility into model internals. Where that research reframed optimization around which training states matter most, this one zooms into which learned features cause specific failures. Both reject the idea that aggregate loss or performance metrics tell the full story. The sparse autoencoder methodology here also connects to the multi-task learning theory piece from the same date: if shared representations can be audited at the feature level, practitioners can verify whether task-sharing actually produces the interference patterns theory predicts.

If Neuronpedia or comparable feature-cataloging tools release audits of this kind on GPT-2 Medium or larger models within the next six months and identify consistent failure-driving features across model scales, that would support the claim that the methodology generalizes. If the same 'cryptographic keys' feature doesn't appear in those larger models, or if it appears but doesn't correlate with task failure, the finding may be an artifact of GPT-2 Small's specific training dynamics rather than a robust debugging signal.

Coverage we drew on

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-2 Small · Sparse Autoencoders · Indirect Object Identification · Bloom · Neuronpedia

Read full story at arXiv cs.LG (arxiv.org)


Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.