Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

Researchers have developed a method to trace how individual training examples shape a language model's learned behaviors, moving beyond circuit-level attribution to explain high-level policy decisions. Symbolic Mechanistic Data Attribution decomposes training influence through sparse autoencoder features and probability shifts, offering interpretability practitioners a tool to audit how specific fine-tuning pairs drive model outputs like refusal policies. This bridges a critical gap in mechanistic interpretability: understanding not just which neurons fire, but why models make particular behavioral choices. For safety teams and model developers, this enables more granular auditing of instruction-following and alignment training.

Modelwire context

Explainer

The key advance here isn't just attribution at the neuron level; it's the ability to trace a specific fine-tuning example through to a named behavioral policy, such as refusal, by routing through interpretable sparse autoencoder (SAE) features rather than raw activations. That makes the output legible to safety teams who aren't mechanistic interpretability specialists.
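To make the idea of routing attribution through SAE features concrete, here is a minimal toy sketch. It is not the paper's method: it assumes a purely linear readout, ignores SAE reconstruction error, and uses invented feature names, decoder directions, and activation values. It shows how a shift in a refuse-versus-comply logit gap, measured before and after fine-tuning on a single training pair, can be split into per-feature contributions that sum back to the total shift.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feature_attributions(acts_before, acts_after, decoder, readout):
    """Per-feature contribution to the change in the refuse-vs-comply logit gap.

    Linear toy assumption: the gap equals sum_i act_i * <decoder_i, readout>,
    so the change splits exactly across features.
    """
    return [
        (after - before) * dot(direction, readout)
        for before, after, direction in zip(acts_before, acts_after, decoder)
    ]

# All values below are hypothetical and exist only for illustration.
feature_names = ["harm-request", "polite-tone", "apology-phrase"]
decoder = [            # one decoder direction per SAE feature
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
readout = [2.0, -1.0, 0.0]      # direction separating "refuse" from "comply"
acts_before = [0.2, 1.0, 0.0]   # feature activations before the fine-tuning pair
acts_after = [0.9, 1.0, 0.4]    # feature activations after training on that pair

attributions = feature_attributions(acts_before, acts_after, decoder, readout)

logit_before = sum(a * dot(d, readout) for a, d in zip(acts_before, decoder))
logit_after = sum(a * dot(d, readout) for a, d in zip(acts_after, decoder))
prob_shift = sigmoid(logit_after) - sigmoid(logit_before)

# Rank features by how much of the shift they carry.
ranked = sorted(zip(feature_names, attributions), key=lambda p: abs(p[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.3f}")
print(f"total logit shift: {logit_after - logit_before:+.3f}")
print(f"refusal probability shift: {prob_shift:+.3f}")
```

In this toy, the named features are what a non-specialist would read: the report says which interpretable feature carried most of the shift toward refusal, rather than listing raw activation coordinates. A real model is nonlinear, so any such decomposition there would be an approximation.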

This connects directly to the concern raised in our coverage of 'Representational Depth of Evaluation Awareness Shifts With Scale,' which found that larger models may suppress detectable behavioral signals in ways that defeat standard probing. Symbolic Mechanistic Data Attribution approaches the same underlying problem from the opposite direction: rather than probing for hidden states, it works backward from observed behavior to training data. Together, the two papers sketch a troubling picture in which models can obscure internal representations from probes, and the only reliable audit path may be tracing influence through the training corpus itself. For safety evaluation teams, that is a meaningful methodological implication.

Watch whether safety teams at labs with public fine-tuning pipelines, particularly those using instruction-tuning on curated refusal datasets, attempt to replicate this attribution method on models larger than Llama-3.2-3B. If the SAE decomposition degrades significantly at 70B-plus scale, the practical utility of this approach narrows considerably.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Llama-3.2-3B-Instruct · Symbolic Mechanistic Data Attribution · sparse autoencoder

Read full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial