Susceptibilities and Patterning: A Primer on Linear Response in Bayesian Learning

A new theoretical framework for interpreting neural networks through susceptibilities, derived from Bayesian learning principles, offers a unified lens for understanding how model components respond to data perturbations. By connecting posterior covariances to influence functions and structural patterns, this work enables practitioners to map which network features activate in response to specific data distributions, advancing the interpretability toolkit beyond black-box analysis. The approach has direct implications for debugging model behavior, detecting spurious correlations, and building more transparent learning systems.
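The link between posterior covariances and responses to data perturbations can be checked numerically on a toy problem. The sketch below is illustrative and not taken from the paper: the quadratic per-sample loss, the one-dimensional parameter grid, the observable phi(w) = w, and all function names are assumptions, and the paper's own sign and normalization conventions may differ. It up-weights one sample's loss by a small amount h and compares the resulting shift in a posterior expectation against the negative posterior covariance between the observable and that sample's loss.

```python
import math

def sample_loss(w, x):
    # Hypothetical per-sample loss: squared error between parameter w and datum x.
    return 0.5 * (w - x) ** 2

def posterior(grid, data, h=0.0, index=0):
    # Discrete posterior on a grid with a flat prior:
    # p_h(w) proportional to exp(-sum_j loss_j(w) - h * loss_index(w)).
    log_p = [
        -(sum(sample_loss(w, x) for x in data) + h * sample_loss(w, data[index]))
        for w in grid
    ]
    shift = max(log_p)  # subtract the max for numerical stability
    unnorm = [math.exp(lp - shift) for lp in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expectation(p, values):
    return sum(pi * v for pi, v in zip(p, values))

def response_via_covariance(grid, data, index):
    # Linear-response prediction: d/dh E_h[phi] at h = 0 equals -Cov(phi, loss_index).
    p = posterior(grid, data)
    phi = grid  # observable phi(w) = w
    ell = [sample_loss(w, data[index]) for w in grid]
    joint = expectation(p, [a * b for a, b in zip(phi, ell)])
    return -(joint - expectation(p, phi) * expectation(p, ell))

def response_via_finite_difference(grid, data, index, h=1e-5):
    # Direct estimate: re-weight the sample and measure the shift in E[phi].
    up = expectation(posterior(grid, data, +h, index), grid)
    down = expectation(posterior(grid, data, -h, index), grid)
    return (up - down) / (2 * h)

grid = [-5 + 0.01 * k for k in range(1001)]  # parameter values from -5 to 5
data = [0.0, 1.0, 3.0]

cov_value = response_via_covariance(grid, data, index=2)
fd_value = response_via_finite_difference(grid, data, index=2)
print(cov_value, fd_value)
```

In this toy setting the posterior is close to a Gaussian with mean 4/3 and variance 1/3, so both estimates should land near 5/9: up-weighting the sample at x = 3 pulls the posterior mean toward 3. The point is only that a covariance computed under the unperturbed posterior predicts the response to a perturbation. Whether that carries over to real networks is the open question raised in the commentary that follows.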

Modelwire context

Explainer

The paper formalizes a bridge between Bayesian posterior geometry and neural network influence functions. What it leaves open is scope: it is unclear whether the framework scales to modern architectures or remains a tool for post-hoc analysis of smaller models.

This work sits at the center of a coherent interpretability push across the archive. The susceptibilities framework here parallels the mechanistic interventions used in recent papers on tool calling and latent planning (both from this week), which also measure how model components respond to perturbations. However, where those papers validated their findings through causal interventions on actual model activations, this one derives theory from Bayesian principles without yet demonstrating causal validation on real networks. That gap matters because the mechanistic interpretability audit from the same week flagged exactly this problem: papers invoking causal language without disclosing identification assumptions. The susceptibilities approach needs to be explicit about what assumptions allow practitioners to move from correlation in posterior covariances to causal claims about feature activation.

If the authors release code that successfully predicts which neurons activate in response to held-out data distributions on a 7B+ parameter model, and those predictions hold up under activation patching (not just linear probing), that confirms the framework has practical teeth. If no such validation appears within six months, it remains a theoretical contribution without a clear path to the interpretability workflows practitioners actually use.

Coverage we drew on

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Bayesian learning · Neural networks · Influence matrix · Fluctuation-dissipation theorem

Read full story at arXiv cs.LG (arxiv.org)

Modelwire Editorial


