STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

STRIDE addresses a fundamental bottleneck in training data attribution for LLMs by shifting from parameter-space gradient tracking to activation-space modeling. Rather than repeatedly retraining models to measure causal influence, the framework uses sparse recovery to estimate how training examples shape model outputs. This matters because attribution remains critical for auditing, debugging, and defending against data poisoning, yet existing methods don't scale to billion-parameter models. The activation-space approach sidesteps both computational expense and the brittleness of local approximations, potentially unlocking interpretability at production scale.

Modelwire context

Explainer

The key detail the summary gestures past is the sparse recovery framing itself: STRIDE treats attribution as a compressed sensing problem, meaning it can identify influential training examples from a small number of subset perturbations rather than exhaustive retraining sweeps. That's a specific algorithmic bet, not just a vague 'activation-space' reorientation, and it rests on a real assumption: that only a small fraction of training examples materially influence any given output.
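To make the compressed sensing framing concrete, here is a minimal toy sketch in Python. It is not STRIDE's algorithm, which this summary does not specify. It assumes a hypothetical linear setting in which each measurement is the centered change in a model output after including or excluding a random subset of training examples, and it recovers a sparse influence vector with plain ISTA (iterative soft-thresholding for the lasso). All names, sizes, and the noise level are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(A, y, lam, n_iter=2000):
    """Minimize 0.5 * ||A w - y||^2 + lam * ||w||_1 with ISTA."""
    lipschitz = np.linalg.norm(A, 2) ** 2  # squared spectral norm of A
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w

rng = np.random.default_rng(0)
n_examples, n_subsets, k = 200, 100, 5  # hypothetical sizes

# Assumption: only k of the n_examples training examples influence the output.
true_influence = np.zeros(n_examples)
support = rng.choice(n_examples, size=k, replace=False)
signs = rng.choice([-1.0, 1.0], size=k)
true_influence[support] = signs * rng.uniform(1.0, 2.0, size=k)

# Each row is one subset perturbation: 1 = example included, 0 = excluded.
masks = rng.integers(0, 2, size=(n_subsets, n_examples)).astype(float)
# Center the inclusion indicators to +/-1 and scale columns to unit norm.
A = (2.0 * masks - 1.0) / np.sqrt(n_subsets)

# Simulated noisy measurements of the centered output change (toy linear model).
y = A @ true_influence + 0.01 * rng.standard_normal(n_subsets)

estimate = ista_lasso(A, y, lam=0.05)
recovered = np.argsort(-np.abs(estimate))[:k]
print(sorted(recovered.tolist()), sorted(support.tolist()))
```

In this toy setup, 100 subset measurements are used to search over 200 candidate examples, which is only possible because the influence vector is assumed sparse. If influence were spread across many examples, the same measurement budget would not be expected to pin it down.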

Attribution and interpretability have been recurring pressure points across recent Modelwire coverage. SafeSteer (from June 1) showed that activation-space representations are already being used to localize safety-critical behavior during training, and STRIDE extends that same intuition toward post-hoc auditing. More directly, the SPADE-Bench paper from the same day frames agent deception as an oversight problem, and STRIDE's attribution machinery is exactly the kind of tool that would need to exist before deception audits become operationally credible. The connection isn't incidental: both papers are circling the same gap between what a model does and what practitioners can verify about why.

Watch whether any of the major LLM auditing or red-teaming organizations (Anthropic, DeepMind safety teams, or third-party auditors) cite STRIDE in applied work within the next six months. Adoption there would signal the method holds up outside controlled benchmark conditions.

Coverage we drew on

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
