Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

Researchers extended the TOFU unlearning benchmark across five languages to expose a critical gap in multilingual AI safety. Unlearning effectiveness varies dramatically by language pair: transfer is strongest between linguistically related languages and weaker across distant language families. Layer-wise analysis suggests unlearning concentrates in language-specific pathways rather than shared cross-lingual representations, raising questions about whether current unlearning techniques truly eliminate sensitive knowledge or merely obscure it within multilingual models. This work signals that safety interventions validated in English may not generalize reliably to other languages, a material concern as LLMs scale globally.

Modelwire context

Explainer

The more unsettling finding is not that unlearning transfers poorly across languages, but that it may not truly erase knowledge at all. If unlearning concentrates in language-specific pathways, a model could retain sensitive information in one language even after passing safety evaluations conducted in another.

This connects directly to two threads already running on Modelwire. The 'Learning When to Translate for Multilingual Reasoning' piece from June 1st established that language comprehension gaps are structural, not incidental, in current LLMs. That framing matters here: if models process languages through partly separate internal pathways, then safety interventions applied in English are operating on a different slice of the model than the one a Turkish or Arabic speaker activates. Separately, the harm amplification work from June 1st showed that single-turn safety benchmarks miss multi-turn vulnerabilities. The same logic applies across languages: a benchmark that validates unlearning in one language is, by this paper's account, testing only a subset of where the forgotten knowledge actually lives.

Watch whether any of the major alignment labs publish unlearning audits that explicitly test retention across typologically distant language pairs, particularly on models already deployed in multilingual production settings. If those audits don't appear within the next two quarters, it suggests the field is treating this as an academic concern rather than a deployment risk.
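To make the idea of such an audit concrete, here is a minimal sketch in Python. It is not the paper's method: the function name, the retention metric (the fraction of forget-set questions a model still answers correctly), the 0.2 threshold, and the language codes are all assumptions chosen for illustration, and the numbers are placeholders rather than reported results.

```python
from itertools import product


def flag_cross_lingual_retention(retention, languages, threshold=0.2):
    """Return (unlearn_lang, eval_lang, score) triples that exceed the threshold.

    retention maps (unlearn_lang, eval_lang) to a hypothetical retention score
    in [0, 1]: the share of forget-set questions still answered correctly when
    the model is unlearned in unlearn_lang and queried in eval_lang.
    Same-language pairs are skipped because the audit targets transfer.
    """
    flagged = []
    for unlearn_lang, eval_lang in product(languages, repeat=2):
        if unlearn_lang == eval_lang:
            continue
        score = retention[(unlearn_lang, eval_lang)]
        if score > threshold:
            flagged.append((unlearn_lang, eval_lang, score))
    # Highest retention first, so the worst leaks are reviewed first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Placeholder scores for illustration only; these are not measured values.
example_languages = ["en", "fr", "tr"]
example_retention = {
    ("en", "en"): 0.05, ("en", "fr"): 0.15, ("en", "tr"): 0.60,
    ("fr", "en"): 0.10, ("fr", "fr"): 0.04, ("fr", "tr"): 0.55,
    ("tr", "en"): 0.45, ("tr", "fr"): 0.50, ("tr", "tr"): 0.06,
}

for pair in flag_cross_lingual_retention(example_retention, example_languages):
    print(pair)
```

An audit of this shape reports every off-diagonal language pair separately, which is the point of the argument above: a single same-language score says nothing about what remains recoverable in the other languages.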

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: TOFU benchmark

Read full story at arXiv cs.CL (arxiv.org)


