Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

Researchers show that quality signals in embedding space transfer across languages, enabling classifiers trained on high-resource languages to filter training data for low-resource ones. In tests with a 1B model trained on 103B tokens, multilingual pooling outperformed monolingual baselines in both stability and accuracy, suggesting a scalable path for data curation in imbalanced multilingual settings.

Modelwire context

Explainer

The paper's real contribution isn't just better filtering; it's the empirical demonstration that quality is partly a language-agnostic property in embedding space, which challenges the common assumption that you need native-language signal to judge native-language data. The 103B-token scale also makes this one of the larger controlled tests of cross-lingual transfer for pretraining curation specifically.
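To make the idea of a language-agnostic quality signal concrete, here is a minimal toy sketch. It is not the paper's method or data: the embeddings are synthetic, the helper names (make_language, train, score) are hypothetical, and it simply assumes that quality lies along one shared axis while each language adds its own offset on a separate axis. A logistic classifier is fit on two pooled "high-resource" languages and then used to rank documents from a third language it never saw.

```python
import math
import random

DIM = 8  # toy embedding width (assumption; real encoders are far wider)

def make_language(rng, n, lang_axis):
    """Synthetic embeddings: quality lives on axis 0 for every language,
    while each language adds its own offset on a separate axis."""
    rows = []
    for i in range(n):
        label = i % 2                      # 1 = high quality, 0 = low quality
        vec = [rng.gauss(0.0, 0.5) for _ in range(DIM)]
        vec[0] += 1.0 if label else -1.0   # shared, language-agnostic quality signal
        vec[lang_axis] += 1.0              # language-specific shift
        rows.append((vec, label))
    return rows

def score(w, b, vec):
    """Probability of 'high quality' under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, epochs=200, lr=0.1):
    """Full-batch gradient descent on the logistic loss."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * DIM, 0.0
        for vec, label in rows:
            err = score(w, b, vec) - label
            gb += err
            for d in range(DIM):
                gw[d] += err * vec[d]
        w = [wi - lr * g / len(rows) for wi, g in zip(w, gw)]
        b -= lr * gb / len(rows)
    return w, b

rng = random.Random(0)
# Pool two "high-resource" languages for training.
pooled = make_language(rng, 200, lang_axis=1) + make_language(rng, 200, lang_axis=2)
# Held-out "low-resource" language, never seen during training.
target = make_language(rng, 200, lang_axis=3)

w, b = train(pooled)
accuracy = sum((score(w, b, v) >= 0.5) == bool(y) for v, y in target) / len(target)

# Data selection: keep the top-scoring half of the held-out language.
ranked = sorted(target, key=lambda row: score(w, b, row[0]), reverse=True)
kept = ranked[: len(ranked) // 2]
kept_precision = sum(y for _, y in kept) / len(kept)
print(f"held-out accuracy={accuracy:.2f}, precision of kept half={kept_precision:.2f}")
```

The transfer works here only because the toy data builds in a shared quality axis. Whether real multilingual embeddings behave this way is exactly the empirical question the paper addresses.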

The closest thread in recent coverage is the pair of LLM judge reliability papers from mid-April ('Context Over Content' and 'Diagnosing LLM Judge Reliability'), both of which exposed how automated quality signals break down under pressure. This paper is essentially asking the same question one layer upstream: can a quality classifier generalize at all, before it even reaches the evaluation stage? The judge reliability work focused on inference-time scoring; this focuses on training-data selection. They're different problems, but together they sketch a fragile pipeline where quality signals are assumed to transfer more robustly than the evidence supports.

The real test is whether multilingual pooling holds its advantage when the low-resource language is typologically distant from the high-resource languages in the training set. If a follow-up covers languages like Yoruba or Burmese rather than European language clusters, and the gains persist, the cross-lingual transfer claim would be substantially stronger.

Coverage we drew on

Context Over Content: Exposing Evaluation Faking in Automated Judges · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models · multilingual pretraining · cross-lingual transfer

Read the full story at arXiv cs.CL (arxiv.org).
