Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

A new study challenges the conventional wisdom that diversity in training data always beats quality for non-English language models. Researchers systematically tested whether German language models benefit more from repeating smaller, heavily filtered datasets across multiple epochs than from training once on larger, lightly filtered corpora. The findings suggest that for resource-constrained practitioners working with high-resource languages, aggressive quality filtering paired with repetition may yield better sample efficiency than the diversity-first approach that dominates English LLM training. This reframes data curation strategy for practitioners building models outside the English-dominant research ecosystem.
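To make the comparison concrete, the sketch below works through the arithmetic of the two regimes under one assumption that the summary does not state: that both are compared at the same total number of training tokens. The function name and all token counts are hypothetical and chosen only for illustration; they are not figures from the study.

```python
def epochs_for_budget(token_budget: int, dataset_tokens: int) -> float:
    """Return how many passes over a dataset consume a fixed token budget."""
    if token_budget <= 0 or dataset_tokens <= 0:
        raise ValueError("token counts must be positive")
    return token_budget / dataset_tokens


# Hypothetical numbers for illustration only; not taken from the study.
TOKEN_BUDGET = 10_000_000_000            # total tokens seen during training
LIGHTLY_FILTERED_TOKENS = 10_000_000_000  # large corpus, light filtering
HEAVILY_FILTERED_TOKENS = 2_500_000_000   # small corpus, aggressive filtering

# Regime A: one pass over the large, lightly filtered corpus.
single_pass_epochs = epochs_for_budget(TOKEN_BUDGET, LIGHTLY_FILTERED_TOKENS)

# Regime B: repeat the small, heavily filtered corpus until the budget is spent.
repeated_epochs = epochs_for_budget(TOKEN_BUDGET, HEAVILY_FILTERED_TOKENS)

print(f"lightly filtered: {single_pass_epochs:.1f} epoch(s)")
print(f"heavily filtered: {repeated_epochs:.1f} epoch(s)")
```

Under these assumed numbers, the filtered regime sees each document four times while the unfiltered regime sees each document once, which is the trade-off between repetition and diversity that the study examines.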

Modelwire context

The buried implication here is about compute budgets, not just data philosophy: if repetition over filtered data matches or beats single-pass training on larger corpora, smaller teams can build competitive German-language models without the crawl infrastructure that only well-resourced labs maintain. The finding also implicitly challenges how 'high-resource' is defined, since German sits in a middle tier where English-derived intuitions about data abundance may not transfer cleanly.

This connects loosely to the surprisal theory paper from arXiv cs.CL on April 30 ('On the Proper Treatment of Units in Surprisal Theory'), which also probed how evaluation frameworks built around English tokenization assumptions break down for other languages. Both papers point to the same structural gap: the methodological defaults of LLM research were shaped by English data conditions and do not port cleanly elsewhere. Beyond that, this story is largely disconnected from recent Modelwire coverage, which has focused on multimodal alignment, market dynamics, and industry governance rather than multilingual pretraining.

Watch whether the filtered-repetition approach holds when tested on downstream task benchmarks like GermanQuAD or Euro-language MMLU variants, not just perplexity, since perplexity gains on filtered data can reflect distribution matching rather than genuine capability improvement.
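The caveat about perplexity can be illustrated with a minimal sketch. Perplexity is the exponential of the mean negative log-likelihood per token, so it drops whenever the evaluation text resembles the training distribution, whether or not the model is more capable. The per-token log-probabilities below are invented for illustration and do not come from the study; the two evaluation sets are likewise hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Invented log-probabilities from one hypothetical model on two held-out sets.
# Set 1 resembles the filtered training data; set 2 is broader German text.
filtered_like_eval = [-1.0, -1.2, -0.8, -1.0]
broad_eval = [-2.0, -2.4, -1.6, -2.0]

ppl_filtered_like = perplexity(filtered_like_eval)
ppl_broad = perplexity(broad_eval)

print(f"perplexity on filtered-like text: {ppl_filtered_like:.2f}")
print(f"perplexity on broad text:         {ppl_broad:.2f}")
```

The same model scores a lower perplexity on text that matches its training distribution, which is why a perplexity gain measured on filtered data alone does not establish a capability gain on downstream tasks.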

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial
