Wiki Dumps to Training Corpora: South Slavic Case

Researchers have developed a systematic pipeline for converting Wikimedia dumps into high-quality training corpora for seven South Slavic languages, addressing a critical gap in multilingual LLM training data. The work tackles two core challenges: extracting usable text from wiki markup and filtering out low-signal database-generated content via n-gram analysis. This methodology directly impacts the feasibility of building capable language models for underrepresented language families, where public training data remains scarce and often noisy. The approach is replicable across other low-resource language groups, making it strategically relevant for organizations scaling multilingual model development.
Modelwire context
ExplainerThe paper's contribution isn't just a cleaned dataset but a replicable pipeline with explicit filtering logic for database-generated wiki content, a category of noise that standard text extraction tools typically ignore entirely. That reproducibility detail is what separates this from a one-off data release.
This work sits at the foundation of a cluster of multilingual capability problems Modelwire has been tracking this week. The Marco-MoE paper (story 8) demonstrates that sparse multilingual models can route language-specific computation efficiently, but that architecture only performs well when the underlying training data is clean and representative. South Slavic languages are unlikely to appear in Marco-MoE's training mix at meaningful quality without exactly this kind of upstream pipeline work. Similarly, the cross-lingual jailbreak detection paper (story 1) notes that safety mechanisms fail for non-English languages partly because guardrail training data is thin. Better corpora for underrepresented languages address that root cause, even if indirectly.
Watch whether any of the seven language corpora produced here get adopted into a public multilingual benchmark or a named model's training card within the next six months. Adoption by a third party would validate the pipeline's quality claims in a way self-reported filtering metrics cannot.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.
MentionsWikimedia · Wikipedia · Wikisource · Wikibooks · Wikinews · Wikiquote
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.