
Publication

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Mehdi Vali; Manuel Brack; Max Lübbering; Elias Wendt; Abbas Goher Khan; Richard Rutmann; Alex Jude; Maurice Kraus; Alexander Arno Weber; David Kaczér; Florian Mai; Lucie Flek; Rafet Sifa; Nicolas Flores-Herr; Joachim Köhler; Patrick Schramowski; Michael Fromm; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2505.22232, pp. 1-38, 2025.

Abstract

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
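The core idea of the abstract, distilling an LLM's quality judgments into a lightweight annotator that scores documents on top of frozen multilingual embeddings, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `embed` function is a toy stand-in for a real pretrained multilingual sentence encoder, the regression head is a simple ridge regression fit to hypothetical LLM-provided quality scores, and the names (`LightweightAnnotator`, `filter_corpus`) are invented for this example.

```python
import numpy as np


def embed(text: str, dim: int = 16) -> np.ndarray:
    """Toy stand-in for a frozen pretrained multilingual embedding model.

    In practice this would be a real sentence encoder; here we derive a
    deterministic pseudo-random vector from the text so the sketch runs
    without external model weights.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)


class LightweightAnnotator:
    """Ridge-regression head on frozen embeddings (hypothetical distilled annotator).

    The head is trained on quality scores produced by a strong LLM, so at
    filtering time only the cheap embedding + linear scoring runs.
    """

    def __init__(self, dim: int = 16, reg: float = 1e-2):
        self.dim = dim
        self.reg = reg
        self.w = np.zeros(dim)

    def fit(self, texts: list[str], llm_scores: list[float]) -> None:
        # Closed-form ridge regression: w = (X^T X + reg*I)^{-1} X^T y
        X = np.stack([embed(t, self.dim) for t in texts])
        y = np.asarray(llm_scores, dtype=float)
        A = X.T @ X + self.reg * np.eye(self.dim)
        self.w = np.linalg.solve(A, X.T @ y)

    def score(self, texts: list[str]) -> np.ndarray:
        # One scalar quality score per document.
        X = np.stack([embed(t, self.dim) for t in texts])
        return X @ self.w


def filter_corpus(texts: list[str], annotator: LightweightAnnotator,
                  threshold: float) -> list[str]:
    """Keep only documents whose predicted quality meets the threshold."""
    scores = annotator.score(texts)
    return [t for t, s in zip(texts, scores) if s >= threshold]
```

Because the embedding model is multilingual, a head trained on LLM scores for a handful of languages can, in the spirit of the abstract's cross-lingual transfer claim, be applied to documents in languages never seen during the head's training.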

Further Links