Publication
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Mehdi Vali; Manuel Brack; Max Lübbering; Elias Wendt; Abbas Goher Khan; Richard Rutmann; Alex Jude; Maurice Kraus; Alexander Arno Weber; David Kaczér; Florian Mai; Lucie Flek; Rafet Sifa; Nicolas Flores-Herr; Joachim Köhler; Patrick Schramowski; Michael Fromm; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2505.22232, pp. 1-38, 2025.
Abstract
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
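The distillation idea described in the abstract, training a lightweight quality annotator on top of frozen multilingual embeddings using LLM-provided labels, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the `embed` function is a toy stand-in for a pretrained multilingual encoder, and `QualityAnnotator`, the data, and all dimensions are assumptions made for the example.

```python
import numpy as np

def embed(texts, dim=64):
    # Stand-in for a frozen pretrained multilingual embedder; here we
    # simply hash characters into a fixed-size vector per document.
    vecs = []
    for t in texts:
        v = np.zeros(dim)
        for i, ch in enumerate(t):
            v[(ord(ch) + i) % dim] += 1.0
        vecs.append(v / max(len(t), 1))
    return np.stack(vecs)

class QualityAnnotator:
    """Lightweight logistic-regression head trained on LLM quality labels."""
    def __init__(self, dim, lr=0.5, epochs=200):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr, self.epochs = lr, epochs

    def fit(self, X, y):
        # Plain gradient descent on the logistic loss; the embedder
        # itself stays frozen, only this small head is trained.
        for _ in range(self.epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))
            grad = p - y
            self.w -= self.lr * (X.T @ grad) / len(y)
            self.b -= self.lr * grad.mean()
        return self

    def score(self, X):
        # Quality score in [0, 1]; used to filter pretraining documents.
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

# Toy seed set with hypothetical LLM annotations: 1 = keep, 0 = discard.
texts = [
    "A clear, well-structured explanation of the topic.",
    "buy now!!! cheap cheap cheap $$$",
    "The experiment was repeated across all languages.",
    "click here click here click here",
]
labels = np.array([1.0, 0.0, 1.0, 0.0])

X = embed(texts)
annotator = QualityAnnotator(X.shape[1]).fit(X, labels)
scores = annotator.score(X)
kept = [t for t, s in zip(texts, scores) if s > 0.5]
```

At scale, the cheap head is what makes annotation tractable: each candidate document is embedded once and scored with a single dot product, rather than being passed through an LLM judge.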
