
Publication

DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments

Rémi Calizzano; Malte Ostendorff; Georg Rehm
In: Julian Risch; Anke Stoll; Lena Wilms; Michael Wiegand (Eds.). Proceedings of the GermEval Workshop 2021 -- Shared Task on Toxic, Engaging and Fact-Claiming Comments. GermEval at Conference on Natural Language Processing (GermEval-2021), Düsseldorf, Germany, pages 25-31, University of Klagenfurt, September 2021.

Abstract

We present our submission to the first subtask of GermEval 2021 (classification of German Facebook comments as toxic or not). Binary sequence classification is a standard NLP task with known state-of-the-art methods. Therefore, we focus on data preparation using two different techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and MT5) on 12 hate speech detection datasets in nine different languages. We observe an average improvement of 10% in F1 score when using task-specific pre-training. Second, we perform data augmentation by labelling unlabelled comments, taken from Facebook, to increase the size of the training dataset by 79%. Models trained on the augmented training dataset obtain an F1 score that is on average 0.0282 (+5%) higher than that of models trained on the original training dataset. Finally, the combination of the two techniques allows us to obtain an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with MT5. The code of the project is available at: https://github.com/airKlizz/germeval2021toxic.
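The first subtask is framed as binary sequence classification, so a classifier of this kind can be set up with a pre-trained multilingual transformer such as XLM-RoBERTa via the Hugging Face Transformers library. The sketch below is a minimal, illustrative example, not the authors' implementation (see the linked repository for that); the model checkpoint, label order, and example comments are assumptions.

```python
# Minimal sketch: binary toxicity classification with XLM-RoBERTa
# using Hugging Face Transformers. Checkpoint name, label mapping,
# and example comments are illustrative assumptions, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # two classes: not toxic / toxic
)

# Hypothetical German comments; the paper uses German Facebook comments.
comments = ["Das ist ein interessanter Beitrag.", "Halt einfach den Mund!"]
inputs = tokenizer(comments, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)  # assumed label order: 0 = not toxic, 1 = toxic
print(predictions.tolist())
```

In practice, such a model would first be fine-tuned on the labelled GermEval training data (optionally after task-specific pre-training on other hate speech datasets, as described in the abstract) before its predictions are meaningful.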

Projects