Publication
Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag; Ricardo Rei; Nitika Mathur; Chi-kiu Lo; Craig Stewart; Eleftherios Avramidis; Tom Kocmi; George Foster; Alon Lavie; André F. T. Martins
In: Proceedings of the Seventh Conference on Machine Translation (WMT), December 7-8, 2022, Abu Dhabi, United Arab Emirates, pages 46-68, Association for Computational Linguistics.
Abstract
This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task across four domains: news, social, e-commerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. As in last year's edition, we acquired our own human ratings through expert-based evaluation via Multidimensional Quality Metrics (MQM). This setup has several advantages: (i) expert-based evaluation is more reliable, and (ii) we extended the pool of translations with five additional translations based on MBR decoding or rescoring, which are challenging for current metrics. In addition, we initiated a challenge set subtask, in which participants created contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis of how well metrics perform on three language pairs: English to German, English to Russian, and Chinese to English. The results demonstrate the superiority of neural learned metrics and show once again that overlap metrics such as BLEU, spBLEU, and chrF correlate poorly with human ratings. The results also reveal that neural metrics are significantly better than non-neural metrics across different domains and challenges.
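The abstract's two evaluation views (system-level and segment-level correlation between a metric and human MQM ratings) can be illustrated with a minimal sketch. This is not the shared task's official scoring code; the scores below are invented for illustration, and the segment-level grouping shown (per-system Kendall tau, averaged) is just one common variant.

```python
# Minimal sketch of system- and segment-level correlation between an
# automatic metric and human MQM ratings. All numbers are illustrative.
from scipy.stats import kendalltau, pearsonr

# Hypothetical scores: rows = MT systems, columns = segments.
# Both metric and MQM scores are oriented so that higher = better
# (raw MQM error weights would first need to be negated/normalised).
metric_scores = [
    [0.82, 0.75, 0.90, 0.61],  # system A
    [0.70, 0.68, 0.74, 0.55],  # system B
    [0.88, 0.80, 0.93, 0.70],  # system C
]
mqm_scores = [
    [0.80, 0.70, 0.85, 0.60],
    [0.65, 0.60, 0.70, 0.50],
    [0.90, 0.85, 0.95, 0.75],
]

# System level: average each system's scores, correlate across systems.
metric_sys = [sum(row) / len(row) for row in metric_scores]
mqm_sys = [sum(row) / len(row) for row in mqm_scores]
r, _ = pearsonr(metric_sys, mqm_sys)
print(f"system-level Pearson r = {r:.3f}")

# Segment level: correlate per-segment scores within each system,
# then average the per-system Kendall tau values.
taus = [kendalltau(m, h)[0] for m, h in zip(metric_scores, mqm_scores)]
print(f"mean segment-level Kendall tau = {sum(taus) / len(taus):.3f}")
```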
Projects
SocialWear - Socially Interactive Smart Fashion,
TextQ - Analysis and Automatic Estimation of the Quality of Machine-Generated Texts