
Publication

Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task

Markus Freitag; Nitika Mathur; Daniel Deutsch; Chi-kiu Lo; Eleftherios Avramidis; Ricardo Rei; Brian Thompson; Frederic Blain; Tom Kocmi; Jiayi Wang; David Ifeoluwa Adelani; Marianna Buchicchio; Chrysoula Zerva; Alon Lavie
In: Philipp Koehn; Barry Haddow; Tom Kocmi; Christof Monz (eds.). Proceedings of the Ninth Conference on Machine Translation. Conference on Machine Translation (WMT-24), November 15-16, Miami, Florida, USA, Association for Computational Linguistics, 11/2024.

Abstract

The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations generated as part of the WMT24 General MT Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from the previous year. Furthermore, building on the previous year's success, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric's ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system and segment levels. We present an extensive analysis of how well metrics perform on three language pairs: English->Spanish (Latin America), Japanese->Chinese, and English->German. The results strongly confirm last year's finding that fine-tuned neural metrics remain robust even for LLM-based translation systems.
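To illustrate the system-level part of the meta-evaluation, the following is a minimal sketch of pairwise ranking accuracy: the fraction of system pairs for which a metric orders the two systems the same way as the human (MQM) judgments. The function name `pairwise_accuracy` and the example scores are hypothetical and not taken from the shared task's actual implementation, which handles additional details such as ties and statistical significance.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs ranked in the same order by the metric
    and by the human (MQM) scores. Ties count as disagreements in this
    simplified sketch."""
    pairs = list(combinations(metric_scores.keys(), 2))
    agree = 0
    for a, b in pairs:
        metric_diff = metric_scores[a] - metric_scores[b]
        human_diff = human_scores[a] - human_scores[b]
        if metric_diff * human_diff > 0:  # same sign -> same ranking
            agree += 1
    return agree / len(pairs)

# Hypothetical scores for three MT systems (higher = better in both cases;
# MQM error scores are shown negated so that less negative means better).
metric = {"sysA": 0.81, "sysB": 0.74, "sysC": 0.69}
mqm = {"sysA": -1.2, "sysB": -2.5, "sysC": -2.1}
print(pairwise_accuracy(metric, mqm))  # 2 of 3 pairs agree -> ~0.67
```

Segment-level accuracy in the task follows the same idea applied to pairs of translations of the same source segment rather than to whole systems.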