Publication

Parallel Corpus Refinement as an Outlier Detection Algorithm

Kaveh Taghipour; Shahram Khadivi; Jia Xu

In: MT Summit XIII. 13th Machine Translation Summit (MT Summit-11), September 19-23, Xiamen, China, 9/2011.

Abstract

Filtering noisy parallel corpora, or removing mistranslations from training sets, can improve the quality of a statistical machine translation system. Discriminative filtering methods, such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to obtain labeled data produces an imbalanced training set that contains only a few correct translations and is thus inappropriate for training a classifier. To treat this problem effectively, unsupervised methods are used and the problem is modeled as outlier detection. The experiments show that a filtered corpus results in improved translation quality, even with some sentence pairs removed.
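The abstract does not state which features or which outlier detector the paper uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single hypothetical feature (the target/source token-count ratio) and flags sentence pairs whose robust z-score, computed from the median and the median absolute deviation, exceeds an assumed threshold of 3.5. The function names and the toy sentence pairs are invented for illustration.

```python
from statistics import median


def length_ratio(src: str, tgt: str) -> float:
    """Target/source token-count ratio (an assumed, deliberately crude feature)."""
    return len(tgt.split()) / max(len(src.split()), 1)


def filter_outliers(pairs, threshold=3.5):
    """Split sentence pairs into (kept, removed) without any labeled data.

    A pair is removed when the robust z-score of its length ratio,
    0.6745 * |ratio - median| / MAD, exceeds `threshold`.
    """
    ratios = [length_ratio(s, t) for s, t in pairs]
    med = median(ratios)
    mad = median([abs(r - med) for r in ratios])
    if mad == 0:
        # No usable spread estimate: keep everything rather than guess.
        return list(pairs), []
    kept, removed = [], []
    for pair, ratio in zip(pairs, ratios):
        z = 0.6745 * abs(ratio - med) / mad
        (removed if z > threshold else kept).append(pair)
    return kept, removed


# Toy English-German corpus (hypothetical); the sixth pair is a mistranslation.
pairs = [
    ("the house is small", "das Haus ist klein"),
    ("i like green apples very much", "ich mag grüne Äpfel sehr"),
    ("where is the station", "wo ist der Bahnhof bitte"),
    ("she reads a book", "sie liest ein Buch"),
    ("we are going home now", "wir gehen jetzt nach Hause"),
    ("thank you", "klicken Sie hier um die neuesten Angebote und Nachrichten zu erhalten"),
    ("it is raining today", "es regnet heute"),
]

kept, removed = filter_outliers(pairs)
for src, tgt in removed:
    print("removed:", src, "|||", tgt)
```

Because the detector relies only on the distribution of the feature over the corpus itself, it needs no labeled examples, which is the property the abstract motivates. A realistic filter would likely combine several features, such as lexical translation scores, rather than length alone.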

Projects

Accurat - Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation

corpusFiltering2.pdf (pdf, 205 KB)