Publication
Parallel Corpus Refinement as an Outlier Detection Algorithm
Kaveh Taghipour; Shahram Khadivi; Jia Xu
In: MT Summit XIII. Machine Translation Summit (MT Summit-11), 13. September 19-23, Xiamen, China, NA, Xiamen, 9/2011.
Abstract
Filtering noisy parallel corpora, or removing
mistranslations from training sets, can
improve the quality of a statistical machine
translation system. Discriminative methods for filtering
the corpora, such as a maximum entropy
model, need properly labeled training data,
which are usually unavailable. Generating all
possible sentence pairs (the Cartesian product)
to produce labeled data yields an imbalanced
training set that contains only a few correct
translations and is thus inappropriate for training
a classifier. In order to treat this problem
effectively, unsupervised methods are utilized
and the problem is modeled as an outlier detection
procedure. The experiments show that
a filtered corpus results in improved translation
quality, even with some sentence pairs
removed.
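
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the features or the detector used, so the single feature (target/source length ratio), the detector (a modified z-score based on the median absolute deviation), and the function names `cartesian_imbalance` and `filter_outliers` are all assumptions made for illustration.

```python
from statistics import median


def cartesian_imbalance(n):
    """Count labeled examples obtained by pairing every source sentence
    with every target sentence of an n-pair corpus.

    Only the n original alignments are correct translations; the other
    n*n - n pairings are negatives, which is the imbalance the abstract
    describes.
    """
    return n, n * n - n


def filter_outliers(pairs, threshold=3.5):
    """Keep sentence pairs whose length ratio is not an outlier.

    Hypothetical feature: number of target words / number of source words.
    Hypothetical detector: modified z-score, 0.6745 * |x - median| / MAD,
    with the commonly used cutoff of 3.5. No labeled data is needed.
    """
    ratios = [len(tgt.split()) / max(len(src.split()), 1) for src, tgt in pairs]
    med = median(ratios)
    mad = median([abs(r - med) for r in ratios])
    if mad == 0:
        # No spread estimate is available, so nothing is removed.
        return list(pairs)
    return [
        pair
        for pair, r in zip(pairs, ratios)
        if 0.6745 * abs(r - med) / mad <= threshold
    ]


print(cartesian_imbalance(1000))  # (1000, 999000)
```

A real filter would likely combine several signals (for example lexical translation scores) rather than length ratio alone; the point of the sketch is only that pairs are scored without labels and the unusual ones are dropped.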