Publication
Mining Parallel Resources for Machine Translation from Comparable Corpora
Santanu Pal; Partha Pakray; Alexander Gelbukh; Josef van Genabith
In: Alexander Gelbukh (Ed.). Computational Linguistics and Intelligent Text Processing, 16th International Conference, CICLing 2015, Proceedings. International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2015), April 14-20, Cairo, Egypt, Pages 534-544, Lecture Notes in Computer Science (LNCS), Vol. 9041, ISBN 978-3-319-18110-3, Springer, 2015.
Abstract
Good performance of Statistical Machine Translation (SMT) is
usually achieved with huge parallel bilingual training corpora, because the
translations of words or phrases are computed based on bilingual data.
However, in the case of low-resource language pairs such as English-Bengali,
performance is affected by an insufficient amount of bilingual training data.
Recently, comparable corpora have become widely regarded as valuable resources
for machine translation. Though very few cases of sub-sentential level
parallelism are found between two comparable documents, there are still
potential parallel phrases in comparable corpora. Mining parallel data from
comparable corpora is a promising approach to collecting more parallel training
data for SMT. In this paper, we propose a method for automatically aligning
English-Bengali comparable sentences from comparable documents. We use a novel
textual entailment method and distributional semantics for text similarity.
Subsequently, we apply a template-based phrase extraction technique to align
parallel phrases from comparable sentence pairs. The effectiveness of our
approach is demonstrated by using parallel phrases as additional training
examples for an English-Bengali phrase-based SMT system. Our system
achieves a significant improvement in translation quality over the
baseline system.