Publication
Very large language models for machine translation
Christian Federmann
Master's thesis, Universität des Saarlandes, Saarbrücken, Germany, July 2007.
Abstract
Current state-of-the-art statistical machine translation relies on statistical language models
which are based on n-grams and model language data using a Markov approach. The quality
of an n-gram model depends on the n-gram order chosen when the model is trained.
As machine translation is of increasing importance, we have investigated extensions that
improve language model quality.
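The Markov assumption behind n-gram models can be illustrated with a toy bigram model (order n = 2), where P(w_i | w_1..w_{i-1}) is approximated by P(w_i | w_{i-1}); the maximum-likelihood estimation below is a standard textbook sketch, not code from the thesis:

```python
# Bigram (n=2) Markov approximation: estimate P(word | prev) by
# maximum likelihood from pair and unigram counts in a toy corpus.
from collections import Counter

def bigram_prob(tokens, prev, word):
    pairs = Counter(zip(tokens, tokens[1:]))      # counts of (w_{i-1}, w_i)
    history = Counter(tokens[:-1])                # counts of w_{i-1} positions
    return pairs[(prev, word)] / history[prev] if history[prev] else 0.0

tokens = "the cat sat on the mat".split()
# "the" occurs twice as a history; once it is followed by "cat".
```

A higher n-gram order conditions on a longer history, which is exactly why model size grows so quickly with the chosen order.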
This thesis presents a new type of language model that allows the integration of very large
language models into the Moses MT framework. The approach creates an index from the
complete n-gram data of a given language model and loads only this index data into memory.
Actual n-gram data is retrieved dynamically from hard disk. The amount of memory required
to store such an indexed language model can be controlled through the indexing parameters
chosen when the index data is created.
Further work done for this thesis included the creation of a standalone language model server.
The current implementation of the Moses decoder is not able to keep language model data
available in memory; instead, it is forced to re-load this data each time the decoder application
is started. Our new language model server moves language model handling into a dedicated
process. This approach allows us to load n-gram data from a network or internet server
and can also be used to export language model data to other applications via a simple
communication protocol.
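A language model server of this kind might look like the following sketch: a dedicated process holds the n-gram data and answers clients over a line-based text protocol. The `PROB`/`QUIT` commands and all names here are assumptions for illustration, not the protocol actually implemented in the thesis:

```python
# Sketch of a standalone language-model server (hypothetical protocol):
# clients send "PROB <ngram>" and receive a log-probability or "NONE".
import socket
import threading

NGRAMS = {"the cat": -1.2, "the dog": -1.5}  # toy model held by the server

def handle(conn):
    """Answer PROB queries on one connection until QUIT or disconnect."""
    with conn, conn.makefile("rw", encoding="utf-8") as f:
        for line in f:
            cmd, _, arg = line.rstrip("\n").partition(" ")
            if cmd == "PROB":
                prob = NGRAMS.get(arg)
                f.write(("NONE" if prob is None else str(prob)) + "\n")
                f.flush()
            elif cmd == "QUIT":
                break

def serve(host="127.0.0.1"):
    """Start the server in a background thread; return its port."""
    srv = socket.create_server((host, 0))  # port 0: pick a free port
    port = srv.getsockname()[1]
    def loop():
        while True:
            conn, _ = srv.accept()
            handle(conn)
    threading.Thread(target=loop, daemon=True).start()
    return port

def query(port, ngram, host="127.0.0.1"):
    """Client side: ask the server for one n-gram's log-probability."""
    with socket.create_connection((host, port)) as s, \
         s.makefile("rw", encoding="utf-8") as f:
        f.write(f"PROB {ngram}\n")
        f.flush()
        reply = f.readline().rstrip("\n")
        f.write("QUIT\n")
        f.flush()
        return None if reply == "NONE" else float(reply)
```

Because the model lives in its own process, it stays resident across decoder restarts, and any application that speaks the protocol can query it.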
We conclude the thesis work by creating a very large language model from the n-gram data
contained in the Google 5-gram corpus released in 2006. Current limitations of the
Moses MT framework hindered our evaluation efforts, so no conclusive results can be
reported. Instead, further work and improvements to the Moses decoder have been identified
as prerequisites before the full potential of very large language models can be efficiently
exploited.