Publication
N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Jörg Steffen
In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Pages 731-734, ELRA, 2004.
Abstract
Statistical n-gram language modeling is used in many domains, such as speech recognition, language identification, machine translation, character recognition and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports on ongoing research in the MEMPHIS project, which employs models based on character-level n-grams instead of term n-grams. The models are used for the multi-lingual classification of documents according to the topics of the MEMPHIS domains. We present methods that deal robustly with large vocabularies and with informal, erroneous texts in different languages. We also report the results of using multi-lingual language models and of experimenting with different classification parameters, such as smoothing techniques and n-gram lengths.
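The general idea of classifying documents with character-level n-gram models can be illustrated with a minimal sketch. It is not the implementation used in the MEMPHIS project: the class name CharNgramModel, the helper functions, the toy training sentences, the topic labels and the choice of additive (Laplace-style) smoothing are all assumptions made for illustration. One model is trained per topic on text in several languages, and a document is assigned to the topic whose model gives it the highest log-probability.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Hypothetical character n-gram model with additive smoothing (illustration only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                       # n-gram length, a tunable parameter
        self.alpha = alpha               # additive smoothing constant (assumed technique)
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character contexts
        self.alphabet = set()            # characters seen during training

    def train(self, texts):
        for text in texts:
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.alphabet.update(gram)

    def log_prob(self, text):
        """Sum of smoothed log P(last char | preceding n-1 chars) over the text."""
        vocab = len(self.alphabet) + 1   # one extra slot for unseen characters
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[gram] + self.alpha
            den = self.context_counts[gram[:-1]] + self.alpha * vocab
            total += math.log(num / den)
        return total


def classify(text, models):
    """Pick the topic whose model assigns the text the highest log-probability."""
    return max(models, key=lambda topic: models[topic].log_prob(text))


# Toy multi-lingual training data; sentences and topic labels are invented.
models = {"sports": CharNgramModel(n=3), "finance": CharNgramModel(n=3)}
models["sports"].train(["the team won the match", "das team gewann das spiel"])
models["finance"].train(["the bank raised interest rates", "die bank erhöhte die zinsen"])

predicted = classify("the team wins the game", models)
```

Because the models operate on characters rather than terms, a misspelled or unseen word still shares most of its n-grams with the training data, and smoothing keeps unseen n-grams from receiving zero probability. Both n and alpha are parameters of the kind the abstract describes experimenting with.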