Publication
Scaling Character-Based Morphological Tagging to Fourteen Languages
Georg Heigold; Günter Neumann; Josef van Genabith
In: Proceedings of IEEE International Conference on Big Data. IEEE International Conference on Big Data (IEEE BigData-16), IEEE BigData, December 5-8, Washington, DC, USA, IEEE, 12/2016.
Abstract
This paper investigates neural character-based
morphological tagging for languages with complex morphology
and large tag sets. Character-based approaches are attractive
as they can handle rare and unseen words gracefully. More
specifically, besides a rich morphology, non-canonical language,
change of language or other linguistic variability can heavily
degrade the accuracy of natural language processing of web
and computer-mediated communication (CMC) data. We evaluate on 14 languages and observe
consistent gains over a state-of-the-art morphological tagger
across all languages except for English and French, where
we match the state-of-the-art. The gains are clearly correlated
with the amount of training data. We present supplementary
experiments to explore whether and to what extent unsupervised
data through pre-trained word vectors can compensate
for limited amounts of supervised data. Moreover, we present
preliminary results on the effect of noisy input data, simulated by
flipping characters at random.
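
The random character flipping mentioned in the abstract could be simulated roughly as follows. This is only an illustrative sketch, not the authors' procedure: the function name `flip_characters`, the `flip_prob` parameter, the replacement alphabet, and the choice to leave whitespace untouched are all assumptions made for this example.

```python
import random
import string


def flip_characters(text, flip_prob, alphabet=string.ascii_lowercase, seed=None):
    """Return a copy of `text` with characters randomly replaced.

    Hypothetical helper: each non-whitespace character is replaced,
    with probability `flip_prob`, by a character drawn uniformly from
    `alphabet`. The drawn character may coincide with the original,
    so the effective corruption rate can be slightly below `flip_prob`.
    """
    rng = random.Random(seed)  # local generator, reproducible when seeded
    noisy = []
    for ch in text:
        if not ch.isspace() and rng.random() < flip_prob:
            noisy.append(rng.choice(alphabet))
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    sentence = "character based tagging handles unseen words"
    # Corrupt roughly 10% of the non-space characters.
    print(flip_characters(sentence, flip_prob=0.1, seed=0))
```

Because the noise is applied per character and whitespace is kept, token boundaries and token counts are preserved, so the gold tags of a sentence would still align with its corrupted tokens.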