Publication
Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus
Eleni Metheniti; Günter Neumann
In: LREC. International Conference on Language Resources and Evaluation (LREC-2020), May 1-4, LREC, 5/2020.
Abstract
Multilingual, inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries.
We evaluate a multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary
(Metheniti and Neumann, 2018), against the largest multilingual, high-quality inflectional corpus, that of the UniMorph project (Kirov et al.,
2018). We confirm that the generated Wikinflection corpus is not of the same quality as UniMorph, but we were able to extract a significant
number of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of
Wiktionary/Wikinflection and from the manually evaluated morphological feature tags of the UniMorph project, and has 216K lemmas
and 5.4M word forms, in a total of 68 languages.
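
The intersection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes UniMorph-style tab-separated lines (lemma, inflected form, feature bundle) and a hypothetical in-memory mapping from (lemma, form) pairs to hyphen-segmented strings standing in for the Wikinflection segmentations. The function names and the toy data are invented for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (lemma, inflected form)


def parse_unimorph(lines: List[str]) -> Dict[Key, List[str]]:
    """Parse tab-separated lines of the form lemma<TAB>form<TAB>features.

    A (lemma, form) pair may carry several feature bundles (syncretism),
    so each key maps to a list.
    """
    entries: Dict[Key, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        lemma, form, feats = line.split("\t")
        entries.setdefault((lemma, form), []).append(feats)
    return entries


def merge_corpora(
    segmentations: Dict[Key, str],
    unimorph: Dict[Key, List[str]],
) -> List[Tuple[str, str, str, List[str]]]:
    """Keep only entries present in both resources.

    Each result pairs the segmentation (hypothetical format) with the
    UniMorph feature bundles for the same (lemma, form) key.
    """
    shared = sorted(segmentations.keys() & unimorph.keys())
    return [
        (lemma, form, segmentations[(lemma, form)], unimorph[(lemma, form)])
        for lemma, form in shared
    ]


# Toy data, invented for illustration only.
unimorph_lines = [
    "walk\twalked\tV;PST",
    "walk\twalking\tV;V.PTCP;PRS",
    "talk\ttalks\tV;3;SG;PRS",
]
segmentations = {
    ("walk", "walked"): "walk-ed",
    ("walk", "walking"): "walk-ing",
    ("jump", "jumped"): "jump-ed",
}

merged = merge_corpora(segmentations, parse_unimorph(unimorph_lines))
for row in merged:
    print(row)
```

Keying on the (lemma, form) pair rather than on the form alone avoids conflating homographs that belong to different lemmas. Entries found in only one resource, such as "talks" and "jumped" in the toy data, are dropped.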