Publication

Word-Based and Character-Based Word Segmentation Models: Comparison and Combination

Weiwei Sun
In: Chu-Ren Huang; Dan Jurafsky (eds.). 23rd International Conference on Computational Linguistics (COLING-10), August 23-27, Beijing, China, pages 1211-1219. Coling 2010 Organizing Committee, August 2010.

Abstract

We present a theoretical and empirical comparative analysis of the two dominant categories of approaches in Chinese word segmentation: word-based models and character-based models. We show that, in spite of similar overall performance, the two models produce different distributions of segmentation errors, in a way that can be explained by theoretical properties of the two models. The analysis is further exploited to improve segmentation accuracy by integrating a word-based segmenter and a character-based segmenter. A Bootstrap Aggregating model is proposed. By letting multiple segmenters vote, our model improves segmentation consistently on the four different data sets from the second SIGHAN bakeoff.
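
To make the idea of segmenters voting concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not specify the voting unit, so the sketch assumes a strict majority vote on a per-character "word ends here" flag, and it assumes every segmenter returns non-empty words that concatenate back to the input text. The names `to_boundaries`, `from_boundaries`, `vote`, and `bootstrap_sample` are hypothetical.

```python
import random


def to_boundaries(words):
    """Turn a segmentation into per-character flags.

    A flag is True when a word ends right after that character.
    Assumes every word is non-empty.
    """
    flags = []
    for word in words:
        flags.extend([False] * (len(word) - 1) + [True])
    return flags


def from_boundaries(text, flags):
    """Rebuild a word list from the text and its boundary flags."""
    words, start = [], 0
    for i, is_end in enumerate(flags):
        if is_end:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # trailing characters with no closing flag
        words.append(text[start:])
    return words


def vote(text, segmentations):
    """Strict majority vote per character position (assumed scheme).

    A boundary is kept only if more than half of the segmenters
    propose it, so an even split resolves to "no boundary".
    """
    columns = zip(*(to_boundaries(seg) for seg in segmentations))
    flags = [2 * sum(column) > len(column) for column in columns]
    return from_boundaries(text, flags)


def bootstrap_sample(sentences, rng):
    """Draw a training set of the same size, sampling with replacement."""
    return [rng.choice(sentences) for _ in sentences]


if __name__ == "__main__":
    text = "研究生命起源"
    outputs = [
        ["研究", "生命", "起源"],    # e.g. a segmenter trained on sample 1
        ["研究生", "命", "起源"],    # e.g. a segmenter trained on sample 2
        ["研究", "生命", "起源"],    # e.g. a segmenter trained on sample 3
    ]
    print(vote(text, outputs))
```

In a bagging setup, each segmenter would be trained on its own `bootstrap_sample` of the training sentences, and `vote` would combine their outputs at prediction time. Using an odd number of segmenters avoids ties under this voting rule.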
