101. A Study on In-Vocabulary Word Segmentation.
- Author
- WANG, ZHENXING; HUANG, CHANGNING; and ZHU, JINGBO
- Subjects
- *VOCABULARY, *CHINESE language, *FOREIGN language education, *LEXICOLOGY, *COMPUTATIONAL linguistics, *ENCYCLOPEDIAS & dictionaries, *WORD formation (Grammar)
- Abstract
- Since the first Chinese Word Segmentation (CWS) Bakeoff in 2003, CWS research has flourished, because the Bakeoff provides a platform that helps participants recognize the merits and drawbacks of their segmenters. However, the Bakeoff evaluation metric is not sufficient to measure performance thoroughly, and it is sometimes even misleading. One typical consequence of this insufficiency is a popular belief in the research field that word-based segmentation yields better results than character-based tagging (CT) on in-vocabulary (IV) words, even within the Bakeoff closed tests. Based on this belief, much effort has gone into balancing performance on IV and out-of-vocabulary (OOV) words by combining the two methods. In this paper, we provide a more detailed evaluation metric for IV and OOV words than the Bakeoff's, and use it to analyze the CT method and the combination method, which is a typical way to seek such a balance. Our evaluation metric shows that CT outperforms dictionary-based (more generally, word-based) segmentation on both IV and OOV words within the Bakeoff closed tests. Furthermore, our analysis shows that using a confidence measure to combine the two segmentation results is appropriate only within certain limits. [ABSTRACT FROM AUTHOR]
- Published
- 2008