1. Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model
- Author
- Huang Kaiyu, Jin Bo, Xiao Keli, Liu Zhuang, Huang Degen, and Mo Fengran
- Subjects
- General Computer Science, business.industry, Computer science, Text segmentation, Context (language use), computer.software_genre, Domain (software engineering), Task (project management), Document level, Linguistic sequence complexity, Chinese language, Artificial intelligence, Chinese word, business, computer, Natural language processing
- Abstract
- Word segmentation is an essential and challenging task in natural language processing, especially for Chinese, given its high linguistic complexity. Existing methods for Chinese word segmentation, including statistical machine learning methods and neural network methods, usually perform well within specific knowledge domains. Given the increasing importance of interdisciplinary and cross-domain studies, a key challenge in cross-domain word segmentation is handling out-of-vocabulary (OOV) words, and existing methods fall short of the practical standard. To this end, we propose a document-level context-aware model that can automatically perceive and identify OOV words from different domains. Our method jointly implements a word-based model and a character-based model and then processes their results with a newly proposed reconstruction model. We evaluate the method through comprehensive experiments on two real-world datasets (e.g., news from different domains). The results demonstrate the superiority of our method over state-of-the-art models in handling texts from different domains. Importantly, in the cross-domain scenario, the proposed method can improve the recognition of OOV words.
- Published
- 2021