User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

Authors :
Higashiyama, Shohei
Utiyama, Masao
Watanabe, Taro
Sumita, Eiichiro
Publication Year :
2023

Abstract

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, NAACL

Details

Database :
OAIster
Notes :
English
Publication Type :
Electronic Resource
Accession number :
edsoai.on1378466831
Document Type :
Electronic Resource