Back to Search Start Over

UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation.

Authors :
Wen, Yonghua
Xian, Yantuan
Wang, Yuehan
Yu, Zhengtao
Source :
Applied Sciences (2076-3417); Dec2024, Vol. 14 Issue 23, p11435, 15p
Publication Year :
2024

Abstract

Word segmentation is a critical task in natural language processing for southeast Asian Abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity due to specialized preprocessing requirements. This paper introduces UnifiedCut, which is a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut's multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCutachieves state-of-the-art performance in word segmentation. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
20763417
Volume :
14
Issue :
23
Database :
Complementary Index
Journal :
Applied Sciences (2076-3417)
Publication Type :
Academic Journal
Accession number :
181655741
Full Text :
https://doi.org/10.3390/app142311435