UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation.
- Source :
- Applied Sciences (2076-3417); Dec 2024, Vol. 14, Issue 23, p11435, 15p
- Publication Year :
- 2024
Abstract
- Word segmentation is a critical task in natural language processing for Southeast Asian abugida languages, including Thai, Burmese, and Khmer. Existing approaches demonstrate that models using fixed-length windowed context inputs can achieve high segmentation accuracy; however, they often rely on low-level character features or language-specific preprocessing. Character-based methods can limit feature learning, while language-specific features add complexity due to specialized preprocessing requirements. This paper introduces UnifiedCut, a neural model that leverages multiple n-grams within a windowed multi-head attention mechanism. This design captures segmentation features from local contexts and multi-perspective n-gram inputs, enhancing generalization and recall, particularly for out-of-vocabulary words. Compared to CNN- and RNN-based approaches, UnifiedCut's multi-head attention enables finer-grained feature extraction and greater parallelism, resulting in a faster, more scalable solution. Comprehensive experiments on public datasets for Thai, Burmese, and Khmer show that UnifiedCut achieves state-of-the-art performance in word segmentation. [ABSTRACT FROM AUTHOR]
- Subjects :
- NATURAL language processing
- TRANSFORMER models
- GENERALIZATION
Details
- Language :
- English
- ISSN :
- 2076-3417
- Volume :
- 14
- Issue :
- 23
- Database :
- Complementary Index
- Journal :
- Applied Sciences (2076-3417)
- Publication Type :
- Academic Journal
- Accession number :
- 181655741
- Full Text :
- https://doi.org/10.3390/app142311435
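
The abstract describes fixed-length windowed context inputs built from multiple n-grams. As a rough illustration of that input idea only, the sketch below collects character n-grams of several orders from a window around one position. It is not the authors' implementation: the function name `windowed_ngrams`, the `PAD` symbol, the window radius, and the n-gram orders are all assumptions made for this example, and the attention model itself is not shown.

```python
from typing import Dict, List, Sequence

PAD = "\u2400"  # hypothetical padding symbol for positions outside the text


def windowed_ngrams(text: str, center: int, radius: int = 2,
                    orders: Sequence[int] = (1, 2, 3)) -> Dict[int, List[str]]:
    """Return the character n-grams found in a fixed-length window around `center`.

    The window spans `radius` characters on each side of `center`, so it always
    holds 2 * radius + 1 symbols (an assumed layout, chosen for illustration).
    """
    if not 0 <= center < len(text):
        raise IndexError("center must index a character of text")
    # Build the window, padding where it runs past either end of the text.
    window = [text[i] if 0 <= i < len(text) else PAD
              for i in range(center - radius, center + radius + 1)]
    # Slide over the window once per n-gram order.
    return {n: ["".join(window[j:j + n]) for j in range(len(window) - n + 1)]
            for n in orders}


if __name__ == "__main__":
    sample = "ภาษาไทย"  # "Thai language", written without spaces between words
    for n, grams in windowed_ngrams(sample, center=0).items():
        print(n, grams)
```

In a segmenter of this general kind, each position's n-gram lists would typically be embedded and passed to a classifier that decides whether a word boundary follows that position; how UnifiedCut does this is specified in the full text, not in this record.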