A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging.
- Source :
- Digital Scholarship in the Humanities. Dec 2023, Vol. 38 Issue 4, p1604-1612. 9p.
- Publication Year :
- 2023
Abstract
- With the rapid growth of online content written in the Kurdish language, there is an increasing need to make it machine-readable and processable. Part of speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect. The corpus, containing 74,258 words and thirty-eight tags, employs a hybrid approach utilizing the bigram hidden Markov model in combination with the Kurdish rule-based approach to POS tagging. This approach addresses two key problems that arise with rule-based approaches, namely misclassified words and ambiguity-related unanalyzed words. The proposed approach's accuracy was assessed by training and testing it on the DASTAN corpus, yielding a 96% accuracy rate. Overall, this study's findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish. [ABSTRACT FROM AUTHOR]
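The abstract names the two components of the tagger (a bigram hidden Markov model and a rule-based layer) but does not give the rules, the tagset, or the smoothing scheme. The following is only a minimal sketch of how such a hybrid could be wired together, assuming the rules act as a filter on the tags the HMM may consider for a word. The function names, the add-one smoothing, the toy English sentences, and the tags `DET`, `NOUN`, `VERB`, `NUM` are all hypothetical and are not taken from the DASTAN corpus or the paper.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_bigram_hmm(tagged_sentences):
    """Count tag-bigram transitions and word emissions from tagged sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
    return dict(transitions), dict(emissions), vocabulary


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def rule_candidates(word, rules, all_tags):
    """Tags allowed by the first matching rule; every tag if no rule fires."""
    for predicate, tags in rules:
        if predicate(word):
            return tags
    return all_tags


def viterbi_tag(words, model, rules):
    """Viterbi decoding over a bigram HMM, restricted to rule-approved tags."""
    transitions, emissions, vocabulary = model
    all_tags = sorted(emissions)
    n_words = len(vocabulary) + 1  # one extra slot for unseen words
    empty = Counter()
    best = {START: (0.0, [])}      # tag -> (log score, best tag path so far)
    for word in words:
        step = {}
        for tag in rule_candidates(word, rules, all_tags):
            emit = log_prob(emissions.get(tag, empty), word, n_words)
            score, path = max(
                (prev_score
                 + log_prob(transitions.get(prev, empty), tag, len(all_tags))
                 + emit,
                 prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values())[1]


# Toy, hypothetical training data (not from the DASTAN corpus).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
]
# One hypothetical rule: a token made only of digits must be a numeral.
toy_rules = [(str.isdigit, ["NUM"])]

model = train_bigram_hmm(toy_corpus)
plain = viterbi_tag(["the", "dog", "runs"], model, toy_rules)
with_rule = viterbi_tag(["the", "3", "dogs"], model, toy_rules)
```

In this sketch the rule layer decides which tags are admissible and the HMM chooses among them using context, which is one plausible reading of how a statistical model could resolve words that rules alone leave ambiguous or unanalyzed.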
Details
- Language :
- English
- ISSN :
- 2055768X
- Volume :
- 38
- Issue :
- 4
- Database :
- Academic Search Index
- Journal :
- Digital Scholarship in the Humanities
- Publication Type :
- Academic Journal
- Accession number :
- 174444647
- Full Text :
- https://doi.org/10.1093/llc/fqad066