A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging.

Authors :
Maulud, Dastan
Jacksi, Karwan
Ali, Ismael
Source :
Digital Scholarship in the Humanities. Dec 2023, Vol. 38, Issue 4, pp. 1604-1612. 9 pp.
Publication Year :
2023

Abstract

With the rapid growth of online content written in the Kurdish language, there is an increasing need to make it machine-readable and processable. Part-of-speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect. The corpus contains 74,258 words and thirty-eight tags. The tagger uses a hybrid approach that combines a bigram hidden Markov model with a Kurdish rule-based approach to POS tagging. This approach addresses two key problems that arise with rule-based approaches, namely misclassified words and words left unanalyzed because of ambiguity. The proposed approach was trained and tested on the DASTAN corpus, yielding 96% accuracy. Overall, the study's findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish. [ABSTRACT FROM AUTHOR]
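
As a rough illustration of the kind of hybrid tagging the abstract describes, the following Python sketch combines a bigram hidden Markov model with a rule layer that narrows the candidate tags for each word. It is not the authors' implementation. The toy English sentences, the three-tag tagset, the `rule_candidates` rules, and the add-one smoothing are all assumptions chosen for illustration; nothing here is taken from the DASTAN corpus or from the paper's Kurdish rules.

```python
import math
from collections import Counter, defaultdict

# Toy tagged sentences (hypothetical data, not drawn from the DASTAN corpus).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def rule_candidates(word, tagset):
    """Hypothetical rule layer: return the set of tags the rules allow."""
    if word in {"the", "a"}:
        return {"DET"}               # unambiguous closed-class word
    if word.endswith("s"):
        return {"NOUN", "VERB"}      # ambiguous: left for the HMM to resolve
    return set(tagset)               # unanalyzed word: every tag stays possible


def train_bigram_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        prev = "<s>"                 # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def hybrid_tag(words, model):
    """Viterbi decoding restricted to the tags the rule layer allows."""
    trans, emit, tagset, vocab = model
    n_tags, n_words = len(tagset), len(vocab) + 1  # +1 slot for unseen words
    best = {"<s>": (0.0, [])}        # tag -> (log score, best path so far)
    for word in words:
        allowed = rule_candidates(word, tagset) & set(tagset) or set(tagset)
        nxt = {}
        for tag in sorted(allowed):
            e = log_prob(emit[tag], word, n_words)
            nxt[tag] = max(
                ((s + log_prob(trans[p], tag, n_tags) + e, path + [tag])
                 for p, (s, path) in best.items()),
                key=lambda item: item[0],
            )
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


model = train_bigram_hmm(TRAIN)
print(hybrid_tag(["the", "dog", "sleeps"], model))  # ['DET', 'NOUN', 'VERB']
```

In this sketch the rules fix the tag where they are certain, and the bigram statistics choose among the remaining candidates, which is one plausible way to handle the ambiguous and unanalyzed words that a purely rule-based tagger leaves unresolved.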

Details

Language :
English
ISSN :
2055-768X
Volume :
38
Issue :
4
Database :
Academic Search Index
Journal :
Digital Scholarship in the Humanities
Publication Type :
Academic Journal
Accession number :
174444647
Full Text :
https://doi.org/10.1093/llc/fqad066