1. BERT-Based Arabic Diacritization: A state-of-the-art approach for improving text accuracy and pronunciation.
- Author
- Kharsa, Ruba; Elnagar, Ashraf; Yagi, Sane
- Subjects
- *LANGUAGE models, *PRONUNCIATION
- Abstract
In order to accurately represent the meaning and pronunciation of Arabic words and sentences, the presence of diacritics plays a crucial role. Over the years, researchers have dedicated significant efforts to enhancing automated diacritization systems. This paper introduces a novel approach for Arabic diacritization utilizing Bidirectional Encoder Representations from Transformers (BERT) models. To evaluate the effectiveness of the proposed approach, two publicly available datasets, namely the Arabic Diacritization (AD) dataset and the Tashkeela Processed (TP) dataset, were employed. The performance of the models was assessed using various error metrics, including Diacritic Error Rate (DER) and Word Error Rate (WER). The findings demonstrate the superior performance of BERT in the diacritization process, surpassing all models employed in other diacritization systems. On the AD dataset, the proposed system achieved state-of-the-art (SOTA) syntactic DER and WER of 1.14% and 3.34%, respectively. For morphological diacritization, the best results yielded a DER of 0.92% and a WER of 1.91%. These outcomes reflect a remarkable relative error reduction of over 30% compared to previous research. Additionally, on the TP dataset, the BERT models exhibited a substantial decrease in DER, reducing the benchmark from 4.0% to 1.11%. Furthermore, this study introduces a real-time diacritization system called SUKOUN, which offers diacritized text through a user-friendly website. A comparison with existing automatic diacritization tools, using six example texts, reveals the superior prediction accuracy and preservation of input format provided by SUKOUN.
• This study proposes Sukoun (an automated Arabic diacritization system).
• Sukoun was validated using two publicly available AD and TP datasets.
• Sukoun achieves state-of-the-art results both syntactically and morphologically.
• Sukoun outperformed existing Arabic diacritization tools.
• Sukoun reduced error rates by over 30%, highlighting the potential of the proposed model. [ABSTRACT FROM AUTHOR]
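The abstract reports results as Diacritic Error Rate (DER), Word Error Rate (WER), and relative error reduction, but does not spell out how they are counted. The following Python sketch shows one common way such metrics could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`split_diacritics`, `der_wer`, `relative_reduction`) are hypothetical, every base character is counted in the DER denominator, and the set of marks is limited to the eight basic Arabic diacritics (U+064B to U+0652). The paper's own counting conventions may differ.

```python
# Basic Arabic diacritics (tashkeel): fathatan .. sukun, U+064B to U+0652.
# Assumption: only these eight marks are treated as diacritics.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_character, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, prediction):
    """Compute (DER, WER) as fractions for two diacritized texts.

    Assumption: both texts contain the same words with the same base letters
    and differ only in their diacritics.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("texts must be non-empty and have the same word count")

    chars = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base letters differ between the two texts")
        # Compare marks as sorted lists so shadda+vowel order does not matter.
        wrong = sum(
            sorted(rm) != sorted(pm)
            for (_, rm), (_, pm) in zip(ref_pairs, pred_pairs)
        )
        chars += len(ref_pairs)
        char_errors += wrong
        word_errors += 1 if wrong else 0
    return char_errors / chars, word_errors / len(ref_words)


def relative_reduction(baseline, new):
    """Relative error reduction, e.g. 0.25 means the error fell by 25%."""
    return (baseline - new) / baseline
```

Applied to the TP figures quoted in the abstract, `relative_reduction(4.0, 1.11)` gives about 0.72, that is, a relative DER reduction of roughly 72% against the earlier benchmark.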
- Published
- 2024