Automatic identification of Arabic dialects in social media
- Source :
- SoMeRA@SIGIR
- Publication Year :
- 2014
- Publisher :
- ACM, 2014.
Abstract
- Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD), the languages of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with a considerable overall accuracy of 98%. This work is a first step towards the ultimate goal of a translation system from Arabic to English and French, within the ASMAT project.
- Subjects :
- Computer science
Arabic
Character (computing)
Context (language use)
Naive Bayes classifier
Identification (information)
Formal language
Modern Standard Arabic
Arabic morphology
Social media
Artificial intelligence
Language model
Natural language processing
Details
- Database :
- OpenAIRE
- Journal :
- Proceedings of the First International Workshop on Social Media Retrieval and Analysis
- Accession number :
- edsair.doi...........c2d96a08fdb1c266c6b1ff1207fa3f65
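
The abstract describes a Naive Bayes classifier over character bi-grams. The record does not give the paper's implementation, so what follows is only a minimal sketch of that general technique, assuming add-one (Laplace) smoothing and space padding at text boundaries. The class name `BigramNaiveBayes`, the helper `char_bigrams`, the labels `dialect_a` and `dialect_b`, and the toy training strings are all hypothetical placeholders, not data or names from the paper.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return the overlapping character bigrams of a space-padded string."""
    padded = f" {text} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing constant (assumed)
        self.bigram_counts = defaultdict(Counter)   # label -> bigram counts
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocab = set()                          # all bigrams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.bigram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(bigram | label) per label."""
        total_docs = sum(self.doc_counts.values())
        # One extra slot so bigrams never seen in training still get mass.
        slots = len(self.vocab) + 1
        grams = char_bigrams(text)
        scores = {}
        for label, counts in self.bigram_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha) / (total + self.alpha * slots)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy placeholder strings standing in for labelled social media texts.
train_texts = ["abab abba", "baba abab", "xyxy yxxy", "yxyx xyxy"]
train_labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

model = BigramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("abba"))
print(model.predict("xyyx"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. Character bigrams need no tokenizer, which is one reason they are often tried on noisy, mixed-form text such as the social media posts the abstract mentions.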