1. Automatic processing of Historical Arabic Documents: A comprehensive Survey
- Author
- Houda Jmila, Mounim A. El-Yacoubi, Mohamed Ibn Khedher; IRT SystemX; Département Réseaux et Services de Télécommunications (RST); Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP); Institut Polytechnique de Paris (IP Paris); Réseaux, Systèmes, Services, Sécurité (R3S-SAMOVAR); Services répartis, Architectures, MOdélisation, Validation, Administration des Réseaux (SAMOVAR); Département Electronique et Physique (EPH); ARMEDIA (ARMEDIA-SAMOVAR)
- Subjects
Survey on Historical Arabic Documents, Computer science, Arabic, Process (engineering), 02 engineering and technology, 01 natural sciences, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Task (project management), Data retrieval, Artificial Intelligence, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Historical Arabic Documents, 010306 general physics, Arabic script, Text recognition, Writer identification, Focus (computing), Information retrieval, Subject (documents), Text analysis, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Scripting language, Signal Processing, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Software
- Abstract
Large numbers of Historical Arabic Documents (HAD) are held in national libraries and archives around the world. Analyzing this type of data manually is difficult and costly, so automatic processing is required to exploit these documents more rapidly. Processing historical documents is a recent research subject that has seen remarkable growth in recent years. Processing Historical Arabic Documents is particularly challenging, first because Arabic script is more complicated than other scripts, and second because the documents are ancient. This paper focuses on this difficult problem and provides a comprehensive survey of existing research work. First, we describe in detail the challenges that make the automatic processing of Historical Arabic Documents a difficult task. Second, we classify this task into four applications of automatic processing of HAD: i) analyzing the document to extract the main text; ii) identifying the writer of the document; iii) recognizing some words or parts of the document in a reference dataset; and iv) retrieving and extracting specific data from the document. For each application, existing approaches are surveyed and qualitatively described. Finally, we focus on available datasets and describe how they can be used in each application.
- Published
- 2020