1. Processing a large collection of historical tabular images.
- Author
- Granell, Emilio; Romero, Verónica; Prieto, José Ramón; Andrés, José; Quirós, Lorenzo; Sánchez, Joan Andreu; and Vidal, Enrique
- Subjects
- *DOCUMENT imaging systems, *DATABASES, *HISTORICAL source material, *IMAGE databases, *IMAGE analysis, *TEXT recognition, *OPTICAL character recognition
- Abstract
• Presents the process of creating a new database of historical tabular manuscripts.
• The created database is called HisClima and is freely accessible.
• Interactive DLA was used to ease text line detection and table structure creation.
• Computer-assisted transcription was used to ease the acquisition of textual information.
• Baseline results for DLA, Text Recognition, and Probabilistic Indexing are provided.

Automatically processing historical document images so that their textual information can be searched requires ground-truth data for training and evaluation. Preparing this data is expensive and arduous, especially when the images contain specialized vocabulary and/or tabular information. In the latter case, relevant decisions have to be made about how to annotate the tabular parts. This paper presents a complex collection of historical document images and the resulting database, called HisClima. In this database, half of the images are in tabular format and half contain running text. Both types of images contain pre-printed and handwritten text. The text is full of abbreviations and specific vocabulary related to weather conditions and old ships. The database can be used to research technologies for historical document image processing and analysis, for both tabular and running text recognition. Baseline results are presented for Document Layout Analysis, Text Recognition, and Probabilistic Indexing. Although these results are good, there is still room for improvement, and some indications are provided in this direction. [ABSTRACT FROM AUTHOR]
- Published
- 2023