1. Evaluating and mitigating the impact of OCR errors on information retrieval.
- Author
- de Oliveira, Lucas Lima; Vargas, Danny Suarez; Alexandre, Antônio Marcelo Azevedo; Cordeiro, Fábio Corrêa; Gomes, Diogo da Silva Magalhães; Rodrigues, Max de Castro; Romeu, Regis Kruel; and Moreira, Viviane Pereira
- Subjects
- *
OPTICAL character recognition , *INFORMATION retrieval , *DIGITIZATION - Abstract
Optical character recognition (OCR) is typically used to extract the textual contents of scanned texts. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is an alternative to fix digitization errors and, intuitively, improve the results of downstream tasks. This work evaluates the impact of OCR digitization and correction on IR. We compared different digitization and correction methods on real OCR-ed data from an IR test collection with 22k documents and 34 query topics on the geoscientific domain in Portuguese. Our results have shown significant differences in IR metrics for the different digitization methods (up to 5 percentage points in terms of mean average precision). Regarding the impact of error correction, our results showed that on the average for the complete set of query topics, retrieval quality metrics change very little. However, a more detailed analysis revealed it improved 19 out of 34 query topics. Our findings indicate that, contrary to previous work, long documents are impacted by OCR errors. [ABSTRACT FROM AUTHOR]
- Published
- 2023