
Enhancing Usability of Digital Collections: Accuracy Assessment and OCR Post-Correction of the Digital Museum of the Romanian Novel.

Authors :
VĂTAVU, Bogdan
MORARIU, David
Source :
Revista Transilvania. 2024, Issue 10, p66-75. 10p.
Publication Year :
2024

Abstract

This paper presents a methodology for assessing the accuracy of large collections of digital documents resulting from Optical Character Recognition (OCR) applied to their original print editions during scanning. We applied this methodology to the Digital Museum of the Romanian Novel, a digital collection of Romanian literary texts from the 19th and 20th centuries. With minimal text interventions, we compared each word token in our collection against custom-made lexicons to establish an OCR accuracy rate for each document. Additionally, this methodology provides hints regarding the common OCR mistakes in the collection that can be safely corrected, but also identifies potential improvements for the custom lexicons by listing and storing the possible additions. We believe that, with access to appropriate lexicons, this method can be applied to any language when working with large corpora of OCRed texts. [ABSTRACT FROM AUTHOR]
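The token-against-lexicon comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer pattern, the function name `assess_document`, the lowercasing step, and the tiny sample lexicon are all assumptions made for illustration. It assumes accuracy is the share of word tokens found in the lexicon, and it keeps a frequency list of unmatched tokens.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not from the paper): runs of Unicode
# letters, optionally joined by internal hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def assess_document(text, lexicon):
    """Return (accuracy, unknown_tokens) for one OCRed document.

    accuracy is the fraction of word tokens present in the lexicon;
    unknown_tokens counts every token that was not found.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    unknown = Counter(t for t in tokens if t not in lexicon)
    matched = len(tokens) - sum(unknown.values())
    accuracy = matched / len(tokens) if tokens else 0.0
    return accuracy, unknown


# Illustrative data only: a five-word lexicon and a line with one OCR error
# ("împarat" instead of "împărat").
lexicon = {"a", "fost", "odată", "un", "împărat"}
sample = "A fost odată un împarat"

accuracy, unknown = assess_document(sample, lexicon)
print(f"accuracy: {accuracy:.2f}")
print(unknown.most_common())
```

Under these assumptions, the frequency list of unmatched tokens serves both purposes the abstract mentions: frequent entries that are clear misreadings are candidates for correction, while frequent entries that are valid words are candidates for addition to the lexicon.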

Details

Language :
English
ISSN :
0255-0539
Issue :
10
Database :
Academic Search Index
Journal :
Revista Transilvania
Publication Type :
Academic Journal
Accession number :
181714140
Full Text :
https://doi.org/10.51391/trva.2024.03.05