1. Enhancing Usability of Digital Collections: Accuracy Assessment and OCR Post-Correction of the Digital Museum of the Romanian Novel.
- Author
- VĂTAVU, Bogdan and MORARIU, David
- Subjects
- *OPTICAL character recognition, *ROMANIAN literature, *VIRTUAL museums, *ELECTRONIC records, *DIGITAL libraries, *NINETEENTH century
- Abstract
This paper presents a methodology for assessing the accuracy of large collections of digital documents resulting from Optical Character Recognition (OCR) applied to their original print editions during scanning. We applied this methodology to the Digital Museum of the Romanian Novel, a digital collection of Romanian literary texts from the 19th and 20th centuries. With minimal text interventions, we compared each word token in our collection against custom-made lexicons to establish an OCR accuracy rate for each document. Additionally, this methodology provides hints regarding the common OCR mistakes in the collection that can be safely corrected, but also identifies potential improvements for the custom lexicons by listing and storing the possible additions. We believe that, with access to appropriate lexicons, this method can be applied to any language when working with large corpora of OCRed texts. [ABSTRACT FROM AUTHOR]
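The lexicon comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assess_document`, the tokenization rule, and the five-word lexicon are hypothetical, chosen only to show how a per-document accuracy rate and a list of unmatched tokens might be derived.

```python
import re
from collections import Counter

# Assumption: a token is a run of Unicode letters, optionally hyphenated.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def assess_document(text, lexicon):
    """Return (accuracy, unknown) for one OCRed document.

    accuracy is the share of tokens found in the lexicon;
    unknown counts the tokens that were not found, which may be
    OCR mistakes or candidate additions to the lexicon.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    unknown = Counter(t for t in tokens if t not in lexicon)
    known = len(tokens) - sum(unknown.values())
    accuracy = known / len(tokens) if tokens else 0.0
    return accuracy, unknown


# Hypothetical toy lexicon; a real one would hold many thousands of word forms.
lexicon = {"casa", "era", "mare", "și", "veche"}

# "si" lacks its diacritic, as an OCR engine might render "și".
accuracy, unknown = assess_document("Casa era mare si veche.", lexicon)
print(f"accuracy: {accuracy:.2f}")
print(unknown.most_common())
```

Sorting the unmatched tokens by frequency across a whole collection is one plausible way to separate recurring OCR mistakes from legitimate words missing from the lexicon, though that decision would still need manual review.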
- Published
- 2024