1. An algorithm of line segmentation and reading order sorting based on adjacent character detection: A post-processing of OCR for digitization of Chinese historical texts.
- Author
- Lee, Aram; Yu, HongYeon; and Min, Gihyeon
- Subjects
- *OPTICAL character recognition, *CLADISTIC analysis, *COLUMNS, *DIGITIZATION, *ALGORITHMS, *DIGITAL technology, *COMPOSITE columns
- Abstract
• OCR-detected characters or words in an image require line segmentation to be reunited into words or sentences.
• Projection-based line segmentation cannot be applied to Chinese historical texts.
• Columns in Chinese historical texts exhibit a sub-divided column format.
• The adjacent character detection (ACD) algorithm can deal with this unique column structure.
• The performance of the ACD algorithm as a post-processing step for OCR is discussed.

In recent times, the advent of AI-based optical character recognition (OCR) has garnered significant attention in the realm of digital text conversion. However, OCR identifies only individual characters or words and cannot reunite them into cohesive units such as words or sentences. Consequently, manually sorting them to establish the appropriate reading order has emerged as a bottleneck. In this paper, we present an algorithm termed adjacent character detection (ACD), designed to serve as a post-processing step for OCR and to facilitate automatic digital text conversion. The algorithm performs line segmentation through a quad-ACD scan (up-down-down-up), allowing it to consecutively discern characters within a column based on their adjacency relations. Conventional projection profile analyses have struggled to partition the distinct internal structure of Chinese historical text, where two annotation columns often subdivide from a single body column. In contrast, our ACD algorithm reunites adjacent characters rather than fragmenting the entire text into isolated entities. Additionally, the ACD algorithm enables body/annotation classification of OCR-detected characters based on the pattern analysis of its quad scan. This cumulative information makes it possible to convert the text to digital form in the desired reading order. To assess the efficacy of the proposed algorithm, a set of ground-truth OCR results was tested, yielding a reading order accuracy of 98.6%. The algorithm also remained robust against misaligned columns, experimentally induced by applying tilt, warp, and wavy noise to the original digital images. Lastly, the algorithm was integrated with two pre-developed OCR models, resulting in a reading order accuracy of 97.7%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
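
The abstract's core idea, grouping OCR-detected characters into columns by adjacency rather than by projection profiles, can be illustrated with a rough sketch. The code below is not the authors' ACD algorithm: it performs a single downward scan only (not the quad up-down-down-up scan), does not handle sub-divided annotation columns or body/annotation classification, and every name in it (`Box`, `is_adjacent_below`, `group_columns`, `reading_order`, `max_gap_ratio`, and the 0.5 overlap threshold) is a hypothetical choice made for illustration. It assumes vertical text read top to bottom, with columns ordered right to left.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical OCR output: a character and its bounding box (top-left origin)."""
    ch: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float


def h_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal overlap between two boxes (negative if disjoint)."""
    return min(a.x + a.w, b.x + b.w) - max(a.x, b.x)


def is_adjacent_below(upper: Box, lower: Box, max_gap_ratio: float) -> bool:
    """True if `lower` sits directly under `upper` within a tolerated vertical gap."""
    gap = lower.y - (upper.y + upper.h)
    return (
        h_overlap(upper, lower) > 0.5 * min(upper.w, lower.w)  # assumed threshold
        and -0.5 * upper.h <= gap <= max_gap_ratio * upper.h
    )


def group_columns(boxes: List[Box], max_gap_ratio: float = 1.0) -> List[List[Box]]:
    """Single downward scan: attach each box to the column whose last box is adjacent."""
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        candidates = [c for c in columns if is_adjacent_below(c[-1], box, max_gap_ratio)]
        if candidates:
            # Prefer the column whose last character overlaps this box the most.
            max(candidates, key=lambda c: h_overlap(c[-1], box)).append(box)
        else:
            columns.append([box])
    return columns


def reading_order(boxes: List[Box], max_gap_ratio: float = 1.0) -> str:
    """Concatenate columns right to left, each column top to bottom."""
    columns = group_columns(boxes, max_gap_ratio)
    columns.sort(key=lambda c: -sum(b.x + b.w / 2 for b in c) / len(c))
    return "".join(b.ch for c in columns for b in c)


if __name__ == "__main__":
    # Two slightly misaligned columns, given in arbitrary detection order.
    sample = [
        Box("E", 60, 25, 20, 20),
        Box("A", 100, 0, 20, 20),
        Box("C", 99, 50, 20, 20),
        Box("D", 62, 0, 20, 20),
        Box("B", 102, 25, 20, 20),
    ]
    print(reading_order(sample))  # prints "ABCDE"
```

Because each character is linked only to its neighbor directly above, small horizontal drift between successive characters is tolerated, which is one plausible reason an adjacency-based approach copes with tilted or warped columns better than a global projection profile.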