1. Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process
- Author
- Jukka Kervinen, Kimmo Kettunen, and Mika Koistinen
- Subjects
History, Process (engineering), Sample (statistics), 02 engineering and technology, Library and Information Sciences, computer.software_genre, Ephemera, Newspaper, Bibliography. Library science. Information resources, Finnish historical newspapers, 0202 electrical engineering, electronic engineering, information engineering, OCR quality, Legal deposit, Ground truth, Information retrieval, evaluation, National library, ground truth data, 05 social sciences, Optical character recognition, 020201 artificial intelligence & image processing, measurement, 0509 other social sciences, 050904 information & library sciences, computer
- Abstract
Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 16.51 million pages, mainly in Finnish and Swedish. Of these, about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the years 1771 to 1929. The last nine years, 1921–1929, were opened in January 2018. This paper briefly presents the ground truth Optical Character Recognition (OCR) data of about 500 000 words that has been compiled at the NLF to develop an improved OCR process for the Finnish collection. We discuss the compilation of the data in general terms and show results of the new OCR process in comparison to the current OCR, using the ground truth data as an evaluation benchmark. We also show, with real newspaper data covering 30 years and 109 million words, that the re-OCRing process improves the quality of the OCRed data.
- Published
- 2020