Integrating natural language processing with image document analysis: what we learned from two real-world applications.

Authors :
Chen, Jinying
Cao, Huaigu
Natarajan, Premkumar
Source :
International Journal on Document Analysis & Recognition; Sep 2015, Vol. 18 Issue 3, p235-247, 13p
Publication Year :
2015

Abstract

Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. The sentence boundary detection significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output from OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
1433-2833
Volume :
18
Issue :
3
Database :
Complementary Index
Journal :
International Journal on Document Analysis & Recognition
Publication Type :
Academic Journal
Accession number :
108952114
Full Text :
https://doi.org/10.1007/s10032-015-0247-x