Accurate document digitalization based on text recognition confidence estimation
Document digitalization is one of the basic technologies in multimedia information search and retrieval research. It offers a powerful way to bridge the gap between massive, redundant image information and retrievable text. Although optical character recognition (OCR) technology has been widely applied in document digitalization projects, character misrecognition is inevitable because of image degradation caused by printing errors, illumination variation, or blurring. In some circumstances, a practical compromise is to detect misrecognized characters accurately and leave them as embedded character images in the final electronic document. It is therefore crucial to estimate recognition confidence in order to detect recognition errors. In this paper, we propose a novel document digitalization method that combines traditional OCR technology with text recognition confidence analysis based on Convolutional Neural Networks (CNN). Briefly, samples are first processed by a traditional OCR system to generate the first-stage recognition result, whose error rate is usually below 2%. Each recognized character is then assigned a confidence value by an independent CNN-based confidence estimator, and recognized characters with low confidence values are marked as misrecognized. Experimental results show that our method achieves a clear improvement over the baseline system.
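The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `RecognizedChar`, `EmbeddedImage`, `digitalize`, the `image_ref` field, and the threshold value 0.5 are all hypothetical, and the confidence scores are assumed to have been produced already by a separate estimator (a CNN in the paper).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class RecognizedChar:
    text: str          # character returned by the first-stage OCR
    confidence: float  # score assumed to lie in [0, 1], from a separate estimator
    image_ref: str     # hypothetical handle to the cropped character image


@dataclass
class EmbeddedImage:
    image_ref: str     # character kept as an image instead of text


def digitalize(chars: List[RecognizedChar],
               threshold: float = 0.5) -> List[Union[str, EmbeddedImage]]:
    """Keep confident characters as text; embed low-confidence ones as images."""
    output: List[Union[str, EmbeddedImage]] = []
    for ch in chars:
        if ch.confidence < threshold:
            # Treated as misrecognized: fall back to the original character image.
            output.append(EmbeddedImage(ch.image_ref))
        else:
            output.append(ch.text)
    return output
```

In such a scheme the threshold would trade off the number of characters left as images against the number of recognition errors that reach the final text; the abstract does not state how the authors choose it.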
Document Digitalization; Optical Character Recognition; Confidence Estimation; Convolutional Neural Networks
Pengchao Li, Liangrui Peng, Juan Wen
Tsinghua National Laboratory for Information Science and Technology, Dept. of Electronic Engineering, Equipment Academy, Beijing, 101416, China
Domestic conference
Wuhan
English
124-137
2015-03-28 (date first posted online on the Wanfang platform; not the paper's publication date)