Conference topic

Accurate document digitalization based on text recognition confidence estimation

  Document digitalization is one of the basic technologies in multimedia information search and retrieval. It offers a powerful way to bridge the gap between massive, redundant image information and retrievable text. Although optical character recognition (OCR) technology has been widely applied in document digitalization projects, character misrecognition is inevitable because of image degradation caused by printing errors, illumination variation, or blurring. In some circumstances, a compromise is to detect misrecognized characters accurately and leave them as embedded character images in the final electronic document. It is therefore crucial to estimate recognition confidence in order to detect recognition errors. In this paper, we propose a novel document digitalization method that combines traditional OCR technology with text recognition confidence analysis based on Convolutional Neural Networks (CNN). Briefly, samples are first processed by a traditional OCR system to generate a first-stage recognition result, whose error rate is usually below 2%. Each recognized character is then assigned a confidence value by an independent CNN-based confidence estimator, and recognized characters with low confidence values are marked as misrecognized. Experimental results show that our method achieves a clear improvement over the baseline system.
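The final step described in the abstract, keeping confident characters as text and leaving low-confidence ones as embedded images, can be illustrated with a minimal sketch. This is not the authors' implementation: the `RecognizedChar` type, the `render_document` function, the `[img:N]` placeholder, and the 0.5 threshold are all hypothetical, and the confidence values are assumed to come from some separate estimator (a CNN in the paper).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognizedChar:
    text: str          # character produced by the first-stage OCR
    confidence: float  # score in [0, 1] from a separate estimator (assumed given)


def render_document(chars: List[RecognizedChar], threshold: float = 0.5) -> List[str]:
    """Keep confident characters as text; replace the rest with an image placeholder."""
    output = []
    for index, char in enumerate(chars):
        if char.confidence >= threshold:
            output.append(char.text)
        else:
            # Low confidence: treat as misrecognized and keep the original glyph image.
            output.append(f"[img:{index}]")
    return output


# Illustrative values only; they are not taken from the paper.
sample = [
    RecognizedChar("c", 0.98),
    RecognizedChar("a", 0.12),
    RecognizedChar("t", 0.91),
]
print(render_document(sample))
```

Raising the threshold in such a scheme leaves more characters as images and fewer undetected errors in the text; lowering it does the opposite.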

Document Digitalization; Optical Character Recognition; Confidence Estimation; Convolutional Neural Networks

Pengchao Li, Liangrui Peng, Juan Wen

Tsinghua National Laboratory for Information Science and Technology, Dept. of Electronic Engineering, Equipment Academy, Beijing, 101416, China

Domestic conference

The 12th National Conference on Information Hiding and Multimedia Information Security (第十二届全国信息隐藏暨多媒体信息安全学术大会)

Wuhan

English

124-137

2015-03-28 (date first posted online on the Wanfang platform; not the paper's publication date)