Conference topic

Knowledge Inference Model of OCR Conversion Error Rules Based on Chinese Character Construction Attributes Knowledge Graph

OCR is a character conversion method based on image recognition. The complexity of the characters and the quality of the image play a key role in conversion accuracy. OCR conversion errors are irregular, and in certain text scenarios an incorrectly converted word can still be semantically valid in the context of its original location. In this paper, we propose an OCR conversion error rule inference model based on a knowledge graph of Chinese character construction attributes, which analyzes and reasons over the structure and complexity of Chinese characters. The model integrates several encoding methods, uses different encoders to extract features of the entities and relations of different data types in the knowledge graph, and uses convolutional neural networks to learn and infer unknown error rules in OCR conversion. In addition, so that the triple feature matrix fully contains the construction attribute information of the Chinese characters, a feature crossover algorithm for feature diffusion over the triple feature matrix is introduced. In this algorithm, the relation matrix and the entity matrices are crossed to generate a new feature matrix that better represents a knowledge graph triple. The experimental results show that, compared with current mainstream knowledge inference models, the OCR conversion error rule inference model incorporating the feature crossover algorithm achieves marked improvements in MRR, Hits@1, Hits@2, and other evaluation metrics on public datasets and task-related datasets.
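The evaluation metrics named in the abstract, MRR and Hits@k, have standard definitions in knowledge graph inference. The following is a minimal sketch of those definitions, assuming each test triple has already been assigned the 1-based rank of its correct entity among all candidates. The function names and the sample ranks are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks for three test triples (illustrative only).
    sample_ranks = [1, 2, 4]
    print("MRR:", mean_reciprocal_rank(sample_ranks))
    print("Hits@1:", hits_at_k(sample_ranks, 1))
    print("Hits@2:", hits_at_k(sample_ranks, 2))
```

Higher values are better for both metrics. MRR rewards placing the correct entity near the top of the ranking, while Hits@k only checks whether it appears within the first k candidates.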

Knowledge inference; Knowledge graph; OCR; Convolutional neural network; Text error correction

Xiaowen Zhang, Hairong Wang, Wenjie Gu

School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China

International conference

9th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2020)

Zhengzhou

English

1266-1276

2020-10-14 (date first posted online on the Wanfang platform; does not represent the paper's publication date)