A Novel Method of Chinese Web Information Extraction and Applications

Abstract:

One promising application of natural language processing (NLP) research is information extraction (IE). In this paper, we present the workflow of our IE system for extracting semantically rich information from unstructured or semi-structured Chinese web pages. A knowledge engineering approach and an automatic training approach are used to extract patterns and build the knowledge repository. A general IE system needs its unlabeled training web pages to be labeled first. We develop a novel methodology that does not require labeled text. It comprises hierarchical filtering pattern matching based on syntax with a best-distance method, and maximum forward boundary recognition using an organization suffix repository and part-of-speech tagging. As an application of IE, we build a new system: an object-level vertical search system in which the objects are Chinese people, so IE is concerned with extracting people's attributes from a collection of web pages about Chinese people. The results are displayed as a hierarchical directory tree organized by people's attributes. The system lets users find people quickly and easily.

Keywords: natural language processing (NLP); information extraction (IE); machine learning (ML)

Authors: Zhong Liu; Ying Wang

Affiliations: Chengdu Institute of Computer Applications, Chinese Academy of Sciences; Sichuan College of Architect; Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, China

Conference type: International conference

Conference name: 2009 WASE International Conference on Information Engineering (ICIE 2009)

Conference location: Taiyuan

Conference language: English

Pages: 65-68

Online publication date: 2009-07-10 (date first made available on the Wanfang platform; not the paper's publication date)
