A Novel Method of Chinese Web Information Extraction and Applications
One promising application of natural language processing (NLP) research is information extraction (IE). In this paper, we present the workflow of our IE system for extracting semantically rich information from unstructured or semi-structured Chinese web pages. A knowledge engineering approach and an automatic training approach are used to extract patterns and build the knowledge repository. A general IE system requires the training web pages to be labeled first. We develop a novel methodology that does not require labeled text; it includes hierarchical filtration pattern matching based on syntax using a best-distance method, and maximum forward boundary recognition using an organization suffix repository and part-of-speech tagging. As an application of IE, we build a new system: an object-level vertical search system in which the objects are Chinese people, so IE is concerned with extracting people's related attributes from a collection of web pages about Chinese people. The results are displayed as a hierarchical directory tree organized by people's attributes. The system helps users find people quickly and easily.
natural language processing (NLP); information extraction (IE); machine learning (ML)
Zhong Liu Ying Wang
Chengdu Institute of Computer Applications, Chinese Academy of Sciences Sichuan College of Architect Chengdu Institute of Computer Applications, Chinese Academy of Sciences Chengdu, China
International conference
2009 WASE International Conference on Information Engineering (ICIE 2009)
Taiyuan
English
65-68
2009-07-10 (date first posted online on the Wanfang platform; not the paper's publication date)