Research of Web Crawler and Web Information Extraction
With the rapid development of the Internet and the growing volume of web data, extracting information from the web quickly and efficiently has become an urgent problem. To make fuller and more effective use of web information, we study web information collection and information extraction technology. Information collection covers web page fetching, URL extraction and its optimization, strategies for preventing repeated fetching of the same page, and other key technologies. Building on these, this paper investigates information extraction applied to the sample pages obtained during information collection. Based on actual requirements, we design and implement an information extraction system based on Htmlparser. The system uses the structural features of web page tags as its extraction rule template. Simulation shows that the system achieves high accuracy and recall and has practical application value.
information collection; information extraction; htmlparser; features tag
Yongfeng DONG, Bin GAO, Hongyong GUO
Hebei University of Technology, Tianjin, China; Hebei Institute of Science & Technology Information, Shijiazhuang, China
International conference
Beijing
English
377-380
2011-06-07 (date first posted online on the Wanfang platform; not the paper's publication date)
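
The abstract names three mechanisms: URL extraction, preventing repeated fetching, and using tag structure as an extraction rule template. The following is a minimal sketch of those ideas, not the authors' implementation. It uses Python's standard-library `html.parser` in place of the Java Htmlparser library the paper builds on. The names `LinkExtractor`, `TagTemplateExtractor`, `new_links`, and `extract` are hypothetical, as is the choice of a (tag, class) pair as the template and of fragment stripping as the URL normalization step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the #fragment so the same
            # page is not queued under two different spellings.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


class TagTemplateExtractor(HTMLParser):
    """Collects text inside elements matching an assumed (tag, class) template."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.records = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so the match
            # closes on the correct end tag.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.records.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.records[-1] += data


def new_links(page_html, base_url, visited):
    """Return links not seen before and mark them as visited."""
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    parser.close()
    fresh = []
    for url in parser.links:
        if url not in visited:
            visited.add(url)
            fresh.append(url)
    return fresh


def extract(page_html, tag, css_class):
    """Apply the (tag, class) template and return whitespace-normalized text."""
    parser = TagTemplateExtractor(tag, css_class)
    parser.feed(page_html)
    parser.close()
    return [" ".join(record.split()) for record in parser.records]


SAMPLE = """
<html><body>
  <a href="/news/1.html">One</a>
  <a href="/news/1.html#top">One again</a>
  <div class="title">Web <b>Crawler</b> Notes</div>
  <div class="ad">ignore me</div>
</body></html>
"""
visited = set()
print(new_links(SAMPLE, "http://example.com/", visited))
print(extract(SAMPLE, "div", "title"))
```

An in-memory set is enough to show the idea of preventing repeated fetching. A crawler working at scale would more likely need a persistent or probabilistic structure, a design choice the abstract does not specify.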