Research of Web Crawler and Web Information Extraction

Abstract:

With the rapid development of the Internet and the continuing growth of web data, extracting information from the web quickly and efficiently has become an urgent problem. To make fuller and more effective use of web information, we study web information collection and information extraction technology. The information collection technology covers web page fetching, URL extraction and its optimization, strategies for preventing repeated fetching, and other key techniques. Building on these, this paper studies information extraction technology aimed at extracting information from the sample pages obtained during information acquisition. According to actual requirements, we design and implement an information extraction system based on Htmlparser. The system uses the structural features of web page tags as its information extraction rule template. Simulation results show that the system achieves high accuracy and recall and has practical application value.

Keywords: information collection; information extraction; htmlparser; features tag

Authors: Yongfeng DONG, Bin GAO, Hongyong GUO

Author affiliations: Hebei University of Technology, Tianjin, China; Hebei Institute of Science & Technology Information, Shijiazhuang, China

Conference type: International conference

Conference name: International Council for Scientific and Technical Information Annual Conference (2011 Summer Annual Conference, ICSTI 2011)

Conference location: Beijing

Conference language: English

Pages: 377-380

Online publication date: 2011-06-07 (date first posted on the Wanfang platform; not the paper's publication date)
