Conference Special Topic

Research of Web Crawler and Web Information Extraction

With the rapid development of the Internet and the continuing growth of web data, extracting information from the web quickly and efficiently has become an urgent problem. To make fuller and more effective use of web information, we study web information collection and information extraction technology. Information collection covers web page fetching, URL extraction and its optimization, strategies for preventing repeated fetching, and other key techniques. Building on these, this paper studies information extraction applied to the sample pages obtained during information collection. Based on practical requirements, we design and implement an information extraction system built on Htmlparser. The system uses the structural features of web page tags as its extraction rule template. The simulation shows that the system achieves high accuracy and recall and has practical application value.
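The collection and extraction steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it uses Python's standard `html.parser` module in place of the Htmlparser library, and the names `LinkAndTagExtractor` and `process_page`, as well as the `div`/`title` tag rule, are hypothetical choices made for illustration. The sketch resolves and normalizes links found on a page, skips URLs already seen, and pulls out the text of elements matching one tag-and-class rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkAndTagExtractor(HTMLParser):
    """Collects links and the text of elements matching one tag/class rule."""

    def __init__(self, base_url, target_tag, target_class):
        super().__init__()
        self.base_url = base_url
        self.target_tag = target_tag
        self.target_class = target_class
        self.links = []    # absolute URLs with fragments removed
        self.fields = []   # text extracted from matching elements
        self._depth = 0    # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop "#fragment" so that
            # equivalent URLs compare equal during de-duplication.
            absolute, _fragment = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.links.append(absolute)
        if self._depth:
            # Track nested elements of the same tag to find the right close.
            if tag == self.target_tag:
                self._depth += 1
        elif (tag == self.target_tag
              and self.target_class in (attrs.get("class") or "").split()):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.target_tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace in the collected text.
                self.fields.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def process_page(html, base_url, visited, target_tag="div", target_class="title"):
    """Return (unseen links, extracted fields) and record the links in visited."""
    parser = LinkAndTagExtractor(base_url, target_tag, target_class)
    parser.feed(html)
    parser.close()
    new_links = []
    for url in parser.links:
        # Keep only web URLs that have not been queued or fetched before.
        if url.startswith(("http://", "https://")) and url not in visited:
            visited.add(url)
            new_links.append(url)
    return new_links, parser.fields
```

A crawler loop would fetch each URL returned in the first element, call `process_page` again on the response body, and share one `visited` set across all pages. A production system would likely replace the in-memory set with a persistent or probabilistic structure, which is outside what this sketch covers.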

information collection; information extraction; Htmlparser; features tag

Yongfeng DONG, Bin GAO, Hongyong GUO

Hebei University of Technology, Tianjin, China; Hebei Institute of Science & Technology Information, Shijiazhuang, China

International conference

International Council for Scientific and Technical Information Annual Conference (International Council for Scientific and Technical Information 2011 Summer Annual Conference, ICSTI 2011)

Beijing

English

377-380

2011-06-07 (date first made available online on the Wanfang platform; not the paper's publication date)