A Novel Method for Extracting Entity Data from Deep Web Precisely

Abstract:

To make better use of the information hidden in the Deep Web and to gain fast, accurate access to the entity data embedded in it, this paper presents a method for precisely extracting entity data from the Deep Web and designs an entity extraction system that extracts the data automatically. First, a web crawler is designed around the characteristics of the Deep Web and used to fetch resources from the Internet. Second, the fetched web resources are preprocessed, and non-standard pages are normalized. Finally, the entity data are located and extracted: based on the hierarchy and layout features of the DOM tree, XPath is combined with regular expressions (RegExp) to locate the entity data, and the extracted entity attributes and attribute values are then stored. Experiments show that the method locates and extracts entity data from the Deep Web quickly and efficiently, with higher accuracy.
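The locate-and-extract step can be illustrated with a minimal sketch. It is not the authors' system: the page content, the `detail` table class, the `LABEL` pattern, and the `extract_entity` function are all hypothetical, and the page is assumed to have already been normalized into well-formed markup so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already normalized into well-formed markup
# (the preprocessing step); the table layout is invented for illustration.
PAGE = """<html><body>
  <div class="nav">Home | Products</div>
  <table class="detail">
    <tr><td>Name:</td><td>Acme Phone X</td></tr>
    <tr><td>Price:</td><td>USD 299.00</td></tr>
    <tr><td>Weight:</td><td>150 g</td></tr>
  </table>
</body></html>"""

# RegExp step: an attribute label is any text followed by an ASCII or
# full-width colon (an assumed convention, not taken from the paper).
LABEL = re.compile(r"^\s*(.+?)\s*[:：]\s*$")


def extract_entity(page: str) -> dict:
    """Return {attribute: value} pairs found in the detail table of `page`."""
    root = ET.fromstring(page)
    entity = {}
    # XPath step: use the DOM hierarchy to reach the rows of the detail table.
    for row in root.findall(".//table[@class='detail']/tr"):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if len(cells) != 2:
            continue  # not an attribute/value row
        match = LABEL.match(cells[0])
        if match:
            # Store the attribute name (without the colon) and its value.
            entity[match.group(1)] = cells[1]
    return entity


if __name__ == "__main__":
    print(extract_entity(PAGE))
```

The XPath expression narrows the search to one region of the DOM tree, so the regular expression only has to recognize labels inside that region; navigation text elsewhere on the page is never examined.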

Keywords: Deep Web; DOM; Entity Extraction

Authors: YU Hai-tao, GUO Jian-yi, YU Zheng-tao, XIAN Yan-tuan, YAN Xin

Affiliations: School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Intelligent Information Processing Key Laboratory, Kunming University of Science and Technology, Kunming 650500, China

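Conference type: International conference

Conference: The 26th Chinese Control and Decision Conference (2014 CCDC)

Conference location: Changsha

Conference language: English

Pages: 5049-5053

Online publication date: 2014-05-31 (the date the record first appeared on the Wanfang platform, not the paper's publication date)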
