A Novel Method for Extracting Entity Data from Deep Web Precisely
To make better use of the information hidden in the Deep Web and to access its embedded entity data quickly and accurately, this paper presents a method for precisely extracting entity data from the Deep Web and designs an entity extraction system that extracts the data automatically. First, a web crawler is designed around the characteristics of the Deep Web and used to fetch resources from the Internet. Second, the fetched web resources are preprocessed so that non-standard pages are normalized. Finally, the entity data are located and extracted: based on the hierarchy and layout features of the DOM tree, XPath is combined with regular expressions (RegExp) to locate the entity data, and the extracted entity attributes and attribute values are then stored. Experiments show that the method locates and extracts entity data from the Deep Web quickly and efficiently, and achieves higher accuracy.
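The final step, locating entity data by combining XPath with regular expressions, might look roughly like the following minimal sketch. It is not the authors' implementation: the sample page, the XPath expression, the `name: value` pattern, and the function name `extract_entity` are all assumptions made for illustration, and the page is assumed to have already been normalized into well-formed markup by the preprocessing step.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result page, assumed to be already normalized (well-formed).
PAGE = """
<html><body>
  <div class="nav">Home</div>
  <table id="detail">
    <tr><td>Title: Deep Web Basics</td></tr>
    <tr><td>Price: 25.50</td></tr>
    <tr><td>Advertisement</td></tr>
  </table>
</body></html>
"""

# Assumed text layout for an attribute cell: "name: value".
ATTR_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_entity(page, xpath=".//table[@id='detail']/tr/td"):
    """Locate candidate nodes by XPath, then keep those matching the regex."""
    root = ET.fromstring(page.strip())
    entity = {}
    # XPath uses the DOM hierarchy to narrow the page to the data region.
    for cell in root.findall(xpath):
        # The regex filters out noise cells and splits attribute from value.
        match = ATTR_RE.match(cell.text or "")
        if match:
            entity[match.group(1)] = match.group(2)
    return entity


entity = extract_entity(PAGE)
print(entity)
```

In this sketch the XPath expression does the structural work of narrowing the page to the data region, while the regular expression discards cells that do not look like attribute/value pairs. The resulting dictionary stands in for the stored entity attributes and attribute values.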
Deep Web; DOM; Entity Extraction
YU Hai-tao GUO Jian-yi YU Zheng-tao XIAN Yan-tuan YAN Xin
School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Intelligent Information Processing Key Laboratory, Kunming University of Science and Technology, Kunming 650500, China
International conference
Changsha
English
5049-5053
2014-05-31 (date first posted online on the Wanfang platform; does not represent the paper's publication date)