Design and Implement of Information Extraction System Based on XML

Abstract:

By studying the structure of HTML documents, this paper addresses the problem of Web information extraction using standard XML technology and proposes an XML-based information extraction method. First, the HTML page is parsed into an HTML DOM tree, which is used to clean the page and generate an XHTML document. Next, the XHTML file is analyzed with the DOM methods of Xerces-J, and an XPath generation algorithm is constructed. Finally, the strengths of XSLT and XPath in data location and transformation are used to learn and generate the information extraction rules automatically, and Web information is extracted according to the generated XPath.
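
The pipeline described in the abstract (locate a sample value in a cleaned XHTML tree, derive an XPath for it, then reuse that XPath as an extraction rule) can be illustrated with a minimal sketch. This is not the authors' algorithm or their Xerces-J implementation: it uses Python's standard xml.etree.ElementTree module instead, assumes the page is already well-formed XHTML without namespaces, and the names SAMPLE_PAGE, NEW_PAGE, generate_xpath, and extract are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-cleaned XHTML pages that share one template.
SAMPLE_PAGE = """<html>
  <body>
    <div><p>Navigation</p></div>
    <div>
      <p>Title: XML Basics</p>
      <p>Price: 42</p>
    </div>
  </body>
</html>"""

NEW_PAGE = """<html>
  <body>
    <div><p>Navigation</p></div>
    <div>
      <p>Title: XSLT Cookbook</p>
      <p>Price: 17</p>
    </div>
  </body>
</html>"""


def generate_xpath(root, sample_text):
    """Return a positional path to the first element whose text contains sample_text."""

    def walk(node, path):
        if node.text and sample_text in node.text:
            return path
        counts = {}  # per-tag sibling counter, used for the [n] predicates
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, ".")


def extract(root, xpath):
    """Apply a generated path to a page and return the stripped text, if any."""
    node = root.find(xpath)
    return node.text.strip() if node is not None and node.text else None


# Learn the rule from a sample page, then apply it to a page with the same layout.
rule = generate_xpath(ET.fromstring(SAMPLE_PAGE), "Price")
value = extract(ET.fromstring(NEW_PAGE), rule)
print(rule, value)
```

A purely positional path like this is brittle when the page layout changes; it is shown only to make the learn-then-apply idea concrete.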

Keywords: Information Extraction; XML; XPath; XSLT; Extraction Rule

Authors: Yanyan Xuan, Yan Hu

Affiliation: Dept. of Computer Science & Technology, Wuhan University of Technology, Wuhan 430070, China

Conference type: International conference

Conference name: 2008 International Symposium on Distributed Computing and Applications for Business Engineering and Science

Conference location: Dalian

Conference language: English

Pages: 1400-1404

Online publication date: 2008-07-27 (date first posted on the Wanfang platform; not the paper's publication date)
