HTML Tree Parsing Algorithm Based on Pre-extracted Data

Abstract:

This paper proposes a new method for extracting an HTML tree from web pages. The main idea is to handle in advance the parts of a web page that are hard to parse, including tags and attributes; the remaining parts are then tidied and parsed, and the previously extracted parts are finally stored in the tree. Because it integrates the tidying and parsing processes, the method both preserves the integrity of the web data and reduces the complexity of the algorithms. Tests show that it can parse all kinds of web pages and provides concrete fault-tolerance mechanisms.
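
The pipeline described in the abstract (pre-extract, tidy and parse, re-attach) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say which parts count as hard to parse, so the sketch assumes they are `script` blocks, `style` blocks, and comments. The `x-raw` placeholder tag, the `Node` class, and the function names are hypothetical, and the "tidying" step is reduced to a simple rule that closes unclosed elements and ignores stray end tags.

```python
import re
from html.parser import HTMLParser

# Assumption: "hard to parse" regions are comments and script/style blocks.
RAW = re.compile(r"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>", re.S | re.I)
VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects or text strings
        self.raw = None     # original source of a pre-extracted region


def pre_extract(html):
    """Step 1: replace hard-to-parse regions with placeholder elements."""
    store = []

    def stash(match):
        store.append(match.group(0))
        return f'<x-raw data-id="{len(store) - 1}"></x-raw>'

    return RAW.sub(stash, html), store


class TreeBuilder(HTMLParser):
    """Steps 2 and 3: tolerant parse, re-attaching the stored regions."""

    def __init__(self, store):
        super().__init__(convert_charrefs=True)
        self.store = store
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag == "x-raw":
            # Placeholder: store the extracted source in the tree as a leaf.
            node.tag = "#raw"
            node.raw = self.store[int(node.attrs["data-id"])]
        elif tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Fault tolerance: close back to the nearest matching open element,
        # and ignore an end tag that matches nothing.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(data)


def parse(html):
    cleaned, store = pre_extract(html)
    builder = TreeBuilder(store)
    builder.feed(cleaned)
    builder.close()  # flush any buffered trailing text
    return builder.root


page = '<div><p>Hi<script>if (a < b) { x = "</p>"; }</script><b>there</div>'
tree = parse(page)
```

In this example the script body contains `<` and a literal `</p>`, which would confuse a naive tag scanner. Because the block is lifted out before parsing, the parser only sees a placeholder, and the original script text is kept intact in the `raw` field of the corresponding leaf node.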

Keywords: HTML parsing; web pages tidying; information extracting

Authors: Mingqiu Song, Ruixue Zhang, Duo Gang

Affiliation: Institute of Systems Engineering, Dalian University of Technology, Dalian 116023, China

Conference type: International conference

Conference name: Eighth International Conference on Mobile Business

Conference location: Dalian

Conference language: English

Pages: 249-254

Online publication date: 2009-06-27 (the date the record first appeared on the Wanfang platform, not the paper's publication date)
