Conference topic

A Web Page Segmentation Algorithm for Extracting Product Information

With the rapid development of the Internet, the web has become the most popular and the largest resource from which people acquire information. At the same time, search engines play an important role in retrieving that information. However, the smallest processing unit of a search engine is the whole web page, which contains a great deal of noisy information. If the relevant information can be extracted and used as the smallest processing unit instead, search engine precision can improve; this is the motivation for page segmentation algorithms. Traditional algorithms, however, cannot extract blocks at the product level. Hence, a novel algorithm based on product features and the DOM (Document Object Model) is proposed. Compared with traditional algorithms, the new page segmentation algorithm greatly improves information consistency and also reduces complexity.

Information Retrieval; Search Engine; Product Block; Page Segmentation

Changjun Wu, Guosun Zeng, Guorong Xu

Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; Tongji Branch, National Engineering & Technology Center of High Performance Computer, Shanghai 201804, China

International conference

2006 IEEE International Conference on Information Acquisition

Weihai, Shandong, China

English

1374-1379

2006-08-20 (date the record first went online on the Wanfang platform; not the paper's publication date)