Conference Topic

VIPS-based Web Cleaning Algorithm

After analyzing the features and drawbacks of traditional web page cleaning methods, this paper presents a new VIPS-based web page cleaning algorithm. The algorithm first partitions every web page into blocks with the VIPS method, then computes how frequently each block appears across the website by judging the similarity among page blocks, and finally computes the importance of each block from its appearance frequency, text quantity, position, and number of links. A block whose importance falls below a threshold is treated as noise content. The proposed algorithm is applied as a preprocessing step for the KNN classification algorithm, and the experimental results show that it can effectively improve the accuracy of web page classification.
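The abstract names the four signals used for block importance but gives neither the formula nor the similarity measure. The following Python is a minimal sketch of the scoring and thresholding steps under stated assumptions, not the authors' implementation: it assumes the pages have already been segmented into blocks (VIPS itself is not implemented here), it uses token-set Jaccard similarity as a stand-in for the paper's block similarity, and the `Block` fields, the weights, the thresholds, and the direction of each signal (frequent or link-heavy blocks score lower) are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One page block, assumed to come from a prior VIPS segmentation pass."""
    page_id: str
    text: str
    link_count: int
    centrality: float  # assumption: 1.0 = page centre, 0.0 = page edge


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in for the paper's similarity test)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def appearance_frequency(block, blocks, sim_threshold=0.8):
    """Fraction of the site's pages that contain a block similar to `block`."""
    pages = {b.page_id for b in blocks}
    hits = {b.page_id for b in blocks
            if jaccard(block.text, b.text) >= sim_threshold}
    return len(hits) / len(pages)


def importance(block, blocks, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted score in [0, 1]; the weights are illustrative, not from the paper."""
    w_freq, w_text, w_pos, w_link = weights
    freq = appearance_frequency(block, blocks)
    words = len(block.text.split())
    max_words = max(len(b.text.split()) for b in blocks) or 1
    text_score = words / max_words
    link_density = min(1.0, block.link_count / max(words, 1))
    return (w_freq * (1.0 - freq)          # site-wide repeats look like templates
            + w_text * text_score          # more text suggests real content
            + w_pos * block.centrality     # central blocks favoured
            + w_link * (1.0 - link_density))  # link-heavy blocks penalised


def clean(blocks, threshold=0.5):
    """Keep only blocks whose importance reaches the (hypothetical) threshold."""
    return [b for b in blocks if importance(b, blocks) >= threshold]
```

In this sketch, a navigation bar repeated on every page receives a frequency of 1.0 and a high link density, so it scores low and is dropped, while a block of unique body text near the page centre is kept. The cleaned blocks would then be passed to the classifier in place of the raw pages.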

Page Cleaning; VIPS; Page Classification

Hongxia Shi, Xun Wang, Yun Pan

College of Computer & Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310035, P.R. China

International conference

2006 International Symposium on Distributed Computing and Applications to Business, Engineering and Science

Hangzhou

English

1107-1110

2006-10-12 (date first posted online on the Wanfang platform; not the paper's publication date)