VIPS-based Web Cleaning Algorithm
Based on an analysis of the features and drawbacks of traditional web page cleaning methods, this paper presents a new VIPS-based web page cleaning algorithm. The algorithm first partitions each web page into blocks with the VIPS method, then computes how frequently each page block appears across the website by judging the similarity among page blocks, and finally computes the importance of each page block from its appearance frequency, text quantity, position, and number of links. Blocks whose importance falls below a threshold are treated as noise content. The proposed algorithm is applied to the KNN classification algorithm, and the experimental results indicate that it can effectively improve the accuracy of web page classification.
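The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual similarity measure, importance formula, weights, or threshold, so everything below is an assumption made for illustration: the `Block` fields, the `centrality` position feature, the text similarity based on `difflib.SequenceMatcher`, the linear weighting, and the cutoff values are all hypothetical. VIPS segmentation itself is not implemented here; each page is assumed to be already split into blocks.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Block:
    text: str          # visible text of the block
    link_count: int    # number of hyperlinks inside the block
    centrality: float  # assumed position feature: 0.0 (page edge) to 1.0 (page center)


def similar(a: Block, b: Block, cutoff: float = 0.9) -> bool:
    """Assumed similarity test: two blocks match if their text is nearly identical."""
    return SequenceMatcher(None, a.text, b.text).ratio() >= cutoff


def block_frequency(block: Block, pages: list[list[Block]]) -> float:
    """Fraction of the site's pages that contain a block similar to `block`."""
    hits = sum(any(similar(block, other) for other in page) for page in pages)
    return hits / len(pages)


def importance(block: Block, pages: list[list[Block]]) -> float:
    """Illustrative score in [0, 1]; the weights are assumptions, not the paper's values.

    Blocks that are rare across the site, rich in text, centrally placed,
    and sparse in links score higher.
    """
    freq = block_frequency(block, pages)
    words = len(block.text.split())
    text_score = min(words / 50.0, 1.0)              # saturates at 50 words
    link_density = block.link_count / max(words, 1)  # links per word
    link_score = 1.0 - min(link_density, 1.0)
    return (0.4 * (1.0 - freq)
            + 0.3 * text_score
            + 0.2 * block.centrality
            + 0.1 * link_score)


def clean(page: list[Block], pages: list[list[Block]],
          threshold: float = 0.5) -> list[Block]:
    """Keep only the blocks whose importance reaches the (assumed) threshold."""
    return [b for b in page if importance(b, pages) >= threshold]
```

Under these assumed weights, a navigation bar that repeats on every page and consists mostly of links scores close to zero and is dropped, while a long, central block that appears on one page only is kept. The cleaned pages would then be passed to the classifier.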
Page Cleaning; VIPS; Page Classification
Hongxia Shi, Xun Wang, Yun Pan
College of Computer & Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310035, P.R. China
International conference
Hangzhou
English
1107-1110
2006-10-12 (date first posted online on the Wanfang platform; not the paper's publication date)