A Maximum Similarity Matching Algorithm for Noise Reduction in Web Pages Based on LCS
This paper presents a maximum similarity matching algorithm for noise reduction in Web pages based on the Longest Common Subsequence (LCS). Specifically, the target page and a similar page are parsed into two characteristic trees, which are then mapped to two characteristic node sequences. The LCS algorithm finds the longest common subsequence of the two sequences, which is a globally optimal solution, and the characteristic nodes that differ between the two trees form a candidate set. The candidate set is clustered and scored to identify the important informative blocks of the Web page. The paper gives a prototype of the algorithm and describes the implementation of each module. Experiments on a set of thousands of Web pages from 4 different sites show that the algorithm is practical and achieves an average accuracy of 95.1%.
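The LCS step summarized in the abstract can be illustrated with a minimal sketch. It assumes each characteristic node has already been reduced to a string signature (here, hypothetical tag paths); the function names `lcs_table` and `candidate_nodes` and the signature format are illustrative assumptions, not the paper's actual implementation, and the clustering and scoring stages are omitted.

```python
def lcs_table(a, b):
    """Build the standard dynamic-programming table of LCS lengths."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp


def candidate_nodes(target, similar):
    """Return the target-page nodes that are not part of the LCS.

    Nodes shared with the similar page (navigation, footers, and so on)
    fall inside the LCS; the remaining nodes form the candidate set.
    """
    dp = lcs_table(target, similar)
    i, j = len(target), len(similar)
    matched = set()  # indices of target nodes that belong to the LCS
    while i > 0 and j > 0:
        if target[i - 1] == similar[j - 1]:
            matched.add(i - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return [node for k, node in enumerate(target) if k not in matched]


# Hypothetical node signatures for a target page and a similar page.
target_seq = ["body/div.nav", "body/div.main/p#a", "body/div.footer"]
similar_seq = ["body/div.nav", "body/div.main/p#b", "body/div.footer"]
candidates = candidate_nodes(target_seq, similar_seq)
```

In this toy input the navigation and footer nodes are common to both pages, so only the differing content node remains as a candidate for the later clustering and scoring steps.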
LCS; Characteristic Tree; Noise reduction in Web Pages
Luo Chuanfei, Song Ao
Dept. of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China
International conference
Taiyuan
English
654-657
2011-02-26 (date first posted online on the Wanfang platform; not the paper's publication date)