Conference Topic

A Maximum Similarity Matching Algorithm for Noise Reduction in Web Pages Based on LCS

This paper presents a maximum similarity matching algorithm, based on the Longest Common Subsequence (LCS), for noise reduction in Web pages. Specifically, the target page and similar pages are parsed into two characteristic trees, which are then mapped to two characteristic node sequences. The LCS algorithm finds the longest common subsequence of these sequences, which is a globally optimal solution, and the characteristic nodes that differ between the two trees form a candidate set. The candidate set is clustered and scored to identify the important informative blocks of the Web page. The paper gives a prototype of the algorithm and describes the implementation of each module. Finally, experiments on a set of thousands of Web pages from 4 different sites show that the algorithm is practical and achieves an average accuracy of 95.1%.
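The LCS step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each characteristic node has already been flattened to a plain string label, and the names `lcs_table` and `differing_nodes` are hypothetical. Tree parsing, clustering, and scoring are not shown.

```python
from typing import List, Sequence, Tuple


def lcs_table(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """Return dp where dp[i][j] is the LCS length of a[i:] and b[j:]."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def differing_nodes(
    target: Sequence[str], similar: Sequence[str]
) -> Tuple[List[str], List[int]]:
    """Return (one LCS of the two sequences, indices of target nodes outside it).

    The indices are the assumed "candidate set": nodes of the target page
    that have no counterpart in the similar page.
    """
    dp = lcs_table(target, similar)
    common: List[str] = []
    candidates: List[int] = []
    i = j = 0
    while i < len(target) and j < len(similar):
        if target[i] == similar[j]:
            # Node shared by both pages: part of the common subsequence.
            common.append(target[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            # Skipping the target node loses nothing: it is page-specific.
            candidates.append(i)
            i += 1
        else:
            j += 1
    # Any remaining target nodes cannot be matched.
    candidates.extend(range(i, len(target)))
    return common, candidates


# Hypothetical node labels for two pages built from the same site template.
target_page = ["header", "nav", "article-A", "comments-A", "footer"]
similar_page = ["header", "nav", "article-B", "footer"]
shared, candidate_idx = differing_nodes(target_page, similar_page)
print(shared)
print([target_page[k] for k in candidate_idx])
```

In this toy input the shared template nodes fall into the common subsequence, while the page-specific nodes are left over as candidates for the later clustering and scoring stage.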

LCS; Characteristic Tree; Noise Reduction in Web Pages

Luo Chuanfei; Song Ao

Dept. of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China

International conference

2011 3rd International Conference on Computer and Network Technology (ICCNT 2011) (2011 3rd IEEE International Conference on Computer and Network Technology)

Taiyuan

English

654-657

2011-02-26 (date first posted online on the Wanfang platform; not the paper's publication date)