Duplicate text detection based on LCS algorithm
Broder's Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, neither method takes the relative position of elements into consideration. This paper proposes SWLR (Shingling with Location Relationship), a method that combines Shingling with the LCS (longest common subsequence) algorithm, together with a pre-filter method to speed up the execution of SWLR. Experimental results show that SWLR performs better than Shingling in both recall and precision, and better than MinHash in recall. With the pre-filter method applied, SWLR can even run faster than MinHash and Shingling.
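The abstract does not describe how SWLR combines the two techniques, so the following is only a minimal sketch of the two building blocks it names, not the authors' method. It assumes word-level tokens and a shingle size of 2; the function names `shingles`, `jaccard`, `lcs_length`, and `lcs_ratio` are hypothetical and chosen for illustration. It shows why a set-based shingle score can miss changes in element order that an LCS-based score picks up.

```python
def shingles(tokens, k=2):
    """Return the set of contiguous k-token shingles (k=2 is an assumption)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_length(x, y):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(x, y):
    """LCS length normalised by the longer sequence (an illustrative choice)."""
    if not x and not y:
        return 1.0
    return lcs_length(x, y) / max(len(x), len(y))


if __name__ == "__main__":
    doc_a = "a b c d e f".split()
    doc_b = "d e f a b c".split()  # same tokens, two halves swapped
    print("shingle Jaccard:", jaccard(shingles(doc_a), shingles(doc_b)))
    print("LCS ratio:", lcs_ratio(doc_a, doc_b))
```

In this toy input the two halves of the document are swapped: four of the six distinct shingles are still shared, while the longest common subsequence covers only half of the tokens, so the order-sensitive score drops further than the set-based one.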
near-duplicate detection; duplicate detection; duplicate text filter
JIANKUN YU; MENGRONG LI; DENGYIN ZHANG
Nanjing University of Posts and Communications, No.66, Xin Mofan Road, Nanjing, China; Jiangsu Nanyou IoT Science Park Co. Ltd, No.66, Xin Mofan Road, Nanjing, China
International conference
Chongqing
English
5-9
2016-05-21 (date first made available online on the Wanfang platform; not the paper's publication date)