Duplicate text detection based on LCS algorithm

Abstract:

Broder's shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, neither method takes the relative position of elements into account. This paper proposes a method called SWLR (Shingling with Location Relationship), which combines shingling with the LCS algorithm, together with a pre-filter method that speeds up SWLR. Experimental results show that SWLR outperforms shingling in both recall and precision, and outperforms MinHash in recall. With the pre-filter applied, SWLR can even run faster than both MinHash and shingling.
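The point about relative position can be illustrated with a small sketch. This record does not describe SWLR itself, so the code below is not the authors' method. It only contrasts a set-based shingle resemblance (Jaccard similarity) with an order-aware LCS length, reading LCS as longest common subsequence. The function names, the word-level tokenization, and the shingle size `k` are all assumptions made for illustration.

```python
def shingles(tokens, k):
    """Return the set of contiguous k-token shingles (order inside the set is lost)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a, b, k=2):
    """Jaccard similarity of the two shingle sets (hypothetical helper)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two documents with no shingles are treated as identical
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Two toy documents with the same words in a different order.
doc_a = "a b c d".split()
doc_b = "c d a b".split()

set_score = resemblance(doc_a, doc_b, k=1)   # shingle sets are identical
order_score = lcs_length(doc_a, doc_b)       # only part of the order is shared
```

With single-word shingles the two toy documents look identical to the set-based measure, while the LCS length covers only half of the tokens. Any way of combining the two signals into one score would be a further assumption beyond what the abstract states.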

Keywords: near-duplicate detection; duplicate detection; duplicate text filter

Authors: JIANKUN YU, MENGRONG LI, DENGYIN ZHANG

Affiliations: Nanjing University of Posts and Communications, No. 66, Xin Mofan Road, Nanjing, China; Jiangsu Nanyou IoT Science Park Co. Ltd, No. 66, Xin Mofan Road, Nanjing, China

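Conference type: International conference

Conference name: 2016 International Conference on Information Technology and Mechatronics Engineering

Conference location: Chongqing

Conference language: English

Pages: 5-9

Online publication date: 2016-05-21 (date first made available on the Wanfang platform; not the paper's publication date)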
