Conference topic

Duplicate text detection based on LCS algorithm

  Broder's Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, neither method takes the relative position of elements into consideration. This paper proposes a method called SWLR (Shingling with Location Relationship), which combines Shingling with the LCS algorithm, and a pre-filter method to speed up the execution of SWLR. Experimental results show that SWLR performs better than Shingling in both recall and precision, and better than MinHash in recall. With the pre-filter method applied, SWLR can even execute faster than MinHash and Shingling.
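  The abstract does not spell out how SWLR combines its two ingredients, so the following is only a minimal sketch of the building blocks it names, not the authors' method. It computes a Shingling-style resemblance (Jaccard similarity over sets of w-token shingles) and the length of the longest common subsequence (LCS). The function names and the parameter `w` are hypothetical, chosen for illustration.

```python
def shingles(tokens, w=2):
    """Return the set of contiguous w-token shingles of a token sequence."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=2):
    """Jaccard similarity of the two shingle sets (order-insensitive)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # assumption: two inputs with no shingles count as identical
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence (order-sensitive),
    computed by dynamic programming with one row of state."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


a = list("abcd")
b = list("dcba")  # same elements, reversed order
print(resemblance(a, b, w=1))  # set-based score ignores element order
print(lcs_length(a, b))        # LCS reflects the lost ordering
```

  With single-element shingles (`w=1`), the two sequences share exactly the same shingle set, so the set-based score cannot tell them apart, while the LCS length drops because the order is reversed. This is one way to read the abstract's remark that set-based methods ignore the relative position of elements.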

near-duplicate detection; duplicate detection; duplicate text filter

JIANKUN YU, MENGRONG LI, DENGYIN ZHANG

Nanjing University of Posts and Communications, No. 66, Xin Mofan Road, Nanjing, China; Jiangsu Nanyou IoT Science Park Co. Ltd, No. 66, Xin Mofan Road, Nanjing, China

International conference

2016 International Conference on Information Technology and Mechatronics Engineering

Chongqing

English

5-9

2016-05-21 (date first posted online on the Wanfang platform; not the paper's publication date)