Effective Indices for Efficient Approximate String Search and Similarity Join

Abstract:

Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: given a collection of strings, how can we find the strings similar to a given query string, or the strings in another collection whose similarity exceeds some threshold? We propose an NFA-based (nondeterministic finite automaton) method for effective approximate string search. We model the strings as a trie and construct an NFA on top of the trie. We identify the similar strings by running the NFA, based on tree automata theory. Moreover, we propose a grouped trie that further improves the performance of similarity search by incorporating effective pruning techniques. We have implemented our method, and the experimental results show that our approach achieves high performance and outperforms existing state-of-the-art methods by orders of magnitude.
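To make the idea in the abstract concrete, the following is a minimal sketch of edit-distance search over a trie. It is not the authors' NFA construction or their grouped trie; it swaps in the standard dynamic-programming row technique, where each trie node carries one row of the edit-distance table (the row plays the role of the set of active automaton states). The names `TrieNode`, `build_trie`, and `search` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set only where an indexed string ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every string into a shared-prefix trie."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, distance) pairs with edit distance <= k to `query`."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance from "" to query[:j] is j.
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        for ch, child in node.children.items():
            # Extend the trie prefix by `ch` and compute the next DP row.
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match/substitution
            # Prune: if every entry exceeds k, no descendant can match.
            if min(new_row) <= k:
                stack.append((child, new_row))
    return sorted(results)
```

For example, `search(build_trie(["trie", "tree", "tried", "three", "nfa"]), "tree", 1)` returns the strings within edit distance 1 of "tree". Because strings with a common prefix share trie nodes, each table row is computed once per prefix rather than once per string, and the pruning test discards whole subtrees early. A similarity join could be sketched, under the same assumptions, by running `search` once for each string of the second collection.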

Authors: Xuhui Liu, Guoliang Li, Jianhua Feng, Lizhu Zhou

Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, P.R. China

Conference type: International conference

Conference name: The Ninth International Conference on Web-Age Information Management (WAIM 2008)

Conference location: Zhangjiajie

Conference language: English

Online publication date: 2008-07-20 (date first posted on the Wanfang platform; not the paper's publication date)
