b-Bit Minwise Hashing in Practice

摘要：

　　Minwise hashing is a standard technique in the context of search for approximating set similarities.The recent work 26,32 demon- strated a potential use of b-bit minwise hashing 23,24 for ef- ficient search and learning on massive,high-dimensional,binary data (which are typical for many applications in Web search and text mining).In this paper,we focus on a number of critical is- sues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used industrial applications.Minwise hashing requires an expensive preprocessing step that com- putes k (e.g.,500) minimal values after applying the correspond- ing permutations for each data vector.We developed a paralleliza- tion scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 80 and becomes substantially smaller than the data loading time.Reducing the preprocessing time is highly beneficial in practice,e.g.,for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers.Another critical issue is that for very large data sets it becomes im- possible to store a (fully) random permutation matrix,due to its space requirements.Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions,e.g.,the 2-universal (2U) and 4-universal (4U) hash families,can produce very similar learning results as using fully random permu- tations.Experiments on datasets of up to 200GB are presented.

作者: Ping Li Anshumali Shrivastava Arnd Christian K(O)nig

作者单位: Department of Statistics & Biostatistics Deptartment of Computer Science Rutgers University,Piscataw Dept. of Computer Science Cornell University Ithaca,NY 14853 Microsoft Research Microsoft Corporation Redmond,WA 98052

会议类型: 国际会议

会议名称: 第五届亚太网构软件研讨会

会议地点: 长沙

会议语种:英文

页码: 125-134

在线出版日期: 2013-10-23（万方平台首次上网日期，不代表论文的发表时间）

会议专题

b-Bit Minwise Hashing in Practice