会议专题

b-Bit Minwise Hashing in Practice

  Minwise hashing is a standard technique in the context of search for approximating set similarities.The recent work 26,32 demon- strated a potential use of b-bit minwise hashing 23,24 for ef- ficient search and learning on massive,high-dimensional,binary data (which are typical for many applications in Web search and text mining).In this paper,we focus on a number of critical is- sues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used industrial applications.Minwise hashing requires an expensive preprocessing step that com- putes k (e.g.,500) minimal values after applying the correspond- ing permutations for each data vector.We developed a paralleliza- tion scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20  80 and becomes substantially smaller than the data loading time.Reducing the preprocessing time is highly beneficial in practice,e.g.,for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers.Another critical issue is that for very large data sets it becomes im- possible to store a (fully) random permutation matrix,due to its space requirements.Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions,e.g.,the 2-universal (2U) and 4-universal (4U) hash families,can produce very similar learning results as using fully random permu- tations.Experiments on datasets of up to 200GB are presented.

Ping Li Anshumali Shrivastava Arnd Christian K(O)nig

Department of Statistics & Biostatistics Deptartment of Computer Science Rutgers University,Piscataw Dept. of Computer Science Cornell University Ithaca,NY 14853 Microsoft Research Microsoft Corporation Redmond,WA 98052

国际会议

第五届亚太网构软件研讨会

长沙

英文

125-134

2013-10-23(万方平台首次上网日期,不代表论文的发表时间)