Conference topic

Detecting Image Spam using Visual Features and Near Duplicate Detection

Email spam is a much-studied topic, but even though current spam detection software has been gaining a competitive edge against text-based email spam, new advances in spam generation have posed a new challenge: image-based spam. Image-based spam is email that carries the spam message in embedded images, that is, in binary format rather than as text. In this paper, we study the characteristics of image spam and propose two solutions for detecting it, drawing a comparison with existing techniques. The first solution, which uses visual features for classification, offers an accuracy of about 98%, i.e., an improvement of at least 6% over existing solutions. SVMs (Support Vector Machines) are used to train classifiers on judiciously chosen color, texture and shape features. The second solution offers a novel approach to near-duplicate detection in images. It involves clustering of image GMMs (Gaussian Mixture Models) based on the Agglomerative Information Bottleneck (AIB) principle, using Jensen-Shannon divergence (JS) as the distance measure.
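
As a rough illustration of the distance measure named in the abstract, the sketch below computes the Jensen-Shannon divergence between two discrete probability distributions. It is a simplification and not the authors' implementation: the paper applies JS to image GMMs, for which the divergence generally has no closed form and must be approximated. The function names `kl_divergence` and `js_divergence` and the equal default weights are assumptions made here for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence between p and q.

    w_p and w_q are mixing weights that must be positive and sum to 1
    (hypothetical parameter names; equal weights give the symmetric form).
    """
    # Mixture distribution that both inputs are compared against.
    m = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl_divergence(p, m) + w_q * kl_divergence(q, m)


if __name__ == "__main__":
    a = [0.7, 0.2, 0.1]
    b = [0.1, 0.3, 0.6]
    print(js_divergence(a, b))
```

With equal weights the measure is symmetric, is zero for identical distributions, and is bounded above by log 2 (in nats). In an agglomerative scheme, one plausible use is to merge at each step the pair of clusters whose weighted divergence is smallest.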

Email spam; Image analysis; Machine learning

Bhaskar Mehta, Saurabh Nangia, Manish Gupta, Wolfgang Nejdl

Google Inc., Brandschenkestr 110, Zurich, Switzerland; IIT Guwahati, Guwahati 781039, Assam, India; L3S Forschungszentrum, Appelstrasse 4, Hannover, Germany

International conference

The 17th International World Wide Web Conference (WWW08)

Beijing

English

2008-04-21 (date first made available online on the Wanfang platform; does not represent the paper's publication date)