Exploiting Concept Clumping for Efficient Incremental E-Mail Categorization

Abstract:

We introduce a novel approach to incremental e-mail categorization based on identifying and exploiting clumps of messages that are classified similarly. Clumping reflects the local coherence of a classification scheme and is particularly important in a setting where the classification scheme is dynamically changing, such as in e-mail categorization. We propose a number of metrics to quantify the degree of clumping in a series of messages. We then present a number of fast, incremental methods to categorize messages and compare the performance of these methods with measures of the clumping in the datasets to show how clumping is being exploited by these methods. The methods are tested on 7 large real-world e-mail datasets of 7 users from the Enron corpus, where each message is classified into one folder. We show that our methods perform well and provide accuracy comparable to several common machine learning algorithms, but with much greater computational efficiency.

Keywords: concept drift; e-mail classification

Authors: Alfred Krzywicki, Wayne Wobcke

Affiliation: School of Computer Science and Engineering, University of New South Wales, Sydney NSW 2052, Australia

Conference type: International conference

Conference name: 6th International Conference on Advanced Data Mining and Applications (ADMA 2010)

Conference location: Chongqing

Conference language: English

Pages: 244-258

Online publication date: 2010-11-19 (date first made available on the Wanfang platform; not the paper's publication date)
