Text Categorization of Enron Email Corpus Based on Information Bottleneck and Maximal Entropy

Abstract:

This paper addresses text categorization on the Enron email corpus. We use the information bottleneck (IB) method to cluster keywords according to their distribution over class labels, add threads and address groups as features alongside the email text, and apply a maximum entropy model to improve classifier accuracy. Our experimental results show that these measures improve classifier performance, because keywords in emails change rapidly while address groups are much more stable.
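
The keyword-clustering step can be illustrated with a minimal sketch in the spirit of agglomerative information bottleneck: words are merged greedily so that each merge loses as little information about the class label as possible. The abstract does not say which IB variant the authors used, so the greedy merge rule, the function names (`kl`, `merge_cost`, `cluster_words`), and the toy word counts below are all assumptions made for illustration, not the paper's implementation.

```python
import math
from itertools import combinations

def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(a, b):
    """Information about the class label lost by merging clusters a and b.

    Each argument is (prior, p(class | cluster)). The cost is the combined
    prior times the weighted Jensen-Shannon divergence of the two distributions.
    """
    (pa, da), (pb, db) = a, b
    total = pa + pb
    wa, wb = pa / total, pb / total
    mix = [wa * x + wb * y for x, y in zip(da, db)]
    return total * (wa * kl(da, mix) + wb * kl(db, mix))

def cluster_words(counts, n_clusters):
    """Greedily merge words until n_clusters remain.

    counts is a hypothetical input: {word: [count in class 0, count in class 1, ...]}.
    Every word is assumed to occur at least once.
    """
    grand = sum(sum(c) for c in counts.values())
    # Each cluster is (set of words, prior p(cluster), p(class | cluster)).
    clusters = [({w}, sum(c) / grand, [x / sum(c) for x in c])
                for w, c in counts.items()]
    while len(clusters) > n_clusters:
        # Pick the pair whose merge loses the least label information.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]][1:], clusters[ij[1]][1:]))
        (wi, pi, di), (wj, pj, dj) = clusters[i], clusters[j]
        total = pi + pj
        merged = (wi | wj, total, [(pi * x + pj * y) / total for x, y in zip(di, dj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [c[0] for c in clusters]

# Toy counts over two made-up classes (illustrative only, not Enron data).
toy_counts = {"meeting": [9, 1], "agenda": [8, 2], "invoice": [1, 9], "payment": [2, 8]}
word_clusters = cluster_words(toy_counts, 2)
```

Words whose class distributions are similar end up in the same cluster, and each cluster can then serve as a single feature for a downstream classifier such as a maximum entropy model.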

Keywords: text categorization; email corpus; data mining

Authors: Man Wang, Yifan He, Minghu Jiang

Affiliation: Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua University, Beijing 100084, China

Conference type: International conference

Conference: 2010 IEEE 10th International Conference on Signal Processing (ICSP 2010)

Conference location: Beijing

Conference language: English

Pages: 2472-2475

Online publication date: 2010-08-24 (date first posted on the Wanfang platform; not the paper's publication date)
