Improvising the accuracy in Classification of Spam emails through Outlier Detection and Classification techniques

Abstract:

Electronic mail is a common method of exchanging digital messages. Everyone who uses email experiences the problem of spam, so it is essential that spam email be classified correctly. Data mining, a powerful technology with great potential to help companies focus on the most important information in their data warehouses, can be used to classify spam. This work uses the Spambase dataset obtained from the UCI repository. Various classification algorithms (C4.5, C-RT, ID3, Random Tree, etc.) were applied to the dataset to classify each email as spam or normal. The accuracy of the classification algorithms increased after outliers were detected and removed. Univariate outlier detection with the Grubbs test and the sigma rule was applied to the dataset. Nearly 117 instances were detected as outliers and removed. Several of the classification algorithms (Multilayer Perceptron, Naive Bayes Cont, PLS-DA, PLS-LDA, Random Tree) gave good results after the removal of outliers. The Random Tree classification algorithm gave 99.99% accuracy, and the rules it produced are used to predict whether an email is spam or normal. The precision of the classifier was verified with a test dataset.
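The abstract names two univariate outlier checks but gives no implementation details. The following is a minimal Python sketch of how a sigma rule and the Grubbs test statistic might be computed for a single numeric attribute. It is not the authors' procedure: the function names, the threshold k = 3, and the toy column are assumptions made for illustration, and the Grubbs critical value (which depends on the sample size and significance level) is left to the caller.

```python
from statistics import mean, stdev


def sigma_rule_outliers(values, k=3.0):
    """Return indices of values more than k sample standard deviations from the mean.

    k = 3 is an assumed default; the abstract does not state the threshold used.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]


def grubbs_statistic(values):
    """Return (G, index), where G = max |x_i - mean| / s and index locates that value.

    To complete the Grubbs test, G must be compared with a critical value taken
    from the t-distribution for the chosen significance level; that step is omitted.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return 0.0, None
    idx = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[idx] - m) / s, idx


# Hypothetical attribute column: twelve ordinary values and one extreme value.
column = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 0.16, 0.14, 0.13, 0.17, 0.19, 0.1, 9.5]

flagged = sigma_rule_outliers(column)
g_value, g_index = grubbs_statistic(column)
```

In this toy column only the last value is flagged by the sigma rule, and the same value yields the Grubbs statistic. In a workflow like the one described, instances flagged on an attribute would be removed before the classifiers are trained.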

Keywords: Data Mining; Spam; Outliers; Classification Algorithms

Authors: P. Nancy, R. Geetha Ramani, Shomona Gracia Jacob

Affiliations: Research Scholar, Department of Computer Science and Engineering, Rajalakshmi Engineering College, T; Professor & Head, Department of Computer Science and Engineering, Rajalakshmi Engineering College, T

Conference type: International conference

Conference name: 2012 International Conference on Future Communication and Computer Technology (ICFCCT 2012)

Conference location: Harbin

Conference language: English

Pages: 173-179

Online publication date: 2012-05-19 (the date the record first appeared online on the Wanfang platform; not the paper's publication date)
