DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS: An Unsupervised Feature Extraction for Document Clustering

Abstract:

This paper presents experimental results on using maximal substrings as elementary features in document clustering. A maximal substring is a substring whose number of occurrences decreases whenever a single character is added at its head or tail. We extract maximal substrings from the given document set, reduce their variety by a simple frequency-based selection, and represent each document as a bag of maximal substrings. This extraction requires no supervision. Our experiment compares the bag of maximal substrings representation with the bag of words representation in document clustering. To cluster documents, we use Dirichlet compound multinomials, a Bayesian version of multinomial mixtures, and evaluate the results by F-score. For Korean documents, maximal substrings were as effective as words extracted by a dictionary-based morphological analysis. For Chinese documents, maximal substrings were not as effective as words extracted by a supervised segmentation based on conditional random fields. However, one quarter of the clustering results obtained with the bag of maximal substrings representation achieved F-scores better than the mean F-score obtained with the bag of words representation. We conclude that maximal substrings achieve acceptable performance in document clustering.

Keywords: Maximal substrings; Document clustering; Suffix array; Bayesian modeling

Authors: Tomonari Masada, Yuichiro Shibata, Kiyoshi Oguri

Affiliation: Graduate School of Engineering, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, 8528521, Japan

Conference type: International conference

Conference name: 13th International Conference on Enterprise Information Systems (ICEIS 2011)

Conference location: Beijing

Conference language: English

Pages: 1907-1913

Online publication date: 2011-06-08 (date first posted on the Wanfang platform; not the publication date of the paper)
