Document Clustering and Topic Discovery based on Semantic Similarity in Scientific Literature

Abstract:

Unlabeled document collections are becoming increasingly common, and mining them is a major challenge, in particular retrieving relevant documents from a large collection. Clustering text documents groups together documents that share similar topics, and incorporating semantic features improves the accuracy of document clustering methods. To let a user determine at a glance whether the contents of a cluster are of interest, topic discovery methods are required to tag each cluster with a distinct and representative topic. Most existing topic discovery methods assign labels to clusters based on the terms that the clustered documents contain. This paper proposes a modified semantic-based model in which related terms are extracted as concepts for concept-based document clustering with the bisecting k-means algorithm, together with a topic detection method that uses Testor theory to discover meaningful labels for the document clusters based on semantic similarity. The proposed method is compared with the Topic Detection by Clustering Keywords method using F-measure and purity as evaluation metrics. Experimental results show that the proposed semantic-based model outperforms the existing method.
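The abstract names purity and F-measure as evaluation metrics but does not define them. The following is a minimal sketch of how these two metrics are commonly computed for a clustering against known class labels. It assumes the widely used class-weighted clustering F-measure, which may differ from the exact variant used in the paper. The function names, the `clusters` and `labels` data layout, and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster.

    clusters: list of lists of document ids (hypothetical layout).
    labels:   dict mapping document id -> true class label.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[d] for d in c).values()) for c in clusters if c
    )
    return majority / total


def f_measure(clusters, labels):
    """Class-weighted F-measure: for each class take the best-matching
    cluster's F-score, then average weighted by class size (assumed variant)."""
    total = sum(len(c) for c in clusters)
    class_sizes = Counter(labels[d] for c in clusters for d in c)
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = sum(1 for d in c if labels[d] == cls)
            if n_ij == 0:
                continue
            precision = n_ij / len(c)
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_i / total) * best
    return score


# Toy data: six documents, two true classes, two imperfect clusters.
toy_labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
toy_clusters = [[0, 1, 3], [2, 4, 5]]
toy_purity = purity(toy_clusters, toy_labels)
toy_f = f_measure(toy_clusters, toy_labels)
```

On the toy data each cluster holds two documents of its majority class out of three, so both metrics come out to 2/3. A clustering that matches the classes exactly scores 1.0 on both.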

Keywords: Document clustering; Topic discovery; Semantic similarity; Concept; Testor theory

Authors: J. Jayabharathy; S. Kanmani; A. Ayeshaa Parveen

Affiliation: Department of Computer Science & Engineering, Department of Information Technology, Pondicherry Engineering College, Puducherry, India

Conference type: International conference

Conference name: 2011 2nd International Conference on Data Storage and Data Engineering (DSDE 2011)

Conference location: Xi'an

Conference language: English

Pages: 425-429

Online publication date: 2011-05-13 (date first posted on the Wanfang platform; not the paper's publication date)
