Conference Topic

Improving Precision of Inter-Document Similarity Measure by Clustering SVD

Text representation, a fundamental and necessary step in intelligent text processing, is the process of determining index terms for documents and converting the documents into numeric vectors over those index terms. LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) was proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, although its representative quality has been validated, it is usually criticized for its low discriminative power in representing documents. In this paper, clustering SVD, in which SVD is conducted on text clusters rather than on the whole term-document matrix, is proposed to improve the discriminative power of latent semantic indexing based on SVD. The key idea of clustering SVD is to first cluster the texts in a collection and then carry out SVD on these text clusters. We conjecture that the clustering computation involved in SVD will improve the statistical quality of the index terms produced by latent semantic indexing. A Chinese corpus and an English corpus are each used to examine the clustering SVD method. The experiments show that the proposed method does improve the precision of the inter-document similarity measure in comparison with classic LSI based on SVD. Moreover, its advantage over LSI based on SVD becomes more significant as the preservation rate set for matrix approximation decreases.
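The idea in the abstract (cluster the texts first, then run SVD per cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is swapped in here; the "preservation rate" is assumed to mean the fraction of singular values kept in each cluster; and comparing documents by cosine similarity on the approximated columns is likewise an assumption. All function names and the toy matrix are hypothetical.

```python
import numpy as np

def truncated_svd_approx(A, rate):
    """Approximate A by keeping a fraction `rate` of its singular values.
    (Assumed reading of 'preservation rate'; the abstract does not define it.)"""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = max(1, int(np.ceil(rate * len(s))))
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Plain k-means on the rows of X (assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def clustering_svd(A, n_clusters, rate):
    """A is a terms x documents matrix. Documents are clustered, and SVD
    is applied to each cluster's sub-matrix instead of to A as a whole."""
    A = np.asarray(A, dtype=float)
    labels = kmeans_labels(A.T, n_clusters)
    A_hat = np.zeros_like(A)
    for c in np.unique(labels):
        cols = np.where(labels == c)[0]
        A_hat[:, cols] = truncated_svd_approx(A[:, cols], rate)
    return A_hat, labels

def cosine_similarity_matrix(A):
    """Pairwise cosine similarity between the columns (documents) of A."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    An = A / norms
    return An.T @ An

# Toy terms x documents matrix with two rough topics (made-up data).
A = np.array([
    [2, 1, 3, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2, 3],
    [0, 0, 0, 3, 1, 2],
], dtype=float)

A_hat, labels = clustering_svd(A, n_clusters=2, rate=0.5)
similarities = cosine_similarity_matrix(A_hat)
```

For comparison, classic LSI in this sketch would correspond to calling `truncated_svd_approx(A, rate)` on the whole matrix and computing similarities on that result.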

text representation; LSI; SVD; clustering SVD; similarity measure

Wen Zhang, Taketoshi Yoshida, Xijin Tang

School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsu; Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Science

International conference

The 9th International Symposium on Knowledge and Systems Sciences, The 4th Asia-Pacific International Conference on Knowledge Management

Guangzhou

English

61-67

2008-12-11 (date first posted online on the Wanfang platform; not the paper's publication date)