Improving Precision of Inter-Document Similarity Measure by Clustering SVD
Text representation, a fundamental and necessary step in intelligent text processing, is the process of determining index terms for documents and converting the documents into numeric vectors over those index terms. LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) was proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, it is often criticized for its low discriminative power in representing documents, although it has been validated as having good representative quality. In this paper, clustering SVD, in which SVD is conducted on text clusters rather than on the whole term-document matrix, is proposed to improve the discriminative power of latent semantic indexing based on SVD. The key idea of clustering SVD is to first cluster the texts in the collection and then carry out SVD on each of the resulting text clusters. We conjecture that the clustering step combined with SVD will improve the statistical quality of the indexing terms produced by latent semantic indexing. A Chinese corpus and an English corpus are each used to examine the clustering SVD method. The experiments show that the proposed method improves the precision of the inter-document similarity measure in comparison with classic LSI based on SVD. Moreover, its advantage over LSI based on SVD becomes more significant as the preservation rate set for the matrix approximation decreases.
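The key idea stated in the abstract (cluster the texts first, then run SVD within each cluster instead of on the whole term-document matrix) can be illustrated with a minimal Python sketch using NumPy. Several things here are assumptions and not taken from the paper: the function names `truncated_svd`, `clustering_svd`, and `cosine_similarity` are hypothetical, the cluster labels are supplied by the caller because the abstract does not name a clustering algorithm, and the preservation rate is interpreted as the fraction of the sum of singular values retained, which may differ from the authors' definition.

```python
import numpy as np


def truncated_svd(matrix, preservation_rate):
    """Low-rank approximation of `matrix`.

    Assumption: the preservation rate is the fraction of the sum of
    singular values that the kept components must reach.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    total = s.sum()
    if total == 0:
        return matrix.copy()
    cumulative = np.cumsum(s) / total
    # Smallest k whose cumulative share reaches the requested rate.
    k = int(np.searchsorted(cumulative, preservation_rate)) + 1
    k = min(k, len(s))
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def clustering_svd(term_doc, labels, preservation_rate):
    """Apply truncated SVD separately to the columns of each cluster.

    `term_doc` has terms as rows and documents as columns; `labels`
    gives one (externally computed) cluster label per document.
    """
    term_doc = np.asarray(term_doc, dtype=float)
    labels = np.asarray(labels)
    approx = np.zeros_like(term_doc)
    for label in np.unique(labels):
        cols = np.where(labels == label)[0]
        approx[:, cols] = truncated_svd(term_doc[:, cols], preservation_rate)
    return approx


def cosine_similarity(a, b):
    """Cosine similarity between two document vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy term-document matrix: documents 0-1 and 2-3 form two clusters.
toy_matrix = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
toy_labels = [0, 0, 1, 1]
toy_approx = clustering_svd(toy_matrix, toy_labels, preservation_rate=0.5)
```

Inter-document similarity would then be measured between columns of the approximated matrix, for example `cosine_similarity(toy_approx[:, 0], toy_approx[:, 1])`. Classic LSI corresponds to calling `truncated_svd` once on the whole matrix, so the two approaches can be compared at the same preservation rate.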
text representation; LSI; SVD; clustering SVD; similarity measure
Wen Zhang, Taketoshi Yoshida, Xijin Tang
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsu; Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences
International conference
Guangzhou
English
61-67
2008-12-11 (date first posted online on the Wanfang platform; not the paper's publication date)