A Clustering Retrieval System of Chinese information

Abstract:

With the vast and ever-growing volume of electronic documents on the World Wide Web and in digital libraries, it is becoming increasingly difficult for people to find the information they really want. To simplify the search process, clustering methods are used to help users browse search results. However, traditional Chinese information clustering techniques are inadequate because they do not generate clusters with highly readable themes. This paper reformulates the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of documents (typically a list of titles and snippets) returned by a Web search engine, the method first extracts and ranks salient phrases as candidate cluster themes, using an SVR (Support Vector Regression) model learned from human-labeled training data. The documents are then assigned to the relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters. The paper also looks for a reasonable format in which to display the final cluster themes, to help users find documents of interest easily. Experimental results show that the method is feasible and effective.
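
The assignment and merging steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the phrase scores are hypothetical stand-ins for the output of the learned SVR model, documents are matched to phrases by plain substring search, and candidate clusters are merged greedily when their document overlap reaches a hypothetical threshold of 0.5.

```python
from typing import Dict, List, Set, Tuple


def assign_documents(snippets: List[str], phrases: List[str]) -> Dict[str, Set[int]]:
    """Map each salient phrase to the indices of the snippets that contain it."""
    clusters: Dict[str, Set[int]] = {}
    for phrase in phrases:
        members = {i for i, text in enumerate(snippets) if phrase.lower() in text.lower()}
        if members:  # keep only phrases that match at least one snippet
            clusters[phrase] = members
    return clusters


def merge_clusters(
    candidates: Dict[str, Set[int]],
    scores: Dict[str, float],
    threshold: float = 0.5,  # assumed overlap threshold, not taken from the paper
) -> List[Tuple[List[str], Set[int]]]:
    """Greedily merge candidate clusters, visiting phrases from highest to lowest score."""
    merged: List[Tuple[List[str], Set[int]]] = []
    for phrase in sorted(candidates, key=lambda p: scores[p], reverse=True):
        docs = candidates[phrase]
        for themes, members in merged:
            # Overlap relative to the smaller of the two clusters.
            overlap = len(docs & members) / min(len(docs), len(members))
            if overlap >= threshold:
                themes.append(phrase)
                members |= docs  # in-place union keeps the merged cluster updated
                break
        else:
            merged.append(([phrase], set(docs)))
    return merged


# Toy search results (titles/snippets) and hypothetical phrase scores.
snippets = [
    "Jaguar cars for sale",
    "Jaguar cars dealer reviews",
    "Jaguar big cat habitat",
    "Big cat conservation",
]
scores = {"jaguar cars": 0.9, "big cat": 0.8, "dealer": 0.4}

candidates = assign_documents(snippets, list(scores))
result = merge_clusters(candidates, scores)
for themes, members in result:
    print(themes, sorted(members))
```

In this toy run the low-scoring phrase "dealer" is absorbed into the "jaguar cars" cluster because all of its documents already belong there, while "big cat" forms a separate cluster. In the method the abstract describes, the scores would come from the regression model rather than being set by hand.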

Keywords: Salient phrase document clustering performance of clustering theme

Authors: Xin-Guang Sha, Yuan-Chao Liu, Ming Liu, Xiao-Long Wang

Affiliation: Intelligent Technology and Natural Language Processing Lab, Harbin Institute of Technology, No. 92, West Dazhi Street, NanGang, Harbin, 150001, China

Conference type: International conference

Conference name: The 2008 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2008)

Conference location: Beijing

Conference language: English

Online publication date: 2008-10-19 (date first posted on the Wanfang platform; not the paper's publication date)
