Phrase-based Hierarchical Method for Clustering Search Results
When internet users face a large number of search results, document clustering techniques are very helpful. Most of these techniques rely on statistical proximity or dependency between single terms in the documents. Because phrases typically represent the concepts expressed in text more accurately than single terms do, a phrase-based document similarity measure can achieve higher clustering accuracy. This paper presents a phrase-based hierarchical clustering method for search engine results. The method consists mainly of a phrase-based document similarity measure and an improved hierarchical clustering algorithm. The document similarity measure is motivated by a measure of semantic relatedness, the Extended Gloss Overlaps measure. It extracts matching phrases using a novel phrase-based document index model, the Document Index Graph (DIG). To emphasize the effect of these phrases, it assigns each matching phrase a score much greater than the sum of the scores assigned to its constituent terms. An improved hierarchical clustering algorithm (IHCA) is then proposed to cluster the search results. It finds and merges eligible mutual nearest neighbor pairs at each level of the hierarchy. Once the state of the mutual nearest neighbor pairs is stable, the intermediate results are clustered sequentially.
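The phrase-weighting idea in the abstract can be illustrated with a minimal sketch. It assumes the squared-length convention of the Extended Gloss Overlaps measure (a shared phrase of n words scores n squared, so it outweighs n separate single-term matches); the abstract does not state the paper's exact weights. The sketch finds shared phrases with a plain longest-common-run search rather than the Document Index Graph, and the names `tokenize` and `phrase_overlap_score` are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def phrase_overlap_score(doc_a, doc_b):
    """Score two documents by their shared phrases.

    Repeatedly finds the longest run of consecutive words common to both
    documents, adds (run length) ** 2 to the score, masks the run out,
    and stops when no common word remains. This is a brute-force stand-in
    for DIG-based phrase matching, not the paper's implementation.
    """
    a, b = tokenize(doc_a), tokenize(doc_b)
    score = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        prev = [0] * (len(b) + 1)
        # Dynamic programming for the longest common contiguous run.
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_i, best_j = cur[j], i, j
            prev = cur
        if best_len == 0:
            return score
        score += best_len ** 2
        # Mask the matched run with sentinels that differ between the two
        # lists, so masked positions can never match each other again.
        a[best_i - best_len:best_i] = [None] * best_len
        b[best_j - best_len:best_j] = [""] * best_len


phrase_score = phrase_overlap_score("the quick brown fox", "a quick brown dog")
term_score = phrase_overlap_score("quick fox brown", "brown dog quick")
```

In this example the first pair shares the two-word phrase "quick brown", while the second pair shares the same two words only as isolated terms, so the first pair receives the higher score.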
Phrase; hierarchical clustering; document index graph
Yang Ke; Han Baoming; Li Zujie
School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China; Warranty Branch, Beijing Public Transport Holdings Ltd., Beijing 100038, China
International conference
Xiamen
English
1430-1435
2010-05-22 (date first posted online on the Wanfang platform; not the paper's publication date)