Phrase-based Hierarchical Method for Clustering Search Results

Abstract:

Document clustering techniques are very helpful when Internet users face a large number of search results. Most of these techniques rely on statistical proximity or dependency between single terms in the documents. Because phrases typically represent the concepts expressed in text more accurately than single terms, a phrase-based document similarity measure can achieve higher clustering accuracy. This paper presents a phrase-based hierarchical clustering method for search engine results. The method consists mainly of a phrase-based document similarity measure and an improved hierarchical clustering algorithm. The document similarity measure is motivated by a measure of semantic relatedness, the Extended Gloss Overlaps measure. It extracts matching phrases using a novel phrase-based document index model, the Document Index Graph (DIG). To emphasize the effect of these phrases, it assigns each matching phrase a score much greater than the sum of the scores assigned to its constituent terms. An improved hierarchical clustering algorithm (IHCA) is then proposed to cluster the search results. It seeks and merges eligible mutual nearest neighbor pairs at each level of the hierarchy. Once the state of the mutual nearest neighbor pairs is stable, the intermediate results are clustered sequentially.
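The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the scoring formula, the DIG construction, or the eligibility criterion for merging. The sketch assumes that a matching phrase of n words scores n squared (the scoring used by the Extended Gloss Overlaps measure), finds shared phrases by a brute-force scan rather than a Document Index Graph, and treats every mutual nearest neighbor pair as eligible. The names `matching_phrases`, `overlap_score`, and `mutual_nearest_pairs` are hypothetical.

```python
def matching_phrases(a, b):
    """Greedily collect the longest word sequences shared by two token lists."""
    a, b = list(a), list(b)
    found = []
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] is not None and a[i + k] == b[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length == 0:
            return found
        found.append(tuple(a[i:i + length]))
        # Mask matched words so they are not counted a second time.
        a[i:i + length] = [None] * length
        b[j:j + length] = [None] * length


def overlap_score(a, b):
    """Assumed scoring: an n-word matching phrase contributes n**2, not n."""
    return sum(len(p) ** 2 for p in matching_phrases(a, b))


def mutual_nearest_pairs(sim):
    """Return pairs (i, j), i < j, where each item is the other's most similar item."""
    n = len(sim)
    nearest = [max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
               for i in range(n)]
    return [(i, j) for i, j in enumerate(nearest) if i < j and nearest[j] == i]


# Toy "search results", already tokenized (illustrative data only).
docs = [
    "phrase based document clustering".split(),
    "document clustering with phrase based similarity".split(),
    "public transport timetable".split(),
    "urban public transport".split(),
]
sim = [[overlap_score(x, y) for y in docs] for x in docs]
pairs = mutual_nearest_pairs(sim)  # candidate pairs to merge at this level
```

In this toy data the first two documents share two two-word phrases, so their score is 8 instead of the 4 that counting single terms would give. A full hierarchical method would merge the returned pairs, recompute similarities between the merged clusters, and repeat.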

Keywords: phrase; hierarchical clustering; document index graph

Authors: Yang Ke, Han Baoming, Li Zujie

Affiliations: School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China; Warranty Branch, Beijing Public Transport Holdings Ltd., Beijing 100038, China

Conference type: International conference

Conference name: The Third International Symposium on Test Automation & Instrumentation (ISTAI 2010)

Conference location: Xiamen

Conference language: English

Pages: 1430-1435

Online publication date: 2010-05-22 (the date of first posting on the Wanfang platform, not the publication date of the paper)
