Conference topic

Phrase-based Hierarchical Method for Clustering Search Results

Document clustering techniques help internet users cope with large numbers of search results. Most of these techniques rely on statistical proximity or dependency between single terms in the documents. Because phrases typically represent the concepts expressed in text more accurately than single terms do, a phrase-based document similarity measure can achieve higher clustering accuracy. This paper presents a phrase-based hierarchical method for clustering search engine results. The method consists mainly of a phrase-based document similarity measure and an improved hierarchical clustering algorithm. The similarity measure is motivated by a measure of semantic relatedness, the Extended Gloss Overlaps measure. It extracts matching phrases using a novel phrase-based document index model, the Document Index Graph (DIG). To emphasize the effect of these phrases, it assigns each matching phrase a score much greater than the sum of the scores assigned to its constituent terms. An improved hierarchical clustering algorithm (IHCA) is then proposed to cluster the search results. At each level of the hierarchy, it seeks and merges eligible mutual nearest neighbor pairs. Once the state of the mutual nearest neighbor pairs is stable, the intermediate results are clustered sequentially.
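The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `overlap_score` and `mutual_nearest_pairs` are hypothetical, matching phrases are found by brute force rather than with a Document Index Graph, and each n-word phrase is scored n**2, the convention of the Extended Gloss Overlaps measure, so a phrase scores more than the sum of its unit-scored terms. The paper's eligibility test for pairs and its final sequential clustering step are not described in the abstract and are left out.

```python
from typing import List, Tuple


def overlap_score(a: List[str], b: List[str]) -> int:
    """Sum n**2 over greedily chosen longest common word runs of a and b."""
    a, b = list(a), list(b)  # work on copies
    score = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        # Brute-force search for the longest common contiguous run.
        for i in range(len(a)):
            for j in range(len(b)):
                n = 0
                while (i + n < len(a) and j + n < len(b)
                       and a[i + n] == b[j + n]):
                    n += 1
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return score
        score += best_len ** 2
        # Mask matched words with different sentinels so they never match again
        # (assumes real tokens never equal these sentinel strings).
        a[best_i:best_i + best_len] = ["\0a"] * best_len
        b[best_j:best_j + best_len] = ["\0b"] * best_len


def mutual_nearest_pairs(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Return pairs (i, j), i < j, that are each other's most similar item."""
    n = len(sim)
    if n < 2:
        return []
    nearest = [
        max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
        for i in range(n)
    ]
    return [(i, nearest[i]) for i in range(n)
            if i < nearest[i] and nearest[nearest[i]] == i]


# Hypothetical snippets standing in for search results.
docs = [
    "phrase based document clustering".split(),
    "phrase based document index".split(),
    "public transport timetable".split(),
    "public transport network".split(),
]
sim = [[0 if i == j else overlap_score(x, y) for j, y in enumerate(docs)]
       for i, x in enumerate(docs)]
pairs = mutual_nearest_pairs(sim)
```

In this toy data the shared three-word phrase contributes 9 to the similarity rather than the 3 a single-term count would give, and `pairs` holds the pairs that one level of mutual-nearest-neighbor merging would combine.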

Phrase; hierarchical clustering; document index graph

Yang Ke, Han Baoming, Li Zujie

School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China; Warranty Branch, Beijing Public Transport Holdings Ltd., Beijing 100038, China

International conference

The Third International Symposium on Test Automation & Instrumentation (ISTAI 2010)

Xiamen

English

1430-1435

2010-05-22 (date first made available online on the Wanfang platform; this is not the paper's publication date)