TOPIC DISTILLATION AND CLUSTERING ALGORITHM BASED ON THE TOPOLOGY OF PAGES-KEYWORDS
The HITS algorithm has been highly successful and is widely applied in web link analysis. It is used to find authority pages and hub pages among search engine results, and it can also be used to discover web communities. However, because HITS relies on the hyperlinks between pages, it is prone to topic drift. It also requires a set of pages as the base set for its computation, and it cannot be applied to plain text. This paper introduces a new algorithm, PK-TDC, which borrows the iterative idea of HITS. PK-TDC finds authority pages and keywords on the page-keyword topology and clusters the pages by the keywords they contain. Experiments show that PK-TDC performs well at extracting topics and clustering, both on hyperlinked pages and on plain text.
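The abstract does not give the update rules of PK-TDC, so the following is only a minimal sketch of the general idea it describes: a HITS-style mutual reinforcement run on a bipartite page-keyword graph instead of a hyperlink graph. The function names `pk_scores` and `cluster_by_top_keyword`, the L2 normalization, and the rule of grouping each page under its highest-scoring keyword are all assumptions made for illustration, not the authors' method.

```python
from math import sqrt


def pk_scores(pages, iterations=50):
    """HITS-style iteration on a page-keyword graph (illustrative only).

    pages: dict mapping a page id to the set of keywords it contains.
    Returns (page_score, keyword_score) as two dicts.
    """
    keywords = sorted({k for kws in pages.values() for k in kws})
    page_score = {p: 1.0 for p in pages}
    kw_score = {k: 1.0 for k in keywords}

    for _ in range(iterations):
        # A page scores highly if it contains highly scored keywords.
        page_score = {p: sum(kw_score[k] for k in kws)
                      for p, kws in pages.items()}
        # A keyword scores highly if it appears in highly scored pages.
        kw_score = {k: sum(page_score[p]
                           for p, kws in pages.items() if k in kws)
                    for k in keywords}
        # L2-normalize both score vectors so values stay bounded.
        for scores in (page_score, kw_score):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for key in scores:
                scores[key] /= norm

    return page_score, kw_score


def cluster_by_top_keyword(pages, kw_score):
    """Assumed clustering rule: group each page under its best keyword."""
    clusters = {}
    for p, kws in pages.items():
        if not kws:
            continue  # a page without keywords joins no cluster
        top = max(sorted(kws), key=lambda k: kw_score[k])
        clusters.setdefault(top, []).append(p)
    return clusters


if __name__ == "__main__":
    # Hypothetical toy data: plain-text pages with no hyperlinks at all.
    docs = {
        "a": {"hits", "link"},
        "b": {"hits", "graph"},
        "c": {"hits"},
        "d": {"cooking"},
    }
    page_score, kw_score = pk_scores(docs)
    print(cluster_by_top_keyword(docs, kw_score))
```

Because the scores are computed from page-keyword incidence alone, a sketch of this kind runs on plain text as well as on hyperlinked pages, which is the property the abstract emphasizes.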
HITS; topic extraction; community search; topic clustering
JIAN-SHUANG DENG, QI-LUN ZHENG, HONG PENG
Department of Computing Science, The South China University of Technology, Guangzhou 510641, China
International conference
2006 International Conference on Machine Learning and Cybernetics (IEEE 5th Machine Learning and Cybernetics Forum)
Dalian
English
1581-1586
2006-08-13 (date first posted online on the Wanfang platform; not the publication date of the paper)