News Page Discovery Policy for Instant Crawlers

Abstract:

Many news pages with high freshness requirements are published on the Internet every day. They should be downloaded immediately by instant crawlers; otherwise, they soon become outdated. In the past, instant crawlers only downloaded pages from a manually generated list of news websites. Bandwidth is wasted on downloading non-news pages because news websites do not publish news pages exclusively. In this paper, a novel approach is proposed to discover news pages. This approach includes seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that our approach outperforms the traditional approach in both precision and recall.

Keywords: web log; user behavior analysis; news page discovery

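Authors: Yong Wang, Yiqun Liu, Min Zhang, Shaoping Ma

Affiliation: State Key Lab of Intelligent Tech. & Sys., Tsinghua University

Conference type: International conference

Conference: 4th Asia Information Retrieval Symposium (AIRS 2008)

Conference location: Harbin

Conference language: English

Pages: 520-525

Online publication date: 2008-01-16 (date first posted on the Wanfang platform; not the paper's publication date)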
