Recrawl Scheduling Based on Information Longevity

Abstract:

It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.

Authors: Christopher Olston, Sandeep Pandey

Affiliations: Yahoo! Research, Santa Clara, California; Carnegie Mellon University, Pittsburgh, Pennsylvania

Conference type: International conference

Conference name: The 17th International World Wide Web Conference (WWW08)

Conference location: Beijing

Conference language: English

Online publication date: 2008-04-21 (date first posted on the Wanfang platform; not the paper's publication date)
