Conference topic

IRLbot: Scaling to 6 Billion Pages and Beyond

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
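The abstract attributes the scaling bottleneck partly to the quadratically increasing cost of verifying URL uniqueness. As a rough illustration of the general idea behind batched, disk-friendly de-duplication, the following is a minimal Python sketch, not the paper's actual data structure; all class names and parameters are hypothetical, and an in-memory set stands in for what a real crawler would keep on disk.

import hashlib

class BatchedSeenCheck:
    # Illustrative sketch only: buffer fixed-size URL hashes in memory and
    # reconcile them against the previously seen set in large batches, so
    # the check costs one sequential pass per batch rather than one random
    # lookup per URL. A real crawler would keep `seen` on disk; a Python
    # set stands in for it here.
    def __init__(self, batch_size=1_000_000):
        self.batch_size = batch_size
        self.pending = set()   # hashes of URLs discovered since the last merge
        self.seen = set()      # stand-in for the on-disk set of known URLs

    @staticmethod
    def _key(url):
        # Fixed-width digest keeps memory use and comparisons cheap.
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        # Queue a discovered URL; merge when the batch is full.
        self.pending.add(self._key(url))
        if len(self.pending) >= self.batch_size:
            return self.merge()
        return None

    def merge(self):
        # One pass over the batch: keep only hashes not seen before and
        # fold them into the known set (one sequential update per batch).
        new_keys = self.pending - self.seen
        self.seen |= new_keys
        self.pending.clear()
        return new_keys

# Hypothetical usage: each merge reports how many URLs in the batch were new.
checker = BatchedSeenCheck(batch_size=3)
for u in ["http://a/", "http://b/", "http://a/", "http://c/"]:
    fresh = checker.add(u)
    if fresh is not None:
        print(len(fresh), "previously unseen URLs in this batch")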

IRLbot; large-scale crawling

Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov

Department of Computer Science, Texas A&M University, College Station, TX 77843, USA

International conference

The 17th International World Wide Web Conference (WWW 2008)

Beijing

English

2008-04-21 (date the record was first posted on the Wanfang platform; this does not indicate the paper's publication date)