Conference topic

IRLbot: Scaling to 6 Billion Pages and Beyond

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
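The abstract attributes the scaling bottleneck partly to the quadratically increasing cost of verifying URL uniqueness. As a rough illustration of the general idea behind batched, disk-friendly de-duplication, the following is a minimal Python sketch, not the paper's actual data structure; all class names and parameters are hypothetical, and an in-memory set stands in for what a real crawler would keep on disk.

import hashlib

class BatchedSeenCheck:
    # Illustrative sketch only: buffer fixed-size URL hashes in memory and
    # reconcile them against the previously seen set in large batches, so
    # the check costs one sequential pass per batch rather than one random
    # lookup per URL. A real crawler would keep `seen` on disk; a Python
    # set stands in for it here.
    def __init__(self, batch_size=1_000_000):
        self.batch_size = batch_size
        self.pending = set()   # hashes of URLs discovered since the last merge
        self.seen = set()      # stand-in for the on-disk set of known URLs

    @staticmethod
    def _key(url):
        # Fixed-width digest keeps memory use and comparisons cheap.
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add(self, url):
        # Queue a discovered URL; merge when the batch is full.
        self.pending.add(self._key(url))
        if len(self.pending) >= self.batch_size:
            return self.merge()
        return None

    def merge(self):
        # One pass over the batch: keep only hashes not seen before and
        # fold them into the known set (one sequential update per batch).
        new_keys = self.pending - self.seen
        self.seen |= new_keys
        self.pending.clear()
        return new_keys

# Hypothetical usage: each merge reports how many URLs in the batch were new.
checker = BatchedSeenCheck(batch_size=3)
for u in ["http://a/", "http://b/", "http://a/", "http://c/"]:
    fresh = checker.add(u)
    if fresh is not None:
        print(len(fresh), "previously unseen URLs in this batch")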

IRLbot; large-scale crawling

Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov

Department of Computer Science, Texas A&M University, College Station, TX 77843, USA

International conference

The 17th International World Wide Web Conference (WWW 2008)

Beijing

English

2008-04-21 (date the record was first posted on the Wanfang platform; this does not indicate the paper's publication date)