A Larger Scale Study of Robots.txt
A website can regulate search engine crawler access to its content using the Robots Exclusion Protocol, specified in its robots.txt file. The rules in the protocol enable the site to allow or disallow part or all of its content to specific crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of 7,593 sites found some evidence of such biases, a finding that led to widespread discussion on the web. In this paper, we report on our survey of about 6 million sites. Our survey addresses the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.
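To make the allow/disallow mechanics concrete, below is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content and crawler names shown are hypothetical illustrations, not drawn from the paper's survey data.

```python
# A minimal sketch of how robots.txt rules can favor one crawler over
# others, checked with Python's standard-library urllib.robotparser.
# The robots.txt content and user-agent names below are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GoodBot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GoodBot may crawl everything except /private/, while all other
# crawlers are disallowed entirely -- the kind of crawler-specific
# bias the survey looks for.
print(parser.can_fetch("GoodBot", "/index.html"))   # True
print(parser.can_fetch("GoodBot", "/private/a"))    # False
print(parser.can_fetch("OtherBot", "/index.html"))  # False
```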
Crawler, robots exclusion, robots.txt, search engine
Santanu Kolay, Paolo D’Alberto, Ali Dasdan, Arnab Bhattacharjee
Yahoo! Inc., Sunnyvale, CA, USA
International conference
The 17th International World Wide Web Conference (WWW08)
Beijing
English
2008-04-21 (date first posted on the Wanfang platform; does not represent the paper's publication date)