Genealogical Trees on the Web: A Search Engine User Perspective

Abstract:

This paper presents an extensive study of the evolution of textual content on the Web, showing that some new pages are created from scratch while others are created from already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of a Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are returned by queries more frequently and clicked more often than other documents.

Keywords: Web text content evolution search engine Web mining

Authors: Ricardo Baeza-Yates, Alvaro Pereira, Nivio Ziviani

Affiliations: Yahoo! Research, Ocata 1, Barcelona, Spain; Federal Univ. of Minas Gerais, Dept. of Computer Science & Barcelona Media, Ocata 1, Barcelona, Spain; Federal Univ. of Minas Gerais, Dept. of Computer Science, Av. Antonio Carlos 6627, ICEx, Belo Horizonte

Conference type: International conference

Conference name: The 17th International World Wide Web Conference (WWW08)

Conference location: Beijing

Conference language: English

Online publication date: 2008-04-21 (date first posted on the Wanfang platform; not the paper's publication date)
