HisTrace: Building a Search Engine of Historical Events

Abstract:

In this paper, we describe an experimental search engine built on our Chinese web archive, which dates back to 2001. The original data set contains nearly 3 billion Chinese web pages crawled over the past 5 years. From this collection, 430 million "article-like" pages are selected and then partitioned into 68 million sets of similar pages. The title and publication date of each page are determined, and an index is built. At search time, the system returns related pages in chronological order. As a result, a user interested in news reports or commentaries about a particular past event can conveniently find a rich set of highly related pages.

Keywords: Web archive; Text mining; Replica detection

Authors: Lian’en Huang; Jonathan J. H. Zhu; Xiaoming Li

Affiliations: Institute of Network Computing and Information Systems, Peking University, Beijing, P.R. China; Dept. of Media & Communication, City University of Hong Kong, Kowloon, Hong Kong

Conference type: International conference

Conference name: The 17th International World Wide Web Conference (WWW08)

Conference location: Beijing

Conference language: English

Online publication date: 2008-04-21 (the date the record first appeared online on the Wanfang platform; not the paper's publication date)
