HisTrace: Building a Search Engine of Historical Events
In this paper, we describe an experimental search engine built on our Chinese web archive, which dates back to 2001. The original data set contains nearly 3 billion Chinese web pages crawled over the past 5 years. From this collection, 430 million “article-like” pages are selected and then partitioned into 68 million sets of similar pages. The title and publication date of each page are determined, and an index is built. At search time, the system returns related pages in chronological order. In this way, a user interested in news reports or commentaries on a past event can conveniently find a rich set of highly related pages.
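As a rough illustration of the pipeline outlined in the abstract (group similar pages, index them, return hits in chronological order), a minimal Python sketch follows. It is not the authors' implementation: the `Page` type, all function names, and the sample data in the usage note are hypothetical. Replica detection is approximated here by hashing the case- and whitespace-normalized body, which only catches near-exact copies, and tokenization uses a simple `\w+` pattern, whereas real Chinese text would need a word segmenter.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
import hashlib
import re


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one article-like page."""
    url: str
    title: str
    published: date
    body: str


def fingerprint(page):
    # Assumption: a hash of the normalized body stands in for the
    # replica-detection step, whose actual method is not given here.
    normalized = re.sub(r"\s+", " ", page.body).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def group_replicas(pages):
    """Partition pages into sets of copies; keep the earliest of each set."""
    groups = defaultdict(list)
    for page in pages:
        groups[fingerprint(page)].append(page)
    return [min(group, key=lambda p: p.published) for group in groups.values()]


def build_index(pages):
    """Build an inverted index: term -> positions of pages containing it."""
    index = defaultdict(set)
    for position, page in enumerate(pages):
        text = (page.title + " " + page.body).lower()
        for term in re.findall(r"\w+", text):
            index[term].add(position)
    return index


def search(query, pages, index):
    """Return pages matching every query term, oldest first."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted((pages[i] for i in hits), key=lambda p: p.published)
```

A usage example with made-up pages: after `reps = group_replicas(pages)` and `index = build_index(reps)`, calling `search("flood", reps, index)` yields one representative per set of copies, ordered by publication date, so a reader can follow coverage of an event as it unfolded.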
Web archive; Text mining; Replica detection
Lian’en Huang, Jonathan J. H. Zhu, Xiaoming Li
Institute of Network Computing and Information Systems, Peking University, Beijing, China P.R.; Dept of Media & Communication, City University of Hong Kong, Kowloon, Hong Kong
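International conference
The 17th International World Wide Web Conference (WWW08)
Beijing
English
2008-04-21 (date first posted online on the Wanfang platform; this does not represent the paper's publication date)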