Conference topic

Schedule Web Forum Crawling with A Freshness-First Strategy

Web forums have become an important data resource for research because a large amount of user-generated content (UGC) is produced every day. Efficient web forum crawling is therefore a crucial problem. Previous works all focus on crawling all forum threads with minimal overhead. They treat all threads equally and adopt a breadth-first strategy. Some strategies, such as PageRank, consider the difference in link relations. However, none of them considers the difference between new threads and old threads, so they are not efficient enough in real-time applications. In real-time applications, freshness is a significant factor, as users always prefer fresh results to old ones. In this paper, we propose a freshness-first strategy for web forum crawling, which aims to fetch fresher content before less fresh content. The freshness-first strategy is based on a characteristic of web forums: there are usually last update times corresponding to the thread URLs. By detecting the last update times of URLs in board pages, the proposed strategy schedules the crawling order of threads according to their freshness, i.e., the last update time. Experimental results demonstrated that the freshness-first strategy achieved our goal of crawling the freshest content first and significantly outperformed other strategies by 40% in different situations.
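The scheduling idea in the abstract can be illustrated with a minimal sketch. This record does not include the paper's actual algorithm, so everything below is an assumption for illustration: the function name `freshness_first_order`, the `(thread_url, last_update_time)` input format, and the example URLs are hypothetical, and the step of parsing last update times out of a board page is not shown.

```python
import heapq
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def freshness_first_order(board_entries):
    """Return thread URLs ordered from most to least recently updated.

    board_entries: iterable of (thread_url, last_update_time) pairs, assumed
    to have been extracted from a board page already (hypothetical format).
    """
    heap = []
    for position, (url, last_update) in enumerate(board_entries):
        # heapq is a min-heap, so the age key is negated: the most recently
        # updated thread gets the smallest key and is popped first.
        key = -(last_update - EPOCH).total_seconds()
        # position breaks ties so equally fresh threads keep board order.
        heapq.heappush(heap, (key, position, url))

    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical board-page entries.
entries = [
    ("http://forum.example/thread/1", datetime(2011, 12, 1, 10, 0)),
    ("http://forum.example/thread/2", datetime(2011, 12, 3, 9, 30)),
    ("http://forum.example/thread/3", datetime(2011, 12, 2, 8, 0)),
]
crawl_order = freshness_first_order(entries)
```

A priority queue is used here rather than a one-off sort because a crawler would typically keep adding newly discovered threads while it fetches; that design choice is also an assumption, not something stated in the abstract.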

crawling strategy; freshness-first; web forums

Jingtian Jiang, Nenghai Yu

Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China

International conference

2011 International Conference on Computer Science and Network Technology (ICCSNT 2011)

Harbin

English

2027-2032

2011-12-24 (date first made available online on the Wanfang platform; not the paper's publication date)