Schedule Web Forum Crawling with a Freshness-First Strategy
Web forums have become an important data resource for research, as a large amount of user-generated content (UGC) is produced every day. Efficient web forum crawling is therefore a crucial problem. Previous works all focus on crawling all the forum threads with minimal overhead. They treat all threads equally and adopt a breadth-first strategy. Some strategies, such as PageRank, consider differences in link relations. However, none of them considers the difference between new threads and old threads, so they are not efficient enough for real-time applications. In real-time applications, freshness is a significant factor, as users always prefer fresh results to old ones. In this paper, we propose a freshness-first strategy for web forum crawling, which aims to fetch fresher content before less fresh content. The freshness-first strategy is based on a characteristic of web forums: thread URLs are usually accompanied by their last update times. By detecting the last update times of URLs in board pages, the proposed strategy schedules the crawling order of threads according to their freshness, i.e., the last update time. Experimental results demonstrated that the freshness-first strategy achieved our goal of crawling the freshest content first and significantly outperformed other strategies by 40% in different situations.
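The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `FreshnessFrontier`, its methods, and the sample URLs and timestamps are all hypothetical, and the step of parsing last update times out of a board page is assumed to have already happened. The sketch only shows one plausible way to order thread URLs so that the most recently updated thread is crawled first, using a priority queue.

```python
import heapq
from datetime import datetime, timezone


class FreshnessFrontier:
    """Hypothetical crawl frontier that pops the most recently updated thread first."""

    def __init__(self):
        self._heap = []

    def push(self, url, last_update):
        # heapq is a min-heap, so the timestamp is negated to make
        # the newest last-update time come out first.
        heapq.heappush(self._heap, (-last_update.timestamp(), url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Assumed sample data: (thread URL, last update time) pairs as they might
# be extracted from a board page.
board_page = [
    ("/thread/101", datetime(2011, 5, 1, 9, 0, tzinfo=timezone.utc)),
    ("/thread/102", datetime(2011, 5, 3, 18, 30, tzinfo=timezone.utc)),
    ("/thread/103", datetime(2011, 4, 28, 12, 0, tzinfo=timezone.utc)),
]

frontier = FreshnessFrontier()
for url, last_update in board_page:
    frontier.push(url, last_update)

crawl_order = []
while len(frontier) > 0:
    crawl_order.append(frontier.pop())

print(crawl_order)
```

A breadth-first crawler would instead visit these threads in the order they were discovered, regardless of when each was last updated.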
component; crawling strategy; freshness-first; web forums
Jingtian Jiang, Nenghai Yu
Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China
International conference
Harbin
English
2027-2032
2011-12-24 (date first made available online on the Wanfang platform; not the paper's publication date)