Conference Topics

Graph-based AJAX Crawl: Mining Data From Rich Internet Applications

AJAX (Asynchronous JavaScript and XML) has become increasingly popular with the rise of Web 2.0. However, traditional crawlers fail to retrieve information from AJAX applications because of their complex JavaScript operations. Moreover, a single AJAX application with one URL may have several page states, which violates the assumption that one URL corresponds to one unique page. An AJAX application can be modeled as a state transition graph, and crawling it amounts to traversing that graph without prior knowledge of its structure. In this paper, we distinguish different kinds of AJAX events, which were not well defined in previous work, and propose a Graph-based AJAX State Traversal (GAST) algorithm that crawls AJAX applications with minimal edge visits. If the topology of the graph is given, this optimization problem becomes a Directed Rural Postman Problem (DRPP), and the optimal lower bound on edge visits can be obtained. Experimental results show that the proposed algorithm approaches the optimum and outperforms existing work.
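The graph view in the abstract can be illustrated with a small sketch. This is not the GAST algorithm, whose details are not given in this record; it is a simple greedy online traversal, assuming that events are deterministic and that every state with unfired events stays reachable from the current state. The names `crawl`, `events_of`, `fire`, and `TOY` are hypothetical and exist only for this example.

```python
from collections import deque

def crawl(start, events_of, fire):
    """Explore an unknown state graph online; return (discovered edges, edge visits)."""
    known = {}                                  # state -> {event: next_state}, events already fired
    pending = {start: list(events_of(start))}   # state -> events not yet fired
    current, visits = start, 0

    def path_to_pending(src):
        # BFS over already-discovered edges to the nearest state with unfired events.
        queue, seen = deque([(src, [])]), {src}
        while queue:
            state, path = queue.popleft()
            if pending.get(state):
                return path
            for event, nxt in known.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [event]))
        return None                             # nothing left to explore from here

    while True:
        path = path_to_pending(current)
        if path is None:
            break
        for event in path:                      # replay known edges; each replay is a visit
            current = fire(current, event)
            visits += 1
        event = pending[current].pop()
        nxt = fire(current, event)              # fire a new event and observe the new state
        known.setdefault(current, {})[event] = nxt
        visits += 1
        if nxt not in pending:                  # first time this state is seen
            pending[nxt] = list(events_of(nxt))
        current = nxt
    return known, visits

# Hypothetical three-state application: each key is a page state, each inner key an event.
TOY = {
    "home": {"open_menu": "menu", "next": "page2"},
    "menu": {"close": "home"},
    "page2": {"prev": "home"},
}

graph, visits = crawl("home", lambda s: TOY[s].keys(), lambda s, e: TOY[s][e])
print(graph == TOY, visits)
```

In a real crawler, `fire` would trigger a JavaScript event in a browser and `events_of` would enumerate the clickable elements of the current DOM state. The number of edge visits returned here is the quantity the abstract says the proposed algorithm tries to minimize.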

AJAX Crawl; State Transition Graph; State Traversal; Directed Rural Postman Problem

Zhaomeng Peng, Nengqiang He, Chunxiao Jiang, Zhihua Li, Lei Xu, Yipeng Li, Yong Ren

Department of Electronic Engineering, Tsinghua University, Beijing, China

International conference

2012 IEEE International Conference on Computer Science and Electronic Engineering (ICCSEE 2012)

Hangzhou

English

590-594

2012-03-23 (date first made available online on the Wanfang platform; not the paper's publication date)