A Scalable Crawler Framework for FLOSS Data

Abstract:

Free/Libre/Open Source Software (FLOSS) data, such as bug reports, mailing lists, and related webpages, contain valuable information for reusing open source software projects. Before conducting further experiments on FLOSS data, researchers often need to download the data into a local storage system. We refer to this preprocessing step as FLOSS data retrieval, which in many cases can be a challenging task. In this paper, we propose a crawler framework to ease the process of FLOSS data retrieval. To cope with the various types of FLOSS data scattered across the Internet, we designed the framework in a scalable manner, so that a crawler program can be easily plugged into the system to extend its functionality. Researchers can perform the retrieval process on datasets of various types and sources simply by adding new configurations to the system. We have implemented the framework and provide its basic functions via web-based interfaces. We present the usage of the system through a detailed case study in which we retrieved various types of datasets related to the Apache Lucene project using our framework.

Keywords: FLOSS project data retrieval crawler scalable

Authors: Lingxiao Zhang, Yanzhen Zou, Bing Xie

Affiliation: Software Institute, School of Electronics Engineering and Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing 100871, P.R. China

Conference type: International conference

Conference name: The Fifth Asia-Pacific Symposium on Internetware

Conference location: Changsha

Conference language: English

Pages: 73-79

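Online publication date: 2013-10-23 (the date the record first went online on the Wanfang platform; it does not represent the paper's publication date)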
