MINING COLLECTIVE PAIR DATA FROM THE WEB

Abstract:

Pair data is data consisting of two correlated components. Examples include a book title and its author, a product name and its price, a term and its translation, and a Chinese couplet (a unit of verse consisting of two successive lines). In this paper, based on the observation that pair data tend to co-occur in the same block of the same web page following similar patterns, we propose a new approach to extracting collective pair data. A recursive process is used to extract collective pair data from the Web. An automatic algorithm based on a data structure called the PAT tree first discovers all repeated patterns; these patterns are then ranked with a ranking SVM to obtain trustworthy pair data extraction patterns. Finally, the patterns are transformed using predefined surface pattern classes and applied to extract collective pair data. Experimental results demonstrate that our model achieves higher extraction precision and recall than the previous approach.
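The repeated-pattern discovery step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: brute-force n-gram counting stands in for the PAT tree, and the ranking SVM and surface pattern classes are not covered. The function name `repeated_patterns`, its parameters, and the tokenized block with `TITLE` and `AUTHOR` placeholders are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def repeated_patterns(tokens, min_len=2, max_len=6, min_count=2):
    """Count token n-grams and keep those seen at least min_count times.

    Brute-force stand-in for PAT-tree pattern discovery (an assumption,
    not the paper's method); it yields candidate repeated patterns only.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical web-page block: three list items, each holding a
# (title, author) pair already replaced by placeholder tokens.
block = [
    "<li>", "TITLE", "by", "AUTHOR", "</li>",
    "<li>", "TITLE", "by", "AUTHOR", "</li>",
    "<li>", "TITLE", "by", "AUTHOR", "</li>",
]

patterns = repeated_patterns(block, min_len=5, max_len=5)
full_item = ("<li>", "TITLE", "by", "AUTHOR", "</li>")
print(patterns[full_item])
```

In this toy block the pattern covering a whole list item recurs most often, which is the kind of signal a ranking step could use to prefer it over the shifted variants that straddle two items.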

Keywords: Pattern discovery; Web mining; Ranking SVM

Authors: CONG FAN, LONG JIANG, MING ZHOU, SHI-LONG WANG

Affiliations: School of Software Engineering, Chongqing University, Chongqing, China, 400044; Microsoft Research Asia, 5F Sigma Center, No.49 Zhichun Road, Haidian, Beijing, China, 100080; College of Mechanical Engineering, Chongqing University, Chongqing, China, 400044

Conference type: International conference

Conference name: 2007 International Conference on Machine Learning and Cybernetics (6th IEEE International Conference on Machine Learning and Cybernetics)

Conference location: Hong Kong

Conference language: English

Pages: 3997-4002

Online publication date: 2007-08-19 (date first posted on the Wanfang platform; not the paper's publication date)
