Conference Special Topic

MINING COLLECTIVE PAIR DATA FROM THE WEB

Pair data are data that consist of two correlated components. Examples include a book title and its author, a product name and its price, a term and its translation in another language, and a Chinese couplet (a unit of verse consisting of two successive lines). Based on the observation that pair data tend to co-occur in the same block of the same web page following similar patterns, this paper proposes a new approach to extracting collective pair data from the Web through a recursive process. An automatic algorithm based on a data structure called the PAT tree first discovers all repeated patterns; these patterns are then ranked with a ranking SVM to obtain trustworthy pair data extraction patterns. Finally, the patterns are transformed using predefined surface pattern classes and applied to extract collective pair data. Experimental results show that our model achieves higher extraction precision and recall than the previous approach.
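The repeated-pattern discovery step can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps the PAT tree for plain n-gram counting and swaps the ranking SVM for a simple "longest, then most frequent" heuristic. The function name `repeated_patterns`, its parameters, and the token stream are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def repeated_patterns(tokens, min_len=2, max_len=5, min_count=2):
    """Return {pattern: count} for token n-grams seen at least min_count times.

    Assumption: brute-force n-gram counting stands in for the PAT tree
    named in the abstract; it finds the same repeats, less efficiently.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical token stream for one block of a page listing books.
# TITLE and AUTHOR are placeholders for the variable text slots.
tokens = ["<li>", "TITLE", "by", "AUTHOR", "</li>",
          "<li>", "TITLE", "by", "AUTHOR", "</li>",
          "<li>", "TITLE", "by", "AUTHOR", "</li>"]

patterns = repeated_patterns(tokens)

# Assumption: a naive ranking (longest pattern, then highest count)
# replaces the ranking SVM used in the paper.
best = max(patterns, key=lambda p: (len(p), patterns[p]))
```

Here `best` is the candidate extraction pattern; its TITLE and AUTHOR slots mark where the two components of each pair would be read from the page.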

Pattern discovery; Web mining; Ranking SVM

CONG FAN, LONG JIANG, MING ZHOU, SHI-LONG WANG

School of Software Engineering, Chongqing University, Chongqing, China, 400044; Microsoft Research Asia, 5F Sigma Center, No.49 Zhichun Road, Haidian, Beijing, China, 100080; College of Mechanical Engineering, Chongqing University, Chongqing, China, 400044

International conference

2007 International Conference on Machine Learning and Cybernetics (6th IEEE International Conference on Machine Learning and Cybernetics)

Hong Kong

English

3997-4002

2007-08-19 (date first posted online on the Wanfang platform; not the paper's publication date)