Conference Topic

Web Wrapper Generation Using Tree Alignment and Transfer Learning

This paper studies web wrapper generation for the pages of forum, blog, and news web sites. More and more web pages are dynamically generated from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate a wrapper from this kind of web page. We present a new tree alignment algorithm to find the best matching structure of the input web pages. A form of linear regression is employed to obtain the weights of the different tag matchings. Based on the alignment, we merge the trees into one union tree whose nodes record statistical information gathered from multiple web pages. We use a transfer learning method to find the most likely content block and use the alignment algorithm to detect repeated patterns on the union tree. We then generate a wrapper to extract data from web pages. Experimental results show that the method achieves high extraction accuracy and steady performance.
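The abstract does not give the alignment algorithm itself, so the following is only a minimal sketch of the general idea of weighted tree alignment. It uses the classic Simple Tree Matching dynamic program as a stand-in, not the authors' algorithm. The `Node` class, the `TAG_WEIGHTS` table, and its values are hypothetical; in the paper the tag-matching weights are learned by regression rather than set by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A DOM-like tree node holding only a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical tag-matching weights (made up for illustration only).
TAG_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("b", "strong"): 0.5,
}


def tag_weight(a: str, b: str) -> float:
    """Weight for matching tag a with tag b; identical tags score 1.0."""
    if a == b:
        return 1.0
    return TAG_WEIGHTS.get((a, b), TAG_WEIGHTS.get((b, a), 0.0))


def match_score(a: Node, b: Node) -> float:
    """Weighted Simple Tree Matching: best order-preserving alignment score."""
    w = tag_weight(a.tag, b.tag)
    if w == 0.0:
        return 0.0  # roots cannot be matched, so the subtrees are not aligned
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best score aligning the first i children of a
    # with the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + match_score(a.children[i - 1], b.children[j - 1]),
            )
    return w + dp[m][n]


# Two small pages assumed to come from the same template.
page_a = Node("div", [Node("p"), Node("b")])
page_b = Node("div", [Node("p"), Node("strong"), Node("p")])
score = match_score(page_a, page_b)
```

Here the two `div` roots match, the first `p` nodes align, and `b` aligns with `strong` at the reduced hypothetical weight. A union tree could then be built by merging aligned nodes and counting how many pages each node appears in, which is not shown here.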

wrapper; tree alignment

Yingju Xia, Shu Zhang, Hao Yu

Fujitsu Research & Development Center Co., Ltd., Beijing, China

International conference

The 2nd International Conference on Software Engineering and Data Mining (IEEE SEDM 2010)

Chengdu

English

345-350

2010-06-23 (date first posted online on the Wanfang platform; this is not the paper's publication date)