DESP: An Automatic Data Extractor on Deep Web Pages

Abstract:

We present DESP, an automatic data extractor for Deep Web pages in the book domain that extracts data items and labels their attributes at the same time. DESP's use case is extracting book information such as title, author, price, and publisher from the result pages returned by bookstore web sites. Although DESP targets a specific domain, its method is highly adaptive and can be applied to other domains. The system consists of two parts. The first is the Data Record Locater, whose Modified Data Locating algorithm overcomes the shortcoming of the MDR algorithm. The second is the Attribute Labeler, whose Detect Combine algorithm gives each data item a more explicit meaning.

Keywords: edit distance string similarity algorithm Web

Authors: Ji Ma, Derong Shen, TieZheng Nie

Affiliation: Department of Computer Science and Engineering, Northeastern University, Shenyang, 110004, P.R. China

Conference type: International conference

Conference name: 2010 Seventh Web Information System and Applications Conference (7th National Conference on Web Information Systems and Applications)

Conference location: Hohhot

Conference language: English

Pages: 132-136

Online publication date: 2010-08-20 (date first posted on the Wanfang platform; not the paper's publication date)
