A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration

Abstract:

Most popular methods for web data extraction (WDE) today are top-down and depend on page structure. These techniques do not scale well to complex pages. We therefore put forward a bottom-up approach to WDE, based on entity recognition and integration, that avoids over-dependence on the structure of web pages. The proposed approach first labels primary text sequences and then takes their repetitive patterns into account. We propose a Two-Level extraction model for entity recognition and a repetitive pattern extraction algorithm for entity integration. Our approach effectively reduces attribute labeling mistakes. We support these claims with experimental results, which show that our approach performs better than traditional extraction techniques, especially on complex web pages.

Keywords: web data extraction; entity recognition; entity integration; bottom-up

Authors: Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou

Affiliation: College of Information Science and Engineering, Northeastern University, Shenyang, China

Conference type: International conference

Conference name: 8th National Conference on Web Information Systems and Applications

Conference location: Chongqing

Conference language: English

Pages: 150-155

Online publication date: 2011-10-21 (date first made available on the Wanfang platform; does not represent the paper's publication date)
