Conference topic

A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration

Most popular methods for web data extraction (WDE) today are top-down and depend on page structure. However, these techniques do not scale well to complex pages. We therefore propose a bottom-up approach to WDE based on entity recognition and integration, which avoids over-dependence on the structure of web pages. The approach first labels primary text sequences and also takes their repetitive patterns into account. We propose a Two-Level extraction model for entity recognition and a repetitive pattern extraction algorithm for entity integration. Our approach effectively reduces attribute labeling mistakes. We evaluate the approach experimentally, and the results show that it performs better than traditional extraction techniques, especially on complex Web pages.
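The bottom-up idea in the abstract (label text fragments first, then use repetition among the labels to group them into entities) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the regex rules in `RULES`, the fallback label `name`, and the helper names `label`, `shortest_repeating_pattern`, and `integrate` are hypothetical, and the strict-repetition check is a deliberately simple stand-in, not the Two-Level extraction model or the repetitive pattern extraction algorithm of the paper, neither of which the abstract specifies.

```python
import re

# Hypothetical labeling rules (assumption): the abstract does not describe
# how the paper's Two-Level model assigns attribute labels.
RULES = [
    ("price", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("year", re.compile(r"^(19|20)\d{2}$")),
]


def label(text):
    """Return the first matching rule name, or 'name' as a fallback label."""
    for rule_name, pattern in RULES:
        if pattern.match(text):
            return rule_name
    return "name"


def shortest_repeating_pattern(labels):
    """Return the shortest prefix whose exact repetition yields `labels`."""
    n = len(labels)
    for size in range(1, n + 1):
        if n % size == 0 and labels[:size] * (n // size) == labels:
            return labels[:size]
    return labels


def integrate(texts):
    """Label each text fragment, then group fragments into one record per
    repetition of the label pattern (bottom-up: labels first, structure second)."""
    if not texts:
        return []
    labels = [label(t) for t in texts]
    pattern = shortest_repeating_pattern(labels)
    size = len(pattern)
    return [
        dict(zip(pattern, texts[i:i + size]))
        for i in range(0, len(texts), size)
    ]


if __name__ == "__main__":
    fragments = ["Data Mining", "2006", "$59.99", "Web Search", "2010", "$42.50"]
    for record in integrate(fragments):
        print(record)
```

The sketch only handles label sequences that repeat exactly; real pages with missing or reordered attributes would need a tolerant pattern matcher, which is outside what the abstract describes.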

web data extraction; entity recognition; entity integration; bottom-up

Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou

College of Information Science and Engineering, Northeastern University, Shenyang, China

International conference

The 8th National Conference on Web Information Systems and Applications

Chongqing

English

150-155

2011-10-21 (date first made available online on the Wanfang platform; not the paper's publication date)