A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration
Most popular methods for web data extraction (WDE) today are top-down and depend on page structure. However, these techniques do not scale well to complex pages. We therefore propose a bottom-up approach to WDE based on entity recognition and integration, which avoids over-dependence on the structure of web pages. The approach focuses first on labeling primary text sequences and also takes their repetitive patterns into account. We propose a Two-Level extraction model for entity recognition and a repetitive pattern extraction algorithm for entity integration. Our approach effectively reduces attribute labeling mistakes. Experimental results show that it performs better than traditional extraction techniques, especially on complex Web pages.
web data extraction; entity recognition; entity integration; bottom-up
Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou
College of Information Science and Engineering, Northeastern University, Shenyang, China
International conference
Chongqing
English
150-155
2011-10-21 (date first made available online on the Wanfang platform; this does not indicate the paper's publication date)