General Chinese Webpage Content Extraction Research and Application

Abstract:

Webpage text extraction is the process of extracting structured data that meets research requirements from semi-structured webpages; it is the basis of various web data mining and search applications. After a general introduction to webpage information extraction technology, the paper surveys the existing webpage extraction algorithms, summarizes the advantages and disadvantages of each along with the problems most urgently in need of solutions, and finally offers the authors' own thoughts on the problem, pointing out a direction for follow-up study.

Keywords: webpages text information extraction structure

Authors: GUO Dongxu, WU Peng

Affiliation: School of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China

Conference type: International conference

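Conference name: The 1st International Conference on Information Acquisition and Knowledge Services and the 6th Symposium on Search Behavior and User Cognition (第一届信息获取与知识服务国际会议暨第六届搜索行为与用户认知研讨会)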

Conference location: Wuhan

Conference language: English

Pages: 126-131

Online publication date: 2014-10-10 (date first posted on the Wanfang platform; not the paper's publication date)
