General Chinese Webpage Content Extraction Research and Application
Webpage text extraction is the process of extracting structured data that meets research requirements from semi-structured webpages, and it is the basis of various web data mining and search applications. After a general introduction to webpage information extraction technology, the paper surveys the existing general-purpose webpage extraction algorithms, summarizing the advantages and disadvantages of each and the problems most urgently in need of solutions. Finally, it adds the authors' own thoughts on the problem and points out a direction for follow-up study.
webpages text information extraction structure
GUO Dongxu, WU Peng
School of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China
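International conference
The 1st International Conference on Information Acquisition and Knowledge Services and the 6th Symposium on Search Behavior and User Cognition
Wuhan
English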
126-131
2014-10-10 (date first posted online on the Wanfang platform; not the paper's publication date)