Webpage Information Extraction Based on Parsing DOM Tree by Regular Expression
This paper introduces a webpage extraction technique that uses regular expressions to parse information from the Document Object Model (DOM) tree of a HyperText Markup Language (HTML) page. A detailed parsing procedure is proposed, and a parsing tool is developed to extract essential data from the first page of novels at Qidian.com. The extraction output is briefly analyzed and shows good results. The limitations of the extraction method are also discussed as a basis for further development.
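As a rough illustration of the kind of regular-expression extraction the abstract describes, the following minimal Python sketch pulls a few fields out of an HTML fragment. The record does not give the paper's actual procedure or the Qidian.com markup, so the sample HTML, the class names, and the helper functions `extract_field` and `extract_novel_info` are all hypothetical.

```python
import re

# Hypothetical HTML fragment: the real Qidian.com markup is not given
# in this record, so tag and class names here are assumptions.
SAMPLE_HTML = """
<div class="book">
  <h1 class="title">Example Novel</h1>
  <span class="author">Jane Doe</span>
  <p class="intro">A short
  introduction.</p>
</div>
"""


def extract_field(html, tag, css_class):
    """Return the text of the first <tag class="css_class"> element, or None."""
    pattern = re.compile(
        r'<{tag}\b[^>]*\bclass="{cls}"[^>]*>(.*?)</{tag}>'.format(
            tag=re.escape(tag), cls=re.escape(css_class)
        ),
        re.DOTALL | re.IGNORECASE,  # DOTALL lets the content span several lines
    )
    match = pattern.search(html)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the captured text.
    return " ".join(match.group(1).split())


def extract_novel_info(html):
    """Collect the illustrative fields into a dictionary."""
    return {
        "title": extract_field(html, "h1", "title"),
        "author": extract_field(html, "span", "author"),
        "intro": extract_field(html, "p", "intro"),
    }


if __name__ == "__main__":
    print(extract_novel_info(SAMPLE_HTML))
```

A pattern of this kind only works when the target elements are not nested inside elements of the same tag and the attribute layout is stable, which is one plausible reading of the limitations the abstract mentions.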
CHENYING LI, BIN XU, RUI GU
Management Building 204, Transportation Management College, Dalian Maritime University, Dalian, China; Management Building 118, Transportation Management College, Dalian Maritime University, Dalian, China
International conference
2014 International Conference on Management and Engineering (CME 2014)
Shanghai
English
1-6
2014-05-24 (date first posted online on the Wanfang platform; not the paper's publication date)