Webpage Information Extraction Based on Parsing DOM Tree by Regular Expression

Abstract:

This paper introduces a webpage extraction technique that uses regular expressions to parse information from the Document Object Model (DOM) tree of a HyperText Markup Language (HTML) page. A detailed parsing procedure is proposed, and a parsing tool is developed to extract essential data from the first page of novels at Qidian.com. The extraction output is briefly analyzed and shows good results. The limitations of the extraction method are also discussed to guide further development.

Authors: CHENYING LI, BIN XU, RUI GU

Affiliations: Management Building 204, Transportation Management College, Dalian Maritime University, Dalian, China; Management Building 118, Transportation Management College, Dalian Maritime University, Dalian, China

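Conference type: International conference

Conference name: 2014 International Conference on Management and Engineering (CME 2014)

Conference location: Shanghai

Conference language: English

Pages: 1-6

Online publication date: 2014-05-24 (date first posted online on the Wanfang platform; not the paper's publication date)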
