Mining Contents in Web Page Using Cosine Similarity

Abstract:

Web pages typically contain a large amount of information that is not part of their main content, e.g., banner ads, navigation bars, copyright and privacy notices, and advertisements unrelated to the main content (the relevant information). This paper proposes an algorithm that extracts the main content from web documents. The algorithm is based on a Content Structure Tree (CST). First, the proposed system uses an HTML parser to construct a DOM (Document Object Model) tree, from which it builds the CST, a structure that makes it easy to separate the main content blocks from the other blocks. The system then applies the cosine similarity measure to evaluate which parts of the CST are less important and which are more important to the page. Using the similarity values, the system ranks the documents and extracts the top-ranked documents as more relevant to the query.
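The pipeline described in the abstract (parse the HTML, split it into content blocks, score each block against a query with cosine similarity, rank the results) can be illustrated with a minimal sketch. This record does not give the paper's CST construction, so the sketch substitutes a flat list of block-level text segments for the CST and uses raw term counts as vector weights. The names `BlockCollector`, `term_vector`, `cosine_similarity`, and `rank_blocks`, as well as the choice of block tags, are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks. The paper's CST
# is a tree; a flat list of blocks is used here only as a simple stand-in.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the visible text between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| |b|); returns 0.0 for an empty vector."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_blocks(html, query):
    """Return (score, block_text) pairs sorted by descending similarity."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    q = term_vector(query)
    scored = [(cosine_similarity(term_vector(b), q), b) for b in collector.blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_blocks(page_html, "cosine similarity")` returns every block with its score. Navigation bars and copyright notices that share no terms with the query score 0.0 and fall to the bottom of the ranking, which is the separation effect the abstract describes.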

Keywords: DOM tree; CST tree; cosine similarity

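Author: Swe Swe Nyein

Affiliation: University of Computer Studies, Mandalay, Mandalay, Myanmar

Conference type: International conference

Conference: 2011 3rd IEEE International Conference on Computer Research and Development (ICCRD 2011)

Conference location: Shanghai

Conference language: English

Pages: 472-475

Online publication date: 2011-03-11 (date first posted on the Wanfang platform; not the paper's publication date)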
