EXPLORING WIKIPEDIA AND QUERY LOGS ABILITY FOR TEXT FEATURE REPRESENTATION

Abstract:

The rapid growth of internet technology requires better management of web page content. Much text mining research has been conducted in areas such as text categorization, information retrieval, and text clustering. When machine learning methods or statistical models are applied to data at such a large scale, the first problem to solve is how to represent a text document in a form that computers can handle. Traditionally, single words are employed as features in the Vector Space Model, and together they make up the feature space for all text documents. This single-word representation assumes word independence and does not consider the relations between words, which may cause information loss. This paper proposes Wiki-Query segmented features for text classification, in the hope of making better use of the text information. The experimental results show that a much better F1 value is achieved than with the classical single-word text representation. This suggests that Wikipedia and query-log segmented features can better represent a text document.
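The contrast between single-word features and segmented features can be illustrated with a minimal sketch. It is not the authors' method: the phrase set PHRASES is a hypothetical stand-in for entries drawn from Wikipedia titles or query logs, the function names are invented for illustration, and greedy longest-match segmentation is an assumption made here.

```python
from collections import Counter

# Hypothetical phrase list; in practice it might come from Wikipedia titles
# or query-log entries (an assumption, not specified in the abstract).
PHRASES = {("machine", "learning"), ("information", "retrieval")}
MAX_LEN = max(len(p) for p in PHRASES)


def single_word_features(text):
    """Classical bag-of-words: every token is an independent feature."""
    return Counter(text.lower().split())


def segmented_features(text, phrases=PHRASES, max_len=MAX_LEN):
    """Greedy longest-match segmentation: known multi-word phrases
    become single features; all other tokens stay as single words."""
    tokens = text.lower().split()
    feats = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                feats[" ".join(candidate)] += 1
                i += n
                break
        else:
            # No phrase starts here; fall back to the single word.
            feats[tokens[i]] += 1
            i += 1
    return feats


doc = "machine learning improves information retrieval"
words = single_word_features(doc)
segments = segmented_features(doc)
```

Here the single-word representation treats "machine" and "learning" as unrelated features, while the segmented representation keeps "machine learning" as one feature, which is the kind of word relation the abstract says the single-word model discards.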

Keywords: Text feature representation; Word-based model; Wikipedia (Wiki); Query log

Authors: BING LI, QING-CAI CHEN, DANIEL S. YEUNG, WING W.Y. NG, XIAO-LONG WANG

Affiliation: Media and Life Science Computing Laboratory, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China

Conference type: International conference

Conference name: 2007 International Conference on Machine Learning and Cybernetics (IEEE 6th International Conference on Machine Learning and Cybernetics)

Conference location: Hong Kong

Conference language: English

Pages: 3343-3348

Online publication date: 2007-08-19 (date first posted on the Wanfang platform; does not represent the paper's publication date)
