Conference Topic

Can Chinese Web Pages be Classified with English Data Source?

As the World Wide Web in China grows rapidly, mining knowledge from Chinese Web pages becomes more and more important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, labeled Chinese data is still lacking. However, there are relatively sufficient labeled English Web pages. These labeled data, though in a different linguistic representation, share a substantial amount of semantic information with Chinese ones, and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages to English. Then, all the Web pages, including Chinese and English ones, are encoded through an information bottleneck which allows only limited information to pass. Therefore, in order to retain as much useful information as possible, the common part between Chinese and English Web pages tends to be encoded to the same code (i.e., class label), which makes the cross-language classification accurate. We evaluated our approach using Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly improves on several existing supervised and semi-supervised classifiers.
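The encoding step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm; it is a simplified, assumption-laden illustration of the general idea that documents sharing a word distribution end up under the same code. It assumes the Chinese pages have already been translated and tokenized, uses add-one smoothing, and assigns each translated page to the class whose pooled word distribution is closest in KL divergence. The names `word_dist`, `kl`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def word_dist(tokens, vocab, alpha=1.0):
    """Smoothed word distribution over a fixed vocabulary (add-alpha smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl(p, q):
    """KL divergence D(p || q); both are dicts over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def classify(labeled_en, translated_zh, n_iter=5):
    """Assign each translated page to the class code with the closest word distribution.

    labeled_en:    list of (tokens, label) pairs for labeled English pages
    translated_zh: list of token lists for Chinese pages translated to English
    (Hypothetical interface; a sketch, not the method evaluated in the paper.)
    """
    vocab = sorted(
        {w for toks, _ in labeled_en for w in toks}
        | {w for toks in translated_zh for w in toks}
    )
    assign = [None] * len(translated_zh)
    for _ in range(n_iter):
        # Pool tokens per class: labeled English pages plus currently assigned pages.
        pooled = {}
        for toks, label in labeled_en:
            pooled.setdefault(label, []).extend(toks)
        for toks, label in zip(translated_zh, assign):
            if label is not None:
                pooled[label].extend(toks)
        class_dist = {c: word_dist(t, vocab) for c, t in pooled.items()}
        # Reassign each page to the class that loses the least information.
        new_assign = [
            min(class_dist, key=lambda c: kl(word_dist(toks, vocab), class_dist[c]))
            for toks in translated_zh
        ]
        if new_assign == assign:
            break
        assign = new_assign
    return assign


labeled_en = [
    (["football", "goal", "team"], "sports"),
    (["stock", "market", "bank"], "finance"),
]
translated_zh = [["team", "goal", "match"], ["bank", "loan", "market"]]
labels = classify(labeled_en, translated_zh)
print(labels)
```

In this toy setup the labeled English pages anchor each class, and the translated pages are pulled toward whichever class shares the most vocabulary with them, which mirrors the intuition in the abstract that the common part of the two corpora is encoded to the same class label.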

Cross-Language Classification; Information Bottleneck

Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, Yong Yu

Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China; Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

International conference

The 17th International World Wide Web Conference (WWW08)

Beijing

English

2008-04-21 (date first made available online on the Wanfang platform; not the paper's publication date)