Cross Language Information Retrieval Based On LDA

Abstract:

This paper proposes an LDA-based cross-language retrieval model that does not rely on word-by-word translation of queries or documents. Instead, a parallel corpus is used to estimate a cross-language LDA (Latent Dirichlet Allocation) model. Given a parallel corpus in two languages, English and Chinese, we assume that a topic variable Z in LDA can generate both English tokens and Chinese tokens. The model can therefore be extended to multi-language information retrieval, provided that a multilingual parallel corpus is available. The proposed model was compared on CNKI datasets with three popular retrieval models: an LDA-based monolingual document model, a monolingual TF-IDF retrieval model, and a cross-lingual Latent Semantic Indexing retrieval model. Experimental results showed that the proposed model was effective and achieved good retrieval performance.
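The shared-topic assumption in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each English-Chinese document pair is merged into one pseudo-document, that tokens carry a language prefix (`en:` / `zh:`), and that a plain collapsed Gibbs sampler fits the topics. The function names `merge_parallel` and `gibbs_lda`, the hyperparameters, and the toy corpus are all hypothetical.

```python
import random


def merge_parallel(pairs):
    """Merge each (english_tokens, chinese_tokens) pair into one pseudo-document.

    Tokens get a language prefix so the two vocabularies stay distinct
    while sharing the same per-document topic mixture (an assumption of
    this sketch, not a detail taken from the paper).
    """
    return [["en:" + w for w in en] + ["zh:" + w for w in zh] for en, zh in pairs]


def gibbs_lda(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-topic word distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]      # topic counts per document
    topic_word = [dict() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                    # token counts per topic
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1

    # Smoothed estimate of p(word | topic) over the joint bilingual vocabulary.
    return [
        {w: (topic_word[k].get(w, 0) + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]


# Toy parallel corpus (hypothetical data, for illustration only).
pairs = [
    (["bank", "loan"], ["银行", "贷款"]),
    (["gene", "cell"], ["基因", "细胞"]),
]
docs = merge_parallel(pairs)
phi = gibbs_lda(docs, num_topics=2)
```

Each entry of `phi` is a distribution over the joint vocabulary, so a single topic assigns probability to both English and Chinese tokens. Under this sketch, that shared distribution is what would let a query in one language be matched against documents in the other without word-by-word translation.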

Keywords: LDA; topic model; cross-language information retrieval

Authors: Ai Wang, YaoDong Li, Wei Wang

Affiliation: Key Laboratory of Complex System and Intelligence Science, Institute of Automation, Chinese Academy of Sciences

Conference type: International conference

Conference name: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems

Conference location: Shanghai

Conference language: English

Pages: 2300-2305

Online publication date: 2009-11-20 (date first posted on the Wanfang platform; not the paper's publication date)
