Smoothing LDA Model for Text Categorization

Abstract:

Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing latent-variable priors for the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
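For readers unfamiliar with the two smoothing methods named in the abstract, the sketch below shows their classical forms on a plain unigram word distribution. It is only an illustration under stated assumptions: the function names, the toy vocabulary, and the parameters `alpha` and `lam` are hypothetical, and the code does not reproduce the paper's data-driven procedure, which applies smoothing through LDA inference.

```python
from collections import Counter


def laplace_smoothing(counts, vocab, alpha=1.0):
    """Additive (Laplace) smoothing: add alpha pseudo-counts to every word."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}


def jelinek_mercer_smoothing(counts, background, lam=0.5):
    """Jelinek-Mercer smoothing: linear interpolation between the
    maximum-likelihood estimate and a background distribution."""
    total = sum(counts.values())
    smoothed = {}
    for w, p_bg in background.items():
        p_ml = counts.get(w, 0) / total if total else 0.0
        smoothed[w] = (1.0 - lam) * p_ml + lam * p_bg
    return smoothed


# Toy data (hypothetical): a four-word vocabulary and a three-token document.
vocab = ["topic", "model", "text", "prior"]
doc_counts = Counter(["topic", "model", "topic"])
background = {w: 1.0 / len(vocab) for w in vocab}  # uniform background

p_laplace = laplace_smoothing(doc_counts, vocab, alpha=1.0)
p_jm = jelinek_mercer_smoothing(doc_counts, background, lam=0.5)
```

In both cases, words unseen in the document ("text" and "prior" here) receive non-zero probability, which is the purpose of smoothing.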

Keywords: Text Categorization; Latent Dirichlet Allocation; Smoothing; Graphical Model

Authors: Wenbo Li, Le Sun, Yuanyong Feng, Dakun Zhang

Affiliation: Institute of Software, Chinese Academy of Sciences, No. 4, Zhong Guan Cun South 4th Street, Hai Dian, 1001

Conference type: International conference

Conference name: 4th Asia Information Retrieval Symposium (AIRS 2008)

Conference location: Harbin

Conference language: English

Pages: 83-94

Online publication date: 2008-01-16 (date first posted on the Wanfang platform; not the paper's publication date)
