Smoothing LDA Model for Text Categorization
Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing latent-variable priors for the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
Text Categorization; Latent Dirichlet Allocation; Smoothing; Graphical Model
Wenbo Li, Le Sun, Yuanyong Feng, Dakun Zhang
Institute of Software, Chinese Academy of Sciences, No.4, Zhong Guan Cun South 4th Street, Hai Dian, 1001
International conference
4th Asia Information Retrieval Symposium (AIRS 2008)
Harbin
English
83-94
2008-01-16 (date first posted online on the Wanfang platform; not the paper's publication date)