Conference Topic

Smoothing LDA Model for Text Categorization

Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing latent-variable priors for the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.

Text Categorization; Latent Dirichlet Allocation; Smoothing; Graphical Model

Wenbo Li, Le Sun, Yuanyong Feng, Dakun Zhang

Institute of Software, Chinese Academy of Sciences, No.4, Zhong Guan Cun South 4th Street, Hai Dian, 1001

International conference

4th Asia Information Retrieval Symposium (AIRS 2008)

Harbin

English

83-94

2008-01-16 (date first posted online on the Wanfang platform; does not represent the paper's publication date)