Web Text Categorization for Large-scale Corpus
A corpus is a collection of language materials stored on computers that can be searched, queried, and analyzed by computer on behalf of enterprise decision-makers. Automated text categorization has been studied extensively, and various techniques for document categorization have been proposed. However, Chinese corpora are currently scarce, and corpora for Chinese text categorization are especially rare. In addition, most experimental prototypes, built to evaluate different techniques, are restricted when applied to the heterogeneous, autonomous, dynamic, and distributed Internet environment. This paper proposes and implements an incremental learning algorithm for Chinese text categorization on a large-scale corpus. An approach based on Support Vector Machines (SVMs) for web text mining in large-scale systems on GBODSS is developed to support enterprise decision making. Experimental results show that the approach achieves good classification accuracy through incremental learning and that the speedup in computation time is nearly superlinear.
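The abstract does not describe the incremental algorithm itself, so the following is only a minimal sketch of the general idea of incremental SVM learning, not the authors' method. It assumes a linear SVM updated with hinge-loss subgradient steps, whitespace-segmented input (Chinese text would first need word segmentation), and two hypothetical categories labeled +1 and -1. The class name `IncrementalLinearSVM`, the helper `featurize`, and all parameter values are illustrative assumptions; the grid-based distribution on GBODSS is not modeled.

```python
def featurize(text):
    """Bag-of-words counts; assumes the text is already segmented by spaces."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0.0) + 1.0
    return counts


class IncrementalLinearSVM:
    """Linear SVM updated online with hinge-loss subgradient steps (illustrative)."""

    def __init__(self, learning_rate=0.1, reg=0.01):
        self.learning_rate = learning_rate  # assumed step size
        self.reg = reg                      # assumed L2 regularization strength
        self.weights = {}                   # sparse weight vector: token -> weight

    def score(self, features):
        return sum(self.weights.get(k, 0.0) * v for k, v in features.items())

    def partial_fit(self, texts, labels, epochs=5):
        """Update the model on a new batch without revisiting earlier batches."""
        for _ in range(epochs):
            for text, y in zip(texts, labels):
                x = featurize(text)
                # Hinge loss is active only when the margin is below 1.
                if y * self.score(x) < 1.0:
                    for k, v in x.items():
                        self.weights[k] = self.weights.get(k, 0.0) + self.learning_rate * y * v
                # L2 regularization: shrink every weight slightly after each example.
                shrink = 1.0 - self.learning_rate * self.reg
                for k in self.weights:
                    self.weights[k] *= shrink
        return self

    def predict(self, text):
        return 1 if self.score(featurize(text)) >= 0.0 else -1


# Hypothetical toy data: +1 = sports, -1 = finance.
model = IncrementalLinearSVM()
model.partial_fit(
    ["football match goal", "stock market shares", "team wins match", "bank raises interest"],
    [1, -1, 1, -1],
)
# A later batch arrives; only the new documents are processed.
model.partial_fit(["basketball team scores", "bond market falls"], [1, -1])
```

Calling `partial_fit` again on each new batch keeps the learned weights and adjusts them, so earlier documents do not have to be stored or reprocessed. A kernel SVM would instead typically carry its support vectors forward and retrain on them together with the new batch.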
grid technology; GBODSS; large-scale corpus; Chinese text categorization
Zhijuan Jia, Jianbo Mu
School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China; Institute of Software Science, Zhengzhou Normal University, Zhengzhou, China
International conference
Taiyuan
English
188-191
2010-10-22 (date first made available online on the Wanfang platform; not the paper's publication date)