Web Text Categorization for Large-scale Corpus

Abstract:

A corpus is a set of language materials that are stored in computers and can be searched, queried, and analyzed by computer for enterprise decision-makers. Automated text categorization has been studied extensively, and various techniques for document categorization have been proposed. However, Chinese corpora are currently scarce, and corpora for Chinese text categorization are especially rare. Moreover, most experimental prototypes, built to evaluate different techniques, have been restricted to the heterogeneous, autonomic, dynamic, and distributed Internet environment. This paper proposes and implements an incremental learning algorithm for Chinese text categorization on a large-scale corpus. In this study, an approach based on Support Vector Machines (SVMs) for web text mining of large-scale systems on GBODSS is developed to support enterprise decision making. Experimental results show that the approach achieves good classification accuracy through incremental learning and that the speedup in computation time is almost superlinear.

Keywords: grid technology; GBODSS; large-scale corpus; Chinese text categorization

Authors: Zhijuan Jia, Jianbo Mu

Affiliations: School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China; Institute of Software Science, Zhengzhou Normal University, Zhengzhou, China

Conference type: International conference

Conference name: The 2010 International Conference on Computer Application and System Modeling (ICCASM 2010)

Conference location: Taiyuan

Conference language: English

Pages: 188-191

Online publication date: 2010-10-22 (date first posted on the Wanfang platform; not the paper's publication date)
