Conference topic

Research on Chinese segmentation algorithm based on Hadoop cloud platform

  IKAnalyzer (IK) and ICTCLAS (IC) are popular Chinese word segmentation algorithms that play an important role in processing text data in a stand-alone environment. In this paper, we compare the performance of the IK and IC algorithms through theory and experiments on the problem of segmenting massive Chinese text. We present an optimized solution that runs on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework to process large data sets in parallel. The experimental results indicate that the optimized IC and IK algorithms perform favorably on massive Chinese text segmentation problems. In addition, to present the processed data set more clearly and directly, we introduce an inverted descending order of the segmented words by word frequency. This comparative study of the two Chinese segmentation algorithms on the Hadoop platform provides strong support for the efficient processing of massive Chinese-language information.
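The pipeline described in the abstract (segment, count in parallel, then list words by descending frequency) can be illustrated with a minimal single-process sketch. This is not the paper's implementation: `segment` is a hypothetical whitespace splitter standing in for IKAnalyzer or ICTCLAS, the map and reduce phases run in plain Python rather than on Hadoop, and the alphabetical tie-break is an assumption added for deterministic output.

```python
from collections import defaultdict
from itertools import chain


def segment(line):
    """Hypothetical placeholder segmenter: splits on whitespace.

    A real job would call IKAnalyzer or ICTCLAS here instead.
    """
    return line.split()


def map_phase(line):
    # Emit one (word, 1) pair per segmented token, as a mapper would.
    return [(word, 1) for word in segment(line)]


def reduce_phase(pairs):
    # Sum the counts for each word, as a reducer would do per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def sort_by_frequency(counts):
    # Descending by count; ties broken by word (an assumption, for determinism).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def word_frequencies(lines):
    # Chain the mapper output for all lines, reduce it, then sort.
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return sort_by_frequency(reduce_phase(pairs))
```

Calling `word_frequencies` on a list of pre-spaced lines returns `(word, count)` tuples with the most frequent word first. On a real cluster the sort would typically be a second MapReduce job keyed on the count.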

Chinese word segmentation; ICTCLAS; IKAnalyzer; Inverted descending order; HDFS; MapReduce; Hadoop

Chen Hong

Computer Science School of Wuhan Donghu University, 430212, China

International conference

2015 Information Technology and Mechatronics Engineering Conference (ITOEC 2015) (2015信息技术与机电一体化国际会议)

Chongqing

English

134-138

2015-03-28 (date first posted online on the Wanfang platform; not the paper's publication date)