Meaningful String Extraction Based on Spectral Clustering for Sensitive Document
Meaningful string extraction plays an important role in discovering social hot topics and in automatically expanding keyword lists for sensitive-document filtering. We propose a meaningful string extraction algorithm based on spectral clustering that improves the efficiency of extraction and does not depend on POS tagging. The main idea is to segment the documents, cluster them with spectral clustering, compute the TF-IDF weight of each feature term, and then extract the first L feature terms with the highest weights in each document cluster as meaningful strings. Experiments show that the accuracy of our method is 7% higher than that of the k-means method on irregular corpora, because spectral clustering is robust to the distribution of the data.
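The pipeline described in the abstract (segment, cluster, weight, take the top L terms) can be sketched as follows. This is only an illustrative sketch under several assumptions that the abstract does not confirm: documents are already segmented into tokens, the affinity between documents is the cosine similarity of their TF-IDF vectors, there are exactly two clusters (split by the sign of the second eigenvector of the normalized affinity matrix, found here by plain power iteration), and the cluster-level weight is the term frequency within the cluster times a smoothed document-level IDF. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return per-document {term: tf*idf} dicts and the shared idf table."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumption; the abstract does not give the formula).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: (c / len(d)) * idf[t] for t, c in Counter(d).items()} for d in docs]
    return vecs, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def spectral_bipartition(W, iters=200):
    """Two-way spectral split of a symmetric, non-negative affinity matrix W."""
    n = len(W)
    s = [math.sqrt(sum(row)) for row in W]           # sqrt of node degrees
    M = [[W[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]
    norm_s = math.sqrt(sum(x * x for x in s))
    u = [x / norm_s for x in s]                      # top eigenvector of M
    v = [float(i + 1) for i in range(n)]             # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, u))
        v = [a - proj * b for a, b in zip(v, u)]     # deflate the top eigenvector
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return [0 if x >= 0 else 1 for x in v]           # sign of 2nd eigenvector

def top_terms_per_cluster(docs, labels, idf, L):
    """Rank terms in each cluster by (cluster term frequency) * idf; keep top L."""
    result = {}
    for c in sorted(set(labels)):
        tf = Counter(t for d, lab in zip(docs, labels) if lab == c for t in d)
        total = sum(tf.values())
        scores = {t: (k / total) * idf[t] for t, k in tf.items()}
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[c] = [t for t, _ in ranked[:L]]
    return result

def extract_meaningful_strings(docs, L):
    vecs, idf = tfidf_vectors(docs)
    W = [[cosine(a, b) for b in vecs] for a in vecs]
    labels = spectral_bipartition(W)
    return labels, top_terms_per_cluster(docs, labels, idf, L)

# Toy corpus of non-empty, pre-segmented documents (invented for illustration).
docs = [
    ["network", "attack", "virus", "network"],
    ["virus", "attack", "firewall"],
    ["network", "firewall", "attack"],
    ["football", "match", "goal"],
    ["goal", "match", "league", "goal"],
    ["football", "league", "match"],
]
labels, top = extract_meaningful_strings(docs, L=3)
print(labels)
print(top)
```

For more than two clusters, a fuller implementation would typically take the leading k eigenvectors from a library eigensolver and run k-means on the resulting embedding; the sign split above is just the simplest case that keeps the sketch self-contained.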
meaningful string; spectral clustering; TF-IDF; k-means
Jie Chen, Hao Liao, Jianlong Tan, Jun Li
Beijing University of Posts and Telecommunication, Beijing Institute of Computing Technology Chinese Institute of Computing Technology Chinese Academy of Sciences, Beijing National Engineering Laborator
International conference
Taiyuan
English
624-628
2011-02-26 (date first posted online on the Wanfang platform; not the paper's publication date)