Text Clustering on Authorship Attribution Based on the Features of Punctuations Usage
This paper proposes a method for extracting the writing characteristics of authors from their usage of punctuation marks. Using 200 articles by five well-known modern writers, the text clustering performance of the proposed method is compared with that of the character bigram method. The analysis also covers the performance of Euclidean distance, cosine distance, and Kullback-Leibler divergence (KLD) as distance measures in text clustering. The results show that: (1) the proposed method not only has low dimensionality but also outperforms the character bigram method; (2) KLD has clear advantages over Euclidean distance and cosine distance, and the F1 value obtained with Ward hierarchical clustering on KLD distances reaches 96% to 99%.
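The comparison described in the abstract can be illustrated with a minimal sketch, given here under stated assumptions rather than as the authors' implementation. The punctuation inventory `PUNCTUATION`, the helper names, the additive smoothing constant, and the symmetrised form of KLD are all hypothetical choices; the record does not specify the actual feature set or the KLD variant used in the paper.

```python
import math
from collections import Counter

# Hypothetical punctuation inventory; the paper's actual feature set is not given in this record.
PUNCTUATION = [",", ".", ";", ":", "!", "?"]


def punctuation_profile(text, marks=PUNCTUATION, smoothing=1e-6):
    """Relative frequency of each mark, smoothed (an assumption) so no entry is zero."""
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    raw = [(counts[m] / total if total else 0.0) + smoothing for m in marks]
    norm = sum(raw)
    return [v / norm for v in raw]


def euclidean(p, q):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """One minus the cosine similarity of two profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def symmetric_kld(p, q):
    """Symmetrised KL divergence, 0.5 * (KL(p||q) + KL(q||p)); the symmetrisation is an assumption."""
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Two toy texts with clearly different punctuation habits.
profile_a = punctuation_profile("Well, yes, indeed, it is so.")
profile_b = punctuation_profile("Stop! Why? Go now! Really?")
print(euclidean(profile_a, profile_b))
print(cosine_distance(profile_a, profile_b))
print(symmetric_kld(profile_a, profile_b))
```

A pairwise matrix of such distances over all articles could then be passed to a hierarchical clustering routine; the smoothing step matters mainly for KLD, which is undefined when a mark has zero probability in one profile.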
authorship attribution; usage of punctuations; text clustering; distance; bigram of characters
Jin Mingzhe; Jiang Minghu
Department and Graduate School of Culture and Information Science, Doshisha Univ., 610-321, Japan; Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua Univ., Beijing
International conference
2012 IEEE 11th International Conference on Signal Processing
Beijing
English
2175-2178
2012-10-21 (date first made available online on the Wanfang platform; not the paper's publication date)