Conference topic

Text Clustering on Authorship Attribution Based on the Features of Punctuations Usage

  This paper proposes a method for extracting the writing characteristics of authors from their use of punctuation marks. Using 200 articles by five well-known modern writers, the text clustering performance of the proposed method is compared with that of the character bigram method. The analysis also compares Euclidean distance, cosine distance, and Kullback-Leibler divergence (KLD) as distance measures for text clustering. The results show that: (1) the proposed method not only has low dimensionality but also outperforms the bigram method; (2) KLD has clear advantages over Euclidean distance and cosine distance, and the F1 score of Ward hierarchical clustering with KLD reaches 96% to 99%.
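The three distance measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the punctuation set `PUNCT`, the smoothing constant `eps`, the helper names, and the use of a symmetrized KLD are all assumptions made here for illustration, since the record does not give the paper's exact feature definition.

```python
import math
from collections import Counter

# Hypothetical punctuation set; the paper's actual feature set is not given here.
PUNCT = "，。、；：？！"

def punct_profile(text, marks=PUNCT, eps=1e-6):
    """Relative frequency of each punctuation mark, with additive smoothing
    (an assumption) so that no component is zero and KLD stays finite."""
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    return [(counts[m] + eps) / (total + eps * len(marks)) for m in marks]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def symmetric_kld(p, q):
    """Average of KL(p||q) and KL(q||p); symmetrizing is an assumption,
    chosen so the value can serve as a distance between two texts."""
    return 0.5 * sum(a * math.log(a / b) + b * math.log(b / a)
                     for a, b in zip(p, q))
```

A pairwise matrix of such distances over all texts could then be passed to a hierarchical clustering routine; how the paper combines KLD with Ward linkage is not stated in this record.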

authorship attribution; usage of punctuations; text clustering; distance; bigram of characters

Jin Mingzhe; Jiang Minghu

Department and Graduate School of Culture and Information Science, Doshisha Univ., 610-321, Japan; Lab of Computational Linguistics, School of Humanities and Social Sciences, Tsinghua Univ., Beijing

International conference

2012 IEEE 11th International Conference on Signal Processing

Beijing

English

2175-2178

2012-10-21 (date first posted online on the Wanfang platform; not the paper's publication date)