CLTS: A New Chinese Long Text Summarization Dataset

Abstract:

We present CLTS, a Chinese long text summarization dataset, to address the scarcity of large-scale, high-quality datasets in automatic summarization, which limits further research. To the best of our knowledge, it is the first long text summarization dataset in Chinese. Extracted from the Chinese news website ThePaper.cn (https://www.thepaper.cn/), the corpus contains more than 180,000 long Chinese articles with corresponding summaries written by professional editors and authors. The CLTS dataset is available for download at https://github.com/lxj5957/CLTS-Dataset. We train and evaluate several existing methods on CLTS to verify the utility and challenges of the dataset, and the results show that the corpus is useful for establishing baselines that support further research on automatic text summarization.

Keywords: Dataset resources; Automatic text summarization

Authors: Xiaojun Liu, Chuang Zhang, Xiaojun Chen, Yanan Cao, Jinpeng Li

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Secur Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

Conference type: International conference

Conference name: 9th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2020)

Conference location: Zhengzhou

Conference language: English

Pages: 531-542

Online publication date: 2020-10-14 (date first posted on the Wanfang platform; not the paper's publication date)
