MULTIPLE FEATURES FUSION METHOD FOR IDENTIFYING TEXT TOPIC BOUNDARIES

Abstract:

In general, a document can be regarded as a sequence of coherent units called discourse segments. Discovering the segment boundaries is an important task for many natural language processing applications. In this paper, we propose a new method for identifying topic boundaries in Chinese text, based on the fusion of multiple features. Our approach first extracts multiple features of topic shift from the text. For each feature, we apply a corresponding F-dotplotting model to calculate the boundary values between neighboring sentences. The useful features among these cues are then automatically selected and combined to determine topic boundaries, using a statistical method based on logistic regression analysis. The experimental results show that the F-dotplotting method is more effective than the common dotplotting method, and that the multiple-features fusion method based on the logistic regression model can effectively improve Chinese text topic segmentation performance.
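
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each gap between neighboring sentences already has one boundary score per feature, and it combines those scores with a logistic function. The function names, the three example features, the weights, and the bias are all hypothetical; the record does not give the paper's actual features, F-dotplotting formulas, or fitted coefficients, and in practice the weights would be estimated from labeled data rather than fixed by hand.

```python
import math


def boundary_probability(scores, weights, bias):
    """Fuse per-feature boundary scores for one sentence gap with a logistic model."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def predict_boundaries(gap_scores, weights, bias, threshold=0.5):
    """Return indices of sentence gaps whose fused probability reaches the threshold."""
    return [
        i
        for i, scores in enumerate(gap_scores)
        if boundary_probability(scores, weights, bias) >= threshold
    ]


# Hypothetical boundary scores: one row per sentence gap, one column per feature.
gap_scores = [
    [0.1, 0.2, 0.0],
    [0.9, 0.8, 1.0],
    [0.2, 0.1, 0.0],
    [0.7, 0.9, 1.0],
]
weights = [2.0, 2.0, 1.0]  # assumed values, not taken from the paper
bias = -2.5                # assumed value, not taken from the paper

boundaries = predict_boundaries(gap_scores, weights, bias)
print(boundaries)
```

A feature whose fitted weight is close to zero contributes little to the fused probability, which is one simple way a regression model can express the feature selection the abstract mentions.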

Keywords: Topic boundaries identification; multiple features fusion; logistic regression model; F-dotplotting method

Authors: YONG-DONG XU, GUANG-RI QUAN, YA-DONG WANG, ZHI-MING XU

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology (Wei Hai), WeiHai 264209; School of Computer Science and Technology, Harbin Institute of Technology (Wei Hai), WeiHai 264209; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Conference type: International conference

Conference name: 2008 International Conference on Machine Learning and Cybernetics

Conference location: Kunming

Conference language: English

Pages: 2950-2956

Online publication date: 2008-07-12 (date first posted on the Wanfang platform; not the paper's publication date)
