Automatic Extraction and Filtration of Multiword Units

Abstract:

We use five statistical models, namely the Dice coefficient (Dice), the Φ² coefficient (Φ2), the log-likelihood ratio (LLR), symmetrical conditional probability (SCP), and normalized expectation (NE), to extract multiword unit candidates from a patent corpus. Comparing the results of the five models, we find that NE yields the most candidates and Dice the fewest, while Dice has the highest precision and SCP the lowest. The candidates are then filtered using six strategies: stop words, a threshold, higher frequency, first stop words, last stop words, and context entropy. After filtration, NE still yields the most multiword units and Dice the fewest, and Dice still has the highest precision and SCP the lowest. Every filtration strategy helps to identify wrong or unreasonable multiword units and improves the precision of the extracted multiword units.
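The abstract does not give the formulas, thresholds, or stop-word list used in the paper. As a rough illustration only, the sketch below scores adjacent word pairs with the commonly cited count-based forms of two of the five measures, Dice = 2·f(x,y) / (f(x) + f(y)) and SCP = f(x,y)² / (f(x)·f(y)), and then applies a stop-word filter and a threshold filter. The function names, the restriction to bigrams, the sample sentence, the threshold value, and the stop-word set are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def score_bigrams(tokens):
    """Score every adjacent word pair with Dice and SCP (count-based forms).

    Hypothetical helper: the paper's exact formulations may differ.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in bigrams.items():
        f_x, f_y = unigrams[x], unigrams[y]
        scores[(x, y)] = {
            # Dice: twice the joint count over the sum of the word counts.
            "dice": 2 * f_xy / (f_x + f_y),
            # SCP: p(x,y)^2 / (p(x) p(y)); a shared corpus size N cancels,
            # so raw counts can be used directly.
            "scp": f_xy ** 2 / (f_x * f_y),
        }
    return scores


def filter_candidates(scores, measure, threshold, stop_words):
    """Keep candidates that contain no stop word and reach the threshold.

    Combines the first/last stop-word filters and the threshold filter
    in their simplest assumed form.
    """
    kept = []
    for (x, y), s in scores.items():
        if x in stop_words or y in stop_words:
            continue  # first or last word is a stop word
        if s[measure] >= threshold:
            kept.append((x, y))
    return kept


# Assumed toy input; a real run would use a tokenized patent corpus.
tokens = "the fuel cell stack of the fuel cell".split()
scores = score_bigrams(tokens)
candidates = filter_candidates(scores, "dice", 0.9, {"the", "of"})
```

In this toy input the pair ("the", "fuel") scores as high as ("fuel", "cell") on Dice, so only the stop-word filter separates them, which is the kind of wrong candidate the filtration step is meant to remove.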

Keywords: multiword unit; Dice; Φ2; SCP; NE; LLR; extract; filtrate

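Authors: Ying Liu, Zheng Tie

Affiliation: Department of Chinese Language and Literature, Tsinghua University, Beijing, China, 100084

Conference type: International conference

Conference name: 2011 Eighth International Conference on Fuzzy System and Knowledge Discovery (FSKD 2011)

Conference location: Shanghai

Conference language: English

Pages: 2651-2655

Online publication date: 2011-07-26 (date first posted on the Wanfang platform; not necessarily the paper's publication date)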
