Learning Domain Feature from Text Corpora

Abstract:

To improve the performance of automatic electronic document processing, this paper proposes the concept of a domain feature, defined as the set of terms that can represent the topics of a certain domain. It then presents a non-lexicon-based approach that automatically learns domain features from text corpora. The approach combines the length first segment algorithm with the domain feature possibility (DFP) algorithm. The former segments the domain foreground corpus and extracts words and phrases at a satisfactory recall rate, while the latter raises the precision of learning by comparing the different statistical properties that domain features show in foreground and background corpora. Experiments verify that, given appropriate foreground and background corpora, this approach significantly improves the efficiency of domain feature building and yields better results than manual building. The algorithms combined in this approach can be widely used in other research domains of knowledge management.
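The foreground/background comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the DFP formula, so the code below does not reproduce it: it swaps in a plain smoothed relative-frequency ratio, and the function name `relative_frequency_ratio`, the `smoothing` parameter, and the toy term lists are all hypothetical. It also assumes the candidate terms have already been extracted by some segmentation step.

```python
from collections import Counter


def relative_frequency_ratio(foreground_terms, background_terms, smoothing=1.0):
    """Score each foreground term by how much more frequent it is in the
    foreground corpus than in the background corpus.

    NOTE: this is an illustrative stand-in, not the paper's DFP algorithm.
    """
    fg = Counter(foreground_terms)
    bg = Counter(background_terms)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    # Shared vocabulary size, used for additive smoothing so that terms
    # absent from the background corpus do not cause division by zero.
    vocab = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        fg_rate = (count + smoothing) / (fg_total + smoothing * vocab)
        bg_rate = (bg[term] + smoothing) / (bg_total + smoothing * vocab)
        scores[term] = fg_rate / bg_rate
    return scores


if __name__ == "__main__":
    # Toy, made-up term lists standing in for segmented corpora.
    foreground = ["neural network", "the", "neural network", "gradient", "the"]
    background = ["the", "the", "of", "market", "the"]
    ranked = sorted(
        relative_frequency_ratio(foreground, background).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for term, score in ranked:
        print(term, round(score, 2))
```

Under this stand-in score, terms that are common everywhere fall below 1 while terms concentrated in the foreground corpus rise above it, which is the intuition behind using a background corpus to improve precision.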

Keywords: domain feature; length first segment; DFP analysis

Authors: Juan Yu, Yanzhong Dang

Affiliation: Institute of Systems Engineering, Dalian University of Technology, Dalian, P.R. China

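Conference type: International conference

Conference name: The 4th International Conference on Wireless Communications, Networking and Mobile Computing (IEEE)

Conference location: Dalian

Conference language: English

Pages: 1-4

Online publication date: 2008-10-12 (date first posted on the Wanfang platform; this does not represent the paper's publication date)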
