Protein-Protein Interaction Extraction from Bio-literature with Compact Features and Data Sampling Strategy

Abstract:

A large number of protein-protein interactions (PPIs) are buried in the massive body of biomedical articles published over the years, which has motivated the development of automatic PPI extraction methods. However, existing methods based on supervised machine learning still face two challenges: (1) the feature space they exploit is very sparse; and (2) the training data are imbalanced across the classes to be predicted. In this paper, we first construct rich and compact features to alleviate feature sparseness. With these features, our method outperforms baselines by up to 9.58% in F-score on the original AIMed corpus. Furthermore, we propose a data sampling strategy based on undersampling to address the class imbalance problem. To re-balance the data distribution, samples of the majority class are iteratively removed according to the prediction results. In this way, our method achieves a further 2.49% improvement in F-score on the original AIMed corpus.

Keywords: Protein-Protein Interaction Extraction; Class Imbalance; Feature Sparseness; Compact Features; Undersampling

Authors: Hongtao Zhang, Minglie Huang, Xiaoyan Zhu

Affiliation: State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Conference type: International conference

Conference name: 2011 4th International Conference on Biomedical Engineering and Informatics (BMEI 2011)

Conference location: Shanghai

Conference language: English

Pages: 1779-1783

Online publication date: 2011-10-15 (the date the record first went online on the Wanfang platform; not the paper's publication date)
