Conference Topic

Scholarly Information Extraction Is Going to Make a Quantum Leap with PubMed Central (PMC)® - But Moving from Abstracts to Full Texts Seems Harder than Expected

  With the increasing availability of complete full texts (journal articles), rather than their surrogates (titles, abstracts), as resources for text analytics, entirely new opportunities arise for information extraction and text mining from scholarly publications. Yet we gathered evidence that a range of problems arise in full-text processing when biomedical text analytics simply reuses existing NLP pipelines that were developed on abstracts rather than full texts. We conducted experiments with four different relation extraction engines, all of which were top performers in previous BioNLP Event Extraction Challenges. We found that abstract-trained engines lose up to 6.6% F-score points when run on full-text data. Hence, the reuse of existing abstract-based NLP software in a full-text scenario is considered harmful because of heavy performance losses. Given the current lack of annotated full-text resources to train on, our study quantifies the price paid for this shortcut.

Natural Language Processing; Information Storage and Retrieval; Information Extraction

Franz Matthies, Udo Hahn

Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena 07743, Germany

International conference

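16th World Congress on Medical and Health Informatics (MEDINFO2017), 2nd World Chinese-Language Forum on Medical and Health Informatics (WCHIS 2017), 15th National Congress on Medical Informatics (CMIA 2017)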

Suzhou

English

521-525

2017-08-21 (date first posted online on the Wanfang platform; not the paper's publication date)