RESEARCH ON THEMATIC WORD EXTRACTION BASED ON HIGH QUALITY DATA SOURCES ON THE WEB
Data source selection is one of the most important steps in domain thematic word extraction. Most previous work has focused on how to extract keywords from an existing corpus with good algorithms, while very little research has addressed how to find good data sources for collecting the text corpus. This paper studies how to use online web tools to identify high-quality data sources. Then, considering the characteristics of subject keywords, we propose an improved TF-IDF weight calculation formula for ranking keywords, and extract the domain keywords from the documents by recalculating the weights of candidate words with the improved method. Finally, taking the field of Chinese herbal medicine as an example, our results show that the proposed method achieves considerably higher accuracy and a higher recall rate at much lower cost.
High quality data source identification; Subject terms extraction; An improved TF-IDF algorithm
DONGHUA PAN; JUN SUN
Institute of Systems Engineering, Dalian University of Technology
International conference
Chengdu
English
1364-1370
2011-11-25 (date first posted online on the Wanfang platform; not the paper's publication date)