The Keyword Extraction of Chinese Medical Web Page Based on WFTF-IDF Algorithm

Abstract:

Web page keyword extraction is widely used in web text classification, text clustering, and information retrieval. However, keyword extraction for Chinese web pages still needs improvement and wider application, especially in the medical field. This paper proposes WF-TF-IDF, an improved TF-IDF algorithm, to extract keywords from Chinese medical web pages. The WF-TF-IDF algorithm considers three factors: word frequency in the title, word frequency in the description, and the distribution of words across categories in the corpus. We first preprocess the data, which includes web page denoising, regular expression processing, Chinese word segmentation, synonym replacement, and stop word filtering. We then extract keywords from the preprocessed result and filter out meaningless words among the extracted keywords according to their part of speech. The experimental results show that the WF-TF-IDF algorithm improves precision and recall by about 7% compared with the traditional TF-IDF algorithm.
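
The weighting idea described in the abstract can be illustrated with a minimal sketch. The record does not give the paper's formula, so everything specific here is an assumption for illustration and not the authors' method: the function names, the boosts `title_w` and `desc_w`, and the smoothed IDF are hypothetical, and the category-distribution factor is left out. Input is assumed to be already segmented and stop-word filtered.

```python
import math
from collections import Counter


def weighted_tfidf(body, title, desc, corpus, title_w=3.0, desc_w=2.0):
    """Score the terms of one pre-segmented page.

    body, title, desc: lists of tokens from the page body, title and description.
    corpus: list of token lists, one per document, used to compute IDF.
    title_w, desc_w: hypothetical extra weight for each title/description occurrence.
    """
    # Term frequency, boosted for occurrences in the title and description.
    counts = Counter(body)
    for term in title:
        counts[term] += title_w
    for term in desc:
        counts[term] += desc_w
    total = sum(counts.values())

    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF so that unseen terms do not divide by zero.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scores[term] = (count / total) * idf
    return scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With the boosts set to zero the function reduces to a plain smoothed TF-IDF, which makes it easy to compare the two rankings on the same page.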

Keywords: keyword extraction; TF-IDF; Chinese medical web page; WF-TF-IDF

Authors: Peng Sun, LiHua Wang, Qianchen Xia

Affiliations: Beihang University Software College, Beijing, China; Beihang University Computer Science, Beijing, China

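Conference type: International conference

Conference name: 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (9th edition)

Conference location: Nanjing

Conference language: English

Pages: 193-198

Online publication date: 2017-10-12 (date first posted on the Wanfang platform; not the paper's publication date)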
