Improving Word Embeddings for Low Frequency Words by Pseudo Contexts

This paper investigates the relation between word semantic density and word frequency. A word average similarity, computed from distributed representations, is defined as the measure of word semantic density. We find that the average similarities of low frequency words are always larger than those of high frequency words, and that when the frequency approaches around 400, the average similarity tends to stabilize. This finding holds under changes in the size of the training corpus, the dimension of the distributed representations, and the number of negative samples in the skip-gram model. It also holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which makes use of the context words of the semantic nearest neighbors of target words. Experimental results show our model achieves significant performance improvements in both word similarity and analogy tasks.
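The record only describes the average-similarity measure at a high level, so the following is a minimal sketch of one plausible reading: the average cosine similarity between a word's embedding and its k nearest neighbors, used as a proxy for semantic density. The neighborhood size k, the use of cosine similarity, and the function names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def average_similarity(word, embeddings, k=10):
    """Average cosine similarity between `word` and its k nearest neighbors.

    Assumed reading of the paper's "word average similarity":
    `embeddings` maps words to unit-normalized vectors, so the dot
    product equals cosine similarity. k=10 is an arbitrary choice.
    """
    v = embeddings[word]
    sims = sorted(
        (float(np.dot(v, u)) for w, u in embeddings.items() if w != word),
        reverse=True,
    )
    return sum(sims[:k]) / k

# Toy usage with random unit vectors (illustration only; real use would
# load skip-gram embeddings trained on a corpus).
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car", "tree", "house"]
emb = {w: (v := rng.normal(size=50)) / np.linalg.norm(v) for w in vocab}
print(average_similarity("cat", emb, k=3))
```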
Word Embedding; Low Frequency Word
Fang Li, Xiaojie Wang
School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
Domestic conference
The 16th China National Conference on Computational Linguistics (CCL 2017) and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD 2017)
Nanjing
English
1-11
2017-10-13 (date first posted on the Wanfang platform; does not represent the paper's publication date)