Conference Topic

Improving Word Embeddings for Low Frequency Words by Pseudo Contexts

This paper investigates the relation between word semantic density and word frequency. A word's average similarity, computed from distributed representations, is defined as the measure of its semantic density. We find that the average similarity of low-frequency words is always larger than that of high-frequency words, and that the average similarity stabilizes once the frequency reaches around 400. The finding holds under changes in the size of the training corpus, the dimension of the distributed representations, and the number of negative samples in the skip-gram model. It also holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which makes use of the context words of the semantic nearest neighbors of target words. Experimental results show that our model achieves significant performance improvements on both word similarity and analogy tasks.
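The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definition of average similarity, so the sketch assumes it is the mean cosine similarity between a word and its k nearest neighbors. The function names (`average_similarity`, `pseudo_contexts`), the parameter `k`, and the toy vectors and context lists are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, embeddings, k):
    """The k other words most similar to `word`, most similar first."""
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(target, embeddings[w]), reverse=True)
    return others[:k]

def average_similarity(word, embeddings, k=1):
    """ASSUMED density measure: mean cosine similarity to the k nearest neighbors."""
    target = embeddings[word]
    neighbors = nearest_neighbors(word, embeddings, k)
    return sum(cosine(target, embeddings[n]) for n in neighbors) / len(neighbors)

def pseudo_contexts(word, embeddings, contexts, k=1):
    """Borrow the observed context words of the k nearest neighbors of `word`."""
    borrowed = []
    for neighbor in nearest_neighbors(word, embeddings, k):
        borrowed.extend(contexts.get(neighbor, []))
    return borrowed

# Toy data (hypothetical): "kitten" plays the low-frequency word with no contexts.
embeddings = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
contexts = {"cat": ["purrs", "meows"], "car": ["drives"]}

kitten_density = average_similarity("kitten", embeddings, k=1)
kitten_pseudo = pseudo_contexts("kitten", embeddings, contexts, k=1)
```

In this toy setup, "kitten" borrows the contexts of its nearest neighbor "cat". In a skip-gram setting, such borrowed words could then be added as extra (target, context) training pairs for the low-frequency word. How the paper weights or samples them is not stated in the abstract.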

Word Embedding; Low Frequency Word

Fang Li, Xiaojie Wang

School of Computer, Beijing University of Posts and Telecommunications, Beijing, China

Domestic conference

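The 16th China National Conference on Computational Linguistics and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data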

Nanjing

English

1-11

2017-10-13 (date first posted online on the Wanfang platform; not the paper's publication date)