python - Gensim Word2vec: Semantic Similarity


I want to know the difference between gensim word2vec's two similarity measures: most_similar() and most_similar_cosmul(). I know that the first one works using cosine similarity of word vectors, while the other one uses the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. How does this affect the results? Which one gives semantic similarity? For example:

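# gensim < 4.0 API; newer releases use vector_size= and model.wv.most_similar(...)
from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])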

result : [('queen', 0.50882536), ...]

model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london']) 

result : [(u'iraq', 0.8488819003105164), ...]

From the Levy and Goldberg paper: if you are trying to find analogies (or are combining/comparing more than 2 word vectors), the first method (3CosAdd, eq. 3 of the paper) is more susceptible to being dominated by one comparison than the second method (3CosMul, eq. 4 of the paper).

For plain semantic similarity between 2 word vectors, this distinction doesn't apply.
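
For a single pair of words there is only one cosine, so there is nothing to combine. A minimal sketch of that pairwise measure in plain Python follows; the function name cosine_similarity and the three toy vectors are hypothetical and only illustrate the computation, they are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)  # close to 1
orthogonal = cosine_similarity(vec_a, vec_c)      # exactly 0
print(same_direction, orthogonal)
```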

For example, using the Google News vectors:

model.similarity('mosul','england')
0.10051745730111421

model.similarity('iraq','england')
0.14772211471143404

model.similarity('mosul','baghdad')
0.83855779792754492

model.similarity('iraq','baghdad')
0.67975755642668911

Now, Iraq is closer to England than Mosul is (both being countries), but both similarity values are small, ~0.1.

On the other hand, Mosul is more similar to Baghdad than Iraq is (geographical/cultural aspects), and both similarity values are of a higher order, ~0.7.

Now, for the analogy (england - london + baghdad = x):

3CosAdd, being a linear sum, allows one large similarity term to dominate the expression. It ignores that each term reflects a different aspect of similarity, and that the different aspects have different scales.

3CosMul, on the other hand, amplifies the differences between small quantities and reduces the differences between larger ones.

model.most_similar(positive=['baghdad', 'england'], negative=['london'])
(u'mosul', 0.5630180835723877)
(u'iraq', 0.5184929370880127)

model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
(u'mosul', 0.8537653088569641)
(u'iraq', 0.8507866263389587)
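
The two ways of combining cosines can be sketched in plain Python. This is a rough illustration of the additive and multiplicative objectives from the paper, not gensim's implementation. The helper names shift, cos_add and cos_mul are hypothetical. The cosines are the four Google News values quoted above, and because the thread does not report the similarities to 'london', the negative term is left out, so these numbers are not the full analogy scores.

```python
import math

def shift(c):
    """Map a cosine from [-1, 1] to [0, 1] so that products stay non-negative."""
    return (c + 1.0) / 2.0

def cos_add(positive, negative=()):
    """Additive combination: sum of positive cosines minus sum of negative ones."""
    return sum(positive) - sum(negative)

def cos_mul(positive, negative=(), eps=0.001):
    """Multiplicative combination: product of shifted positive cosines divided
    by the product of shifted negative cosines; eps guards against division by zero."""
    numerator = math.prod(shift(c) for c in positive)
    denominator = math.prod(shift(c) for c in negative) + eps
    return numerator / denominator

# Cosines to 'baghdad' and 'england', taken from the similarity calls above.
# The 'london' term is omitted (assumption: not available in the thread).
mosul = [0.83855779792754492, 0.10051745730111421]
iraq = [0.67975755642668911, 0.14772211471143404]

add_mosul, add_iraq = cos_add(mosul), cos_add(iraq)
mul_mosul, mul_iraq = cos_mul(mosul), cos_mul(iraq)

# Relative lead of 'mosul' over 'iraq' under each combination.
print("additive ratio:      ", add_mosul / add_iraq)
print("multiplicative ratio:", mul_mosul / mul_iraq)
```

On these partial scores the lead of 'mosul' over 'iraq' is smaller under the multiplicative form than under the additive one, because the large 'baghdad' cosine weighs less against the small 'england' cosine. That is the same direction as the narrowing gap in the most_similar and most_similar_cosmul results above.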
