Matching words and vectors in a gensim Word2Vec model

Patrick:

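I have had the gensim Word2Vec implementation compute some word embeddings for me. As far as I can tell, everything went quite well. Now I am clustering the resulting word vectors, hoping to get some semantic groupings.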

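As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. That is, if I have the embedding vector [x, y, z], I would like to find out which actual word this vector represents. I can get the words/vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a place where the two are explicitly matched.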

This is more complicated than I expected, and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

Matching words to the embedding vectors created by Word2Vec(): how do I do it?

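My approach:

After creating the model (code below*), I now want to match the index assigned to each word (during the build_vocab() phase) to the vector matrix output as model.syn0. Thus: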

for i in range(0, newmod.syn0.shape[0]):  # iterate over all words in the model
    print(i)
    word = [k for k in newmod.vocab if newmod.vocab[k].index == i]  # get the word out of the internal dictionary by its index
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    print(wordvector == newmod[word])  # testing: compare with the result of looking up the word in the model -- this prints True

Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?

*My code for creating the word vectors:

from gensim.models import Word2Vec

model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)

model.build_vocab(sentencefeeder(folderlist))  # sentencefeeder puts out sentences as lists of strings

model.save("newmodel")

I found this question, which is similar, but it has not really been answered.

Patrick:

So I found an easy way to do this, where nmodel is the name of your model.

import numpy as np

# zip the two lists containing words and vectors; list() keeps the pairs reusable on Python 3
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))

# the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if np.array_equal(i[1], vector)]  # compare each stored array with `vector` as a whole

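This is based on the gensim code. For older versions of gensim, you may need to drop the wv after the model (i.e. use nmodel.index2word and nmodel.syn0).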
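
Since the original goal was to see which words end up in each cluster, the same row-to-word correspondence can be used directly. The following is only a minimal sketch under stated assumptions: `index2word`, `vectors` and `labels` are small hypothetical stand-ins for `nmodel.wv.index2word`, `nmodel.wv.syn0` and the output of whatever clustering step was run on the vector matrix (one label per row). The helper name `words_by_cluster` is made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for nmodel.wv.index2word and nmodel.wv.syn0;
# row i of the matrix is assumed to belong to index2word[i].
index2word = ["king", "queen", "apple", "pear"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

# Hypothetical clustering output: one cluster label per row of the matrix.
labels = [0, 0, 1, 1]

# Word -> row index, so a word's vector can be found without scanning the list.
word2row = {word: row for row, word in enumerate(index2word)}


def words_by_cluster(index2word, labels):
    """Group words by cluster label, pairing word i with label i."""
    clusters = defaultdict(list)
    for word, label in zip(index2word, labels):
        clusters[label].append(word)
    return dict(clusters)


clusters = words_by_cluster(index2word, labels)
print(clusters)
print(vectors[word2row["apple"]])
```

This only works for vectors that are rows of the model's own matrix. For an arbitrary vector (a cluster centroid, say), gensim's `similar_by_vector` on the keyed vectors is probably the closer fit, since it returns the nearest words rather than requiring an exact match.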
