New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary #3157

Merged 39 commits on Jun 29, 2021
Changes from 1 commit
ab6fb90
Add KeyedVectors.vectors_for_all
Witiko May 25, 2021
98ed69d
Add examples for KeyedVectors.vectors_for_all
Witiko May 25, 2021
be1746b
Support Dictionary in KeyedVectors.vectors_for_all
Witiko May 28, 2021
d81df64
Don't sort keys in KeyedVectors.vectors_for_all, just deduplicate
Witiko May 28, 2021
ef8bea6
Use docstrings in imperative mode (PEP8)
Witiko May 28, 2021
d602018
Guard against KeyError in KeyedVectors.vectors_for_all
Witiko May 28, 2021
13a7ecd
Unit-test dictionary parameter of KeyedVectors.vectors_for_all
Witiko May 28, 2021
6a8c688
Order dictionary by decreasing cfs in KeyedVectors.vectors_for_all
Witiko May 28, 2021
9ebe808
Add allow_inference parameter to KeyedVectors.vectors_for_all
Witiko May 28, 2021
716dc32
Add copy_vecattrs parameter to KeyedVectors.vectors_for_all
Witiko May 28, 2021
77e1889
Move copy_vecattrs tests for KeyedVectors.vectors_for_all
Witiko May 28, 2021
330d5f7
Fix translation of term ids to terms in KeyedVectors.vectors_for_all
Witiko May 28, 2021
8fdda93
Fix a typo in KeyedVectors.vectors_for_all unit test
Witiko May 28, 2021
ba636a2
Do not make assumptions about fake counts in _add_word_to_kv
Witiko May 28, 2021
1a9ea9b
Document that KeyedVectors.vectors_for_all allows arbitrary keys
Witiko May 28, 2021
e5a9a31
Add notes about the behavior of KeyedVectors.vectors_for_all
Witiko May 28, 2021
5eebef0
Properly reference Dictionary in KeyedVectors.vectors_for_all docstring
Witiko May 28, 2021
26baf6d
Make deduplication in KeyedVectors.vectors_for_all a oneliner
Witiko May 31, 2021
98c070e
Remove an unnecessary temporary variable in KeyedVectors.vectors_for_all
Witiko May 31, 2021
8e4d0cf
Make deduplication in KeyedVectors.vectors_for_all a oneliner (cont.)
Witiko May 31, 2021
a4590c1
Add Dictionary.most_common
Witiko May 31, 2021
b14298b
Remove test_vectors_for_all_dictionary unit test
Witiko May 31, 2021
1cf9452
Remove a trailing bracket in an example
Witiko May 31, 2021
9c6f296
Fix unit tests for Dictionary.most_common
Witiko May 31, 2021
e78bfa3
Update an example for SparseTermSimilarityMatrix
Witiko May 31, 2021
32c14c5
Remove Gensim downloader from KeyedVectors.vectors_for_all example
Witiko Jun 22, 2021
9acbcba
Remove include_counts parameter from Dictionary.most_common
Witiko Jun 22, 2021
712ee61
Shorten the KeyedVectors.vectors_for_all example
Witiko Jun 22, 2021
b8625a5
Remove include_counts parameter from Dictionary.most_common (cont.)
Witiko Jun 22, 2021
4aacad2
Use pytest assertion syntax in unit tests
Witiko Jun 22, 2021
a86522c
Remove an unnecessary comment in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
7ea8337
Remove an unnecessary comment in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
f08c582
Remove an unnecessary variable in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
ebc276d
Make the creation of new vocab in KeyedVectors.vectors_for_all explicit
Witiko Jun 22, 2021
3bf7f33
Make AnnoyIndexer use the correct word-vectors in example
Witiko Jun 22, 2021
68b5fc1
Apply suggestions from code review
mpenkov Jun 29, 2021
52e5ee8
Apply suggestions from code review
mpenkov Jun 29, 2021
4dc3756
Update CHANGELOG.md
mpenkov Jun 29, 2021
d319144
Merge branch 'develop' into feature/vectors-for-all
mpenkov Jun 29, 2021
13 changes: 8 additions & 5 deletions gensim/models/keyedvectors.py
@@ -171,7 +171,7 @@
 import itertools
 import warnings
 from numbers import Integral
-from typing import Iterable
+from typing import Iterable, Union

 from numpy import (
     dot, float32 as REAL, double, array, zeros, vstack,
@@ -1696,8 +1696,8 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
             msg=f"merged {overlap_count} vectors into {self.vectors.shape} matrix from {fname}",
         )

-    def vectors_for_all(self, keys: Iterable) -> 'KeyedVectors':
-        """Produces vectors for all keys in a given iterable.
+    def vectors_for_all(self, keys: Union[Iterable, Dictionary]) -> 'KeyedVectors':
+        """Produces vectors for all given keys.

         Notes
         -----
@@ -1713,7 +1713,7 @@ def vectors_for_all(self, keys: Iterable) -> 'KeyedVectors':

         Parameters
         ----------
-        keys : iterable of str
+        keys : {iterable of str, Dictionary}
             The keys that will be vectorized.

         Returns
@@ -1722,7 +1722,10 @@ def vectors_for_all(self, keys: Iterable) -> 'KeyedVectors':
             Vectors for all the given keys.

         """
-        vocabulary = sorted(set(filter(lambda key: key in self, keys)))
+        if isinstance(keys, Dictionary):
+            vocabulary = keys.token2id
+        else:
+            vocabulary = sorted(set(filter(lambda key: key in self, keys)))
         vocab_size = len(vocabulary)
         datatype = self.vectors.dtype
         kv = KeyedVectors(self.vector_size, vocab_size, dtype=datatype)
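The branch this commit adds to vectors_for_all can be tried in isolation. The following is a minimal sketch, not the gensim implementation: FakeDictionary is a hypothetical stand-in for gensim's Dictionary that models only token2id, select_vocabulary is an illustrative name, and the `known` set stands in for the `key in self` membership test. It returns a list in both branches for easy comparison, whereas the diff assigns the token2id mapping directly.

```python
from typing import Dict, Iterable, List, Set, Union


class FakeDictionary:
    """Hypothetical stand-in for gensim's Dictionary; only token2id is modelled."""

    def __init__(self, documents: Iterable[Iterable[str]]) -> None:
        self.token2id: Dict[str, int] = {}
        for document in documents:
            for token in document:
                # Assign the next free id the first time a token is seen.
                self.token2id.setdefault(token, len(self.token2id))


def select_vocabulary(keys: Union[Iterable[str], FakeDictionary], known: Set[str]) -> List[str]:
    """Mirror the if/else added in this commit (illustrative name, not a gensim API)."""
    if isinstance(keys, FakeDictionary):
        # Dictionary branch: every token is kept, even ones absent from `known`.
        return list(keys.token2id)
    # Iterable branch: drop unknown keys, deduplicate, then sort.
    return sorted(set(key for key in keys if key in known))


known = {"computer", "human", "interface"}  # plays the role of `key in self`
from_iterable = select_vocabulary(["human", "zebra", "computer", "human"], known)
from_dictionary = select_vocabulary(
    FakeDictionary([["human", "zebra"], ["computer", "human"]]), known
)
print(from_iterable)    # ['computer', 'human']
print(from_dictionary)  # ['human', 'zebra', 'computer']
```

Note that the two branches treat unknown keys differently in this commit: the iterable branch filters them out, while the dictionary branch passes every token through. Later commits in the list above (d81df64, "Don't sort keys in KeyedVectors.vectors_for_all, just deduplicate", and 6a8c688, "Order dictionary by decreasing cfs in KeyedVectors.vectors_for_all") suggest the ordering logic changed again before the merge, so this sketch reflects only the single commit shown here.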
7 changes: 3 additions & 4 deletions gensim/similarities/termsim.py
@@ -114,11 +114,10 @@ class WordEmbeddingSimilarityIndex(TermSimilarityIndex):
 >>> from gensim.models.word2vec import LineSentence
 >>> from gensim.similarities import WordEmbeddingSimilarityIndex
 >>>
->>> corpus = common_texts
->>> model = FastText(corpus, vector_size=20, min_count=1) # train word-vectors on a corpus
+>>> model = FastText(common_texts, vector_size=20, min_count=1) # train word-vectors on a corpus
 >>> different_corpus = LineSentence(datapath('lee_background.cor'))
 >>> dictionary = Dictionary(different_corpus) # construct a vocabulary on a different corpus
->>> word_vectors = model.wv.vectors_for_all(dictionary.token2id) # remove OOV word-vectors and infer new words
+>>> word_vectors = model.wv.vectors_for_all(dictionary) # remove OOV word-vectors and infer new words
 >>> assert len(dictionary) == len(word_vectors) # all words from our vocabulary received their word-vectors
 >>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors)

@@ -433,7 +432,7 @@ class SparseTermSimilarityMatrix(SaveLoad):
 >>> model = Word2Vec(common_texts, vector_size=20, min_count=1) # train word-vectors
 >>> annoy = AnnoyIndexer(model, num_trees=2) # use annoy for faster word similarity lookups
 >>> dictionary = Dictionary(common_texts)
->>> word_vectors = model.wv.vectors_for_all(dictionary.token2id)
+>>> word_vectors = model.wv.vectors_for_all(dictionary)
 >>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors, kwargs={'indexer': annoy})
 >>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
 >>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, symmetric=True, dominant=True)