
[WIP] Add new restrict_vocab functionality, most_similar_among #1229

Closed
wants to merge 4 commits into from

Conversation

shubhvachher
Contributor

Fix #481 .

I have looked at the suggestions there and implemented the fix in a new function, most_similar_among (cc @gojomo). I have tested the results and they work well.

The topn parameter there gains a new behavior that became necessary because of the restricted vocabulary.

Also added the ability to get the vocabulary as a list from a trained word2vec model.

After the first review, will work on making other _among functions.

Collaborator

@gojomo gojomo left a comment


Thank you! Some code comments attached. most_similar_among() should also get a unit test.

@@ -272,6 +273,15 @@ def word_vec(self, word, use_norm=False):
else:
raise KeyError("word '%s' not in vocabulary" % word)

def get_words_from_vocab(self):
Collaborator

Isn't the property index2word list better than this – it already exists, and matches the order of word-vectors in syn0?

Contributor Author

Oh, yes! Will change! I'm still keeping get_words_from_vocab() because when I started out with gensim, it was initially confusing what the w2v.wv.vocab keys and items were for!

    def get_words_from_vocab(self):
        """
        returns the words in the current vocabulary as a list.
        """

        if(not self.index2word):
            raise ValueError("Vocabulary needs to be built before calling this function")
        return self.index2word

Collaborator

Since people might not find index2word, I can see the benefit of an accessor method. However, given that this is now a generic KeyedVectors class, the name shouldn't be 'words'-centric. I'd suggest ordered_keys.

Also, the truthiness-check of self.index2word seems unnecessary to me. Simply returning self.index2word for the caller to draw their own conclusions (from an empty list) should be plenty.
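gojomo's second point can be seen in a tiny standalone sketch (hypothetical class and method names, not gensim's actual code): an empty list is already falsy, so the explicit guard adds nothing.

```python
class KeyedVectorsSketch:
    def __init__(self, index2word=None):
        # index2word maps vector row -> key, in syn0 order
        self.index2word = index2word or []

    def ordered_keys(self):
        # Just return the list: an empty list is already falsy,
        # so no ValueError guard is needed.
        return self.index2word

kv = KeyedVectorsSketch(["the", "of", "graph"])
empty_kv = KeyedVectorsSketch()

assert kv.ordered_keys() == ["the", "of", "graph"]
assert not empty_kv.ordered_keys()  # caller draws its own conclusion
```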

@@ -336,6 +346,95 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, i
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
return result[:topn]

def most_similar_among(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None,
Collaborator

To avoid confusion with the restrict_vocab int parameter, which has a different interpretation, this parameter should have a different name. Maybe among_words?
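The two interpretations under discussion, illustrated with made-up data (parameter names are the ones proposed here; this is a sketch, not gensim's implementation). index2word is assumed sorted by descending frequency, as in word2vec.

```python
index2word = ["the", "of", "and", "graph", "minors"]

# restrict_vocab as an int: consider only the N most frequent words
restrict_vocab = 3
candidates_by_count = index2word[:restrict_vocab]

# among_words / words_list: consider exactly the given words
among_words = {"graph", "minors", "unknown"}
candidates_by_set = [w for w in index2word if w in among_words]

assert candidates_by_count == ["the", "of", "and"]
assert candidates_by_set == ["graph", "minors"]
```

The int form silently depends on frequency ordering, which is why a separate, differently named parameter avoids confusion.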

Contributor Author

@shubhvachher shubhvachher Mar 22, 2017

For lack of imagination: words_list!

if(not suppress_warning):
toWarn = "The following words are not in trained vocabulary : "
toWarn += str(restrict_vocab.difference(vocabulary_words))
warnings.warn(toWarn, UserWarning)
Collaborator

Seems this will log an attention-grabbing WARNING even if all words are present. Preferable to only log when problems occur.

Contributor Author

Ok, done. thanks for the catch.

raise ValueError("None of the words in restrict_vocab exist in current vocabulary")

restrict_vocab_indices = [self.vocab[word].index for word in words_to_use]
limited = self.syn0norm[restrict_vocab_indices] #syn0norm is an ndarray
Collaborator

I think this creates a new array (rather than just a view), which might be a significant memory cost, for large models and when the words-of-interest are a large subset of all words. Maybe this is unavoidable, but might there be any way to do the calculation against the indexes in the original syn0norm?
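gojomo's concern can be reproduced outside gensim; a minimal standalone sketch with made-up data, assuming only numpy: integer-array ("fancy") indexing copies the selected rows, while basic slicing only creates a view into the original buffer.

```python
import numpy as np

syn0norm = np.array([[0., 1.], [2., 3.], [4., 5.], [6., 7.]])

view = syn0norm[1:3]       # basic slice -> view, no copy
copy = syn0norm[[1, 2]]    # fancy indexing -> new array, rows copied

assert np.shares_memory(syn0norm, view)
assert not np.shares_memory(syn0norm, copy)

# One way to avoid copying vector rows: compute dot products against the
# full matrix, then index into the (much smaller) result vector. This
# trades extra compute for memory.
mean = np.ones(2)
idx = [1, 2]
assert np.allclose(syn0norm[idx] @ mean, (syn0norm @ mean)[idx])
```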

Contributor Author

Sure, done. Not storing limited anymore...

    words_list_indices = [self.vocab[word].index for word in words_to_use]
            # limited = self.syn0norm[words_list_indices] #syn0norm is an ndarray; storing limited might add a huge memory overhead
        else:
            raise ValueError("words_list must be set/list of words. Please read doc string")

        dists = dot(self.syn0norm[words_list_indices], mean)
        result = []
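Put together, the computation under review amounts to the following standalone sketch, using made-up unit vectors and numpy only (names mirror gensim's syn0norm/index2word, but this is a mimic, not the actual implementation):

```python
import numpy as np

index2word = ["the", "graph", "minors", "trees"]
raw_vecs = np.array([[1., 0.], [0.8, 0.6], [0.6, 0.8], [0., 1.]])
syn0norm = raw_vecs / np.linalg.norm(raw_vecs, axis=1, keepdims=True)

vocab_index = {w: i for i, w in enumerate(index2word)}
query = syn0norm[vocab_index["graph"]]       # already-unit query vector

words_list = ["minors", "trees"]             # restrict candidates to these
idx = [vocab_index[w] for w in words_list]

dists = syn0norm[idx] @ query                # cosine sims vs candidates only
order = np.argsort(dists)[::-1]              # best first
result = [(words_list[i], float(dists[i])) for i in order]

assert result[0][0] == "minors"              # closest restricted neighbor
```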

@@ -1180,9 +1180,16 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
self.wv.syn0[self.wv.vocab[word].index] = weights
logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

def get_words_from_vocab(self):
Collaborator

For new KeyedVectors functionality, I don't think we need to add forwarding methods to Word2Vec – those exist to ease the way for older code; any new code can access the KeyedVectors methods directly. (This goes for this method and most_similar_among().)

Contributor Author

@shubhvachher shubhvachher Mar 22, 2017

hmm, I was confused at first myself, but then I just did it to maintain the coding style of word2vec.

Do you think it might be confusing for newcomers if we don't have the forwarding methods? It took me quite a while to figure out that KeyedVectors has all the after-training functionality of word2vec!

Also, most of the word2vec functions don't have docstrings, and one has to read the docstring of w2v.wv to figure things out. That might throw people off. Would you like me to add docstrings for the forwarding functions?

Contributor Author

@shubhvachher shubhvachher Apr 6, 2017

I have added only this method to class Word2Vec in order to solve the original "people might not find index2word" problem. All other forwarding functions have been removed.

@tmylk
Contributor

tmylk commented Mar 22, 2017

Thanks for the feedback! Yes, having more docstrings is very useful: explaining what each function does, and forwarding to the KeyedVectors docs for more detail.

@shubhvachher
Contributor Author

shubhvachher commented Mar 22, 2017

Unit testing is going to be comprehensive (CBOW, SG, cosmul?). I'll commit the changes made so far. Adding unit tests soon.

@shubhvachher
Contributor Author

@tmylk Will do. Also, what do you think about @gojomo's last review suggestion? Should I add forwarding functions to the Word2Vec file for new functions in KeyedVectors or are we going to ask people to use the KeyedVectors class always?

@tmylk
Contributor

tmylk commented Mar 22, 2017

Please avoid adding new forwarding functions. KeyedVectors is the way to go.

@shubhvachher
Contributor Author

I've only added docstrings in the commit that fails tests. Any idea what broke? Travis says:

======================================================================

FAIL: testPipeline (gensim.test.test_sklearn_integration.TestSklearnLDAWrapper)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_sklearn_integration.py", line 108, in testPipeline

    self.assertGreater(score, 0.50)

AssertionError: 0.5 not greater than 0.5

----------------------------------------------------------------------

I haven't changed anything related, but I found this within the test_word2vec.py file (not the test that broke):

if not hasattr(TestWord2VecModel, 'assertLess'):
    # workaround for python 2.6
    def assertLess(self, a, b, msg=None):
        self.assertTrue(a < b, msg="%s is not less than %s" % (a, b))

    setattr(TestWord2VecModel, 'assertLess', assertLess)

Do you think something similar is needed?

@tmylk
Contributor

tmylk commented Mar 22, 2017

Thanks for reporting this. Please change the test in testPipeline to be self.assertGreater(score, 0.40). It is a new test.

@shubhvachher
Contributor Author

shubhvachher commented Mar 22, 2017

Ok, could you please also check #1233? If it's proper, I'll add that fix here as well.

@shubhvachher
Contributor Author

added fix for #1233

@shubhvachher
Contributor Author

This is done. When you're free, could you please look this over, @gojomo? Thanks!


@@ -336,6 +346,106 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, i
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
return result[:topn]

def most_similar_among(self, positive=[], negative=[], topn=10, words_list=None, indexer=None,
suppress_warnings=False):
Collaborator

I think suppress_warnings adds extra complication for little benefit.

Contributor Author

Actually, calculating the difference of the sets (which is needed to generate this warning) can be an unnecessary overhead for people using this code in a production environment. I have added a logging.info call to inform users of this.
Sometimes it is not possible to remove all non-vocabulary words from the words list, for example if the words list is generated at runtime. In that case suppress_warnings would be ideal.


if not suppress_warnings:
missing_words = words_list.difference(vocabulary_words)
if(not missing_words): # missing_words is empty
Collaborator

Not usual python if style.

Contributor Author

@shubhvachher shubhvachher Apr 6, 2017

Changed for every occurrence
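For reference, the idiomatic spelling of the check being discussed: no parentheses around the condition, and truthiness instead of an explicit emptiness comparison.

```python
missing_words = set()

# non-idiomatic: if(not missing_words):
if not missing_words:        # an empty set is falsy
    status = "all words present"
else:
    status = "missing: %s" % sorted(missing_words)

assert status == "all words present"
```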

else:
toWarn = "The following words are not in trained vocabulary : "
toWarn += str(missing_words)
warnings.warn(toWarn, UserWarning)
Collaborator

Seems more like something to log than use warnings.

Contributor Author

Done. Yes, you're right. The module uses logging throughout; the extra import and use of warnings was not a good choice.
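The pattern the reviewers converge on, sketched standalone with the stdlib logging module (the function name check_vocab and the handler are hypothetical, for demonstration only): log via the module logger, and only when some words are actually missing.

```python
import logging

logger = logging.getLogger("sketch")
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)

def check_vocab(words_list, vocabulary_words):
    missing_words = set(words_list) - set(vocabulary_words)
    if missing_words:  # log only when there is actually a problem
        logger.warning("words not in trained vocabulary: %s",
                       sorted(missing_words))

check_vocab(["graph", "trees"], ["graph", "trees"])  # all present -> silence
check_vocab(["graph", "zzz"], ["graph", "trees"])    # one missing -> one record

assert len(captured) == 1
assert "zzz" in captured[0]
```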

@@ -355,6 +355,8 @@ class Word2Vec(utils.SaveLoad):
"""
Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/

If you're finished training a model (=no more updates, only querying), then switch to the :mod:`gensim.models.KeyedVectors` instance in wv
Collaborator

Line-length, here and in many following comments.

Contributor Author

Tried to follow PEP 8, please do check now.

@@ -1181,33 +1187,83 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
"""
Please refer to the documentation for `gensim.models.KeyedVectors.most_similar`
In the future please try and use the `gensim.models.KeyedVectors` instance in wv
Collaborator

Enough to say "In the future please use..." (no "try and"). If this is a strong recommendation perhaps method should be marked deprecated. Also, does this sort of comment-cleanup belong in PR with new most_similar_among feature?

Contributor Author

@shubhvachher shubhvachher Apr 6, 2017

Changed this and split into a separate PR.

@@ -105,7 +105,7 @@ def testPipeline(self):
text_lda = Pipeline((('features', model,), ('classifier', clf)))
text_lda.fit(corpus, data.target)
score = text_lda.score(corpus, data.target)
self.assertGreater(score, 0.50)
self.assertGreater(score, 0.40)
Collaborator

I know @tmylk requested this change but it seems it should go in its own fixup PR by someone who understands that test.

Contributor Author

@shubhvachher shubhvachher Apr 6, 2017

Split into a separate PR. Thank you though! Taught me a lot about git and got me my first commit :)

@@ -461,6 +462,141 @@ def test_cbow_neg(self):
min_count=5, iter=10, workers=2, sample=0)
self.model_sanity(model)

def test_most_similar_among_CBOW(self):
Collaborator

Because most_similar_among() functionality is merely on a set of vectors, it's not dependent on the mode-of-training – so there's no need for multiple models built different ways. Testing against one vector set, trained if necessary in the most simple/default way, would be sufficient. (In fact, it'd ultimately make even more sense to be inside a set of KeyedVectors tests, rather than Word2Vec tests, but since its sibling methods are tested here it's not necessary to make that big change yet.)

Contributor Author

Sure, removed the extra tests; using the CBOW model to check the functionality.

That second part might be important given @tmylk's comment

KeyedVectors is the way to go.

What do you think, @tmylk?

self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
#Testing Warnings
Collaborator

As noted previously, I think the suppress_warnings toggle adds unnecessary complication. Also, that logging may be more appropriate than warnings.

Contributor Author

Logging is now used. suppress_warnings becomes useful when the words list is generated on the fly and the user does not want to be warned about missing words.

model = word2vec.Word2Vec(sentences, size=2, sg=0, min_count=1, hs=1, negative=0)
self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
Collaborator

Usual python style is two-spaces before an end-of-line #-comment, and a space after the # before text. (Applies many other places as well.)

Contributor Author

Changed all; thank you for pointing it out.

@shubhvachher
Contributor Author

shubhvachher commented Mar 25, 2017

Thank you for the comprehensive review @gojomo ! I will work on this asap.

Edit: asap being 31st! Sorry, I have mid semesters going on here.

@tmylk
Contributor

tmylk commented Apr 3, 2017

@shubhvachher Please continue this pr instead of opening a new one in #1255.

@shubhvachher
Contributor Author

shubhvachher commented Apr 6, 2017

Ah, sorry. I am working on develop here; will remember not to do that in the future!

So, changes made: moved the drive-by fixes into separate PRs and replied to the comments above. Rebased up to now.

Contributor

@tmylk tmylk left a comment


There is a lot of code duplication between most_similar and most_similar_among. I suggest adding a new words_list param to the most_similar function.

The difference is just
dists = dot(self.syn0norm[words_list_indices], mean) vs dists = dot(self.syn0norm, mean)

Please raise an exception if both restrict_vocab and words_list are passed.
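tmylk's suggestion might look roughly like the sketch below (the function name and exact behavior are assumptions for illustration, not gensim code): one candidate-selection path, with the two restriction modes mutually exclusive.

```python
def select_candidate_indices(index2word, restrict_vocab=None, words_list=None):
    # Exactly one restriction mode may be given.
    if restrict_vocab is not None and words_list is not None:
        raise ValueError("pass either restrict_vocab or words_list, not both")
    if words_list is not None:
        vocab_index = {w: i for i, w in enumerate(index2word)}
        return [vocab_index[w] for w in words_list if w in vocab_index]
    if restrict_vocab is not None:
        # index2word is frequency-ordered, so a prefix is the N most frequent
        return list(range(min(restrict_vocab, len(index2word))))
    return list(range(len(index2word)))

index2word = ["the", "of", "graph", "minors"]
assert select_candidate_indices(index2word, restrict_vocab=2) == [0, 1]
assert select_candidate_indices(index2word, words_list=["minors", "zzz"]) == [3]
try:
    select_candidate_indices(index2word, restrict_vocab=2, words_list=["graph"])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```

most_similar would then compute dists over syn0norm[indices] in the single remaining code path.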

if topn is False:
pass
else:
if suppress_warnings is False:
Contributor

This must be an exception, not a warning. Incorrect input can't be suppressed.

logger.info("This warning is expensive to calculate, " \
"especially for large words_list. " \
"If you would rather not remove the missing_words " \
"from words_list please set the " \
Contributor

Better message: "Please intersect with the vocabulary, words_to_use = vocabulary_words.intersection(words_list), prior to calling most_similar_among."
Please remove the suppress_warnings flag.


words_list_indices = [self.vocab[word].index for word in words_to_use]
# limited = self.syn0norm[words_list_indices]
# Storing 'limited' might add a huge memory overhead so we avoid doing that
Contributor

Please memory-profile this code to substantiate this statement.
Please remove the commented-out code.


"""

if isinstance(words_list, int):
Contributor

Please use a single check and a single raise ValueError.
The most_similar function doesn't take a list of ints, so it should not be mentioned here.

@tmylk tmylk changed the title Added new restrict_vocab functionality, most_similar_among [WIP] Add new restrict_vocab functionality, most_similar_among Apr 10, 2017
@tmylk
Contributor

tmylk commented May 2, 2017

Ping @shubhvachher. This is useful functionality; it would be good to finish this PR.

@shubhvachher
Contributor Author

Yup, sorry! Finishing up exams here. On this asap.

@menshikh-iv
Contributor

Ping @shubhvachher, what is the status of this PR? Will you finish it soon?

@shubhvachher
Contributor Author

Yes @menshikh-iv; my bad. I've been travelling and working, but I do have access to my laptop. I'll get to this over the coming weekend if that's ok!

@menshikh-iv
Contributor

Ok @shubhvachher, we will wait.

@menshikh-iv
Contributor

ping @shubhvachher

@menshikh-iv
Contributor

I close current PR because this looks abandoned.

@shubhvachher
Contributor Author

This is surely my fault, for lack of time amid my summer work and travels. I will open this again as soon as I get the chance, if it's still unresolved. Apologies,
Shubh

@piskvorky
Owner

@shubhvachher please do. The (semi-automated) closing of stale tickets is a maintenance thing from our side, not an indication we're not interested in the PR!
