Vectorize word2vec.predict_output_word for speed #3153
Conversation
…ed a call to sum to numpy.sum to gain performance.
…sibility for the user to input a list of word indices as parameter 'context' instead of a list of words.
Thanks. When you say The
Here are some results done using my computer, using np.sum, str vs int input for context:
Thanks for the details, but can you post the overall numbers? From a larger run, end-to-end. Because isolated micro-benchmarks of "from 0.01314 to 0.00295" can be misleading. Thanks!
I ran my script on the first 100 samples of
gensim/models/word2vec.py
Outdated
logger.warning("All input context word indices must be non-negative.")
return None
# take only the ones in the vocabulary
word2_indices = word2_indices[word2_indices < self.wv.vectors.shape[0]]
Suggested change:
- word2_indices = word2_indices[word2_indices < self.wv.vectors.shape[0]]
+ max_index = self.wv.vectors.shape[0]
+ word2_indices = np.array([index for index in context_words_list if 0 <= index < max_index])
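A quick standalone sketch of the two filtering styles under discussion. The names `max_index` and `word2_indices` mirror the snippet above, but the values here are made-up toy data, not taken from gensim; the point is only that the numpy boolean mask and the suggested list comprehension keep the same in-vocabulary indices (the comprehension additionally drops negatives in the same pass).

```python
import numpy as np

# Toy stand-ins: max_index plays the role of self.wv.vectors.shape[0].
max_index = 5
word2_indices = np.array([0, 3, 7, 2, 9])

# NumPy-only boolean mask, as in the original line
# (assumes negatives were already rejected earlier):
masked = word2_indices[word2_indices < max_index]

# Suggested list-comprehension form; also drops negatives in one pass:
filtered = np.array([i for i in word2_indices if 0 <= i < max_index])

print(masked)    # [0 3 2]
print(filtered)  # [0 3 2]
```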
Does this change only aim at making things clearer? Doesn't it slow the code down with this list operation instead of numpy-only operations?
Generally a project like Gensim is interested in 1st, correctness; 2nd, clarity; then 3rd, performance - and even then, mainly considering performance with respect to avoiding really-slow/wasteful approaches, or choosing equally-clear simple optimizations, or optimizing heavily-used code-paths that show up as being a problem in real-use-case profiling.
A single list-filter-iteration-via-a-list-comprehension is a very common Python idiom that should more-or-less be considered "costless", except if and when you have special reason for intense performance concerns about its likely uses. And here, since this method was motivated by the simulation of a word2vec "context window", and such windows are usually pretty small (the default window=5 means a 10-word input, at most), juggling a tiny list is unlikely to be an important bottleneck.
(But again: I suspect suggestions below may make this moot. And it's possible, if use of this method became common in ways we didn't expect – such as using giant input lists, or in a high-volume tight-loop that's key to some tasks' overall performance – it might become a concern. But those problems, if they materialize, might outgrow this method entirely.)
gensim/models/word2vec.py
Outdated
else:
    # then, words were passed. Retrieve their indices
    word2_indices = [self.wv.get_index(w) for w in context_words_list if w in self.wv]
    if not word2_indices:
How would it affect the performance if you put the "integer indices" check after the "string indices"? That is, "if strs, convert to ints => continue always with ints, prune 0 <= index < vectors.shape[0], test ints for all-empty, etc". It would unify the two code paths: no need to duplicate code and test twice. More DRY & readable code.
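The unification proposed above could look roughly like the sketch below. This is a hypothetical standalone version, not gensim code: `key_to_index` and `vocab_size` are toy stand-ins for the model's vocabulary, and `normalize_context` is an illustrative helper name. Strings are converted to indices first, then a single int-only path prunes out-of-range values and tests for emptiness once.

```python
import numpy as np

# Toy stand-ins for the model's vocabulary (illustrative only).
key_to_index = {"cat": 0, "dog": 1, "fish": 2}
vocab_size = 3

def normalize_context(context):
    """If strs, convert to ints; then continue with a single int path."""
    # Step 1: replace string keys with their indices, skipping unknown words.
    indices = [key_to_index[w] if isinstance(w, str) else w
               for w in context
               if not isinstance(w, str) or w in key_to_index]
    # Step 2: prune 0 <= index < vocab_size in one pass.
    indices = [i for i in indices if 0 <= i < vocab_size]
    # Step 3: a single emptiness test covers both kinds of input.
    if not indices:
        return None  # the caller may log a warning here, as the original code does
    return np.array(indices)

print(normalize_context(["cat", 2, "unknown", 99]))  # [0 2]
```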
Or indeed, like @gojomo says, unify even more and accept mixed strings-and-ints inputs, in a single code path.
I don't know about performance, but less duplication sounds better.
I do think the most-compact & flexible code would accept a list-of-words-or-word-int-indexes, then convert it to all indexes (by replacing any strings with ints, in a single list pass).
If any bad indexes are passed, or words not in the model, whatever errors those unchecked conditions trigger are probably clear enough: IndexError or KeyError, with a stack through the caller code, and this method's code, that uses a bad parameter. So essentially no overhead of parameter-checking.
I'm especially willing to consider this low-overhead approach because I'd categorize this method as 'advanced' and 'experimental' - it's not a typical use/function of Word2Vec. The method only works for one mode (negative-sampling), and doesn't (without weighting) truly approximate what a training-time negative-sampling prediction would be. The need to sort the activation values of all potential vocabulary words before returning the top-N means it's an inherently fairly-expensive operation. As such, it already has many caveats (which should be reflected in its doc-comment) - and the extra expectation that it should only be called with sensible arguments is no extra burden.
But, also, that KeyedVectors.get_index() will already take a string-or-int, and return an int index. So I think in fact choosing to use that .get_index() method would ensure a single pass over the word-list would always throw a descriptive KeyError if either the string-word isn't present, or the int-index is out-of-range.
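A minimal sketch of the "no parameter-checking" single-pass approach described above. The `get_index` helper here is a toy stand-in mimicking the string-or-int behavior attributed to `KeyedVectors.get_index()` in this thread, with made-up vocabulary data; errors from bad inputs are simply allowed to propagate.

```python
# Toy vocabulary (illustrative only, not gensim's internals).
key_to_index = {"cat": 0, "dog": 1}
index_to_key = ["cat", "dog"]

def get_index(key):
    """Accept a string key or an int index; return an int index."""
    if isinstance(key, int):
        index_to_key[key]          # raises IndexError if out of range
        return key
    return key_to_index[key]       # raises KeyError if the word is unknown

# Single pass over a mixed strs-and-ints context, no explicit checks:
context = ["cat", 1, "dog"]
indices = [get_index(k) for k in context]
print(indices)  # [0, 1, 1]
```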
But, this much parameter-checking is atypical for Gensim, and lengthens the method quite a bit. My sense is that if someone's advanced enough to be passing ints, checking for bounds isn't necessary. (And maybe negative "from-the-end" indexing might even make sense in some obscure case?) More generally, I tend to think that if the downstream errors caused by unchecked bad parameters are reasonably descriptive, it's OK to let them happen. OTOH, if none of the supplied words are known, that may be anomalous enough that raising an exception is more sensible than a logged warning.

And I suspect it'd be just as compact/idiomatic/efficient to allow mixed ints & strings. (Use a list comprehension to replace any string-keys with ints.)

ASIDE: If this method were to get a deeper refactoring for broader utility:
@gojomo I'm happy if you're OK with mixing ints and strs!

About parameter-checking, I did it thinking: let's not break the code too much. But for my use, all these checks are not necessary.

About negative indexing, indeed it felt weird to write, since if there are negative indices, it is probably not by chance. I'm OK with on-purpose negative indexing now that you mention it. What if a too-large int is passed? There will be an IndexError; do you find that informative enough? I'm all for sparing time spent in useless checks (even more so as gensim is already heavily optimized), but I'm also for robust code.

Raising an exception if no word is known is usually OK for me... However, the previous code used logged warnings, so we should change this part too, right? A warning has the advantage of not interrupting the execution, and if it only happens in rare cases, it enables processing a whole corpus even if some rare parts behave badly.

I hadn't thought of mixing ints and strings... Why not, but is it really useful in any case? At any rate, if it does not make the code too complex and makes it more modular, there's no reason not to do it. The way @piskvorky describes it in a comment above feels natural.

On the broader refactoring, I don't get it all, but:
Finally, @gojomo, about your screenshot: it seems solved. If I'm guessing right, since this is my first contribution, something had to be approved, and it looks like @piskvorky approved it.
@M-Demay It seems both @piskvorky & me are OK with words being specified by any mix of string-keys and int-indexes – as it simplifies the code quite a bit – so it'd be good to see a PR update taking that approach. And as noted in my per-line comment, the use of

Regarding the broader refactoring: those aren't things this PR would need to address – just related thoughts I wanted to capture.

The idea of numerically assessing a model by the % of predictions it gets right (or close, in a top-N sense) is interesting & potentially more human-interpretable than a model 'loss' number... but note that among many models, the one model that gets the most predictions right may not give the best word-vectors. For example, an overfit model can have a tiny loss (or best-possible prediction-accuracy), but create word-vectors less useful for any purposes outside that original dataset.

That full prediction inherently requires evaluating the model's value for every output-vocabulary word – while the essence of 'negative sampling' is to do a tiny subset of that work that's sufficient to approach the same results.
Finally had a look at this. The changes make sense to me. @M-Demay please respond to the individual suggestions from @piskvorky, push your changes and then we can merge this.
…ng to trying to make it more compact and versatile.
Force-pushed from 97e687e to 84258b4.
I tried to take your comments into account (I hope I got them all right):
Remark: first I committed the comment changes from @piskvorky, but then I realised it was useless since I had refactored the code, hence the

Feel free to request changes on the format of my comments, or on the code itself again.
* Retained the suggested `sum` -> `np.sum` replacement, which has been tested to yield significant runtime gains.
* Dropped unnecessary type/value checks that are already run when calling the `KeyedVectors.__contains__` dunder method.
* Corrected the docstring to accurately document the supported inputs (which were already compatible prior to the PR this commit is a part of).
Current minimalist optimization (just
Can you please add some tests for the new functionality? We need to see that specifying words via ints (and ints mixed with strings) gives predictable results.
Thank you!
If you look at the current diff, there's no new functionality, only a 1-liner optimization to old functionality that leaves all existing tests working. The only reference to different functionality is that the comment has been changed to more accurately reflect what seems like it should work. I don't think that straightforward optimization should be held up awaiting new tests for latent functionality that wasn't added in this PR.

If making the extra claim of mixed str-and-int working in the comment, without a test to cover it, is a concern, it'd make more sense to roll back that comment claim. That mixed str-and-int might work will then just be a bonus if someone tries it or reads the code.
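For reference, the kind of test requested earlier could be sketched as below. This is a hedged, self-contained illustration, not a gensim test: `predict_scores`, `vectors`, and `key_to_index` are toy stand-ins whose prediction depends only on the resolved indices, so a context given as words, as ints, or mixed should yield identical scores.

```python
import numpy as np

# Toy 4-word model (illustrative stand-in for a trained Word2Vec).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 3))
key_to_index = {"a": 0, "b": 1, "c": 2, "d": 3}

def predict_scores(context):
    """Score every vocab word against the mean of the context vectors."""
    idx = [key_to_index[w] if isinstance(w, str) else w for w in context]
    mean = np.sum(vectors[idx], axis=0) / len(idx)   # np.sum, as in the PR
    return vectors @ mean                            # one activation per vocab word

# Words, ints, and a mix should give predictable, identical results:
str_scores = predict_scores(["a", "c"])
int_scores = predict_scores([0, 2])
mixed_scores = predict_scores(["a", 2])
assert np.allclose(str_scores, int_scores)
assert np.allclose(str_scores, mixed_scores)
```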
Yes, that would also be fine.
I think this PR turned out to have the highest comments-to-lines-changed ratio in Gensim history :) Thanks for your diligent work @M-Demay!
Well, thank you @piskvorky, I did not think I would be so famous so fast.
Finally merged this. Thank you @M-Demay for your effort and your patience!
Motivation: in a project of mine I was doing repeated calls to gensim.models.word2vec.Word2vec.predict_output_word and it was taking very long (really too long). I dug into the source code and found out that there was a call to the built-in function sum. I replaced it with a call to np.sum.

As I was doing many calls to gensim.models.word2vec.Word2vec.predict_output_word, I thought I was calling gensim.models.word2vec.wv.get_index redundantly. Consequently, I wanted to add the possibility for the user (me in the first place) to call gensim.models.word2vec.Word2vec.predict_output_word with the parameter context as a list of word indices, not words. So I implemented this as well. In the process I added several sanity checks for the sake of code robustness and made the API compatible with the existing behaviour (for example, using logger.warning). I also updated the doc of the method to make the new behaviour explicit.

Note that on line 1842 of gensim/models/word2vec.py (version I am pushing) I removed the condition if not word2_indices, because it was already tested before.

In the end I gained a factor of about 4 in per-call time spent in gensim.models.word2vec.Word2vec.predict_output_word by replacing sum with np.sum in an experiment of mine. With the second change (context as a list of indices, not words) I gained 33% per-call execution time, for a very appreciable overall time gain. I didn't investigate this time gain further, but these experiments were made on a single run with several thousand calls to gensim.models.word2vec.Word2vec.predict_output_word.

Remark: I did $ tox -e py36-linux and the last 4 tests in gensim/test/test_api.py failed. However, note that they also failed before I made my changes. They seem to be network-related (I am working behind a proxy which daily causes me trouble) and I do not think my changes affected them. No other test failed.

Suggestion: unit tests might be added to check / improve code robustness in the case when the user gives context as a list of ints. I did not take time to do it so far.
Could you please notify me if and when these changes are released in official versions (especially with pip)? Thank you for reviewing!
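To illustrate the core optimization discussed throughout this thread, here is a standalone sketch (toy data, not the gensim source) of summing context vectors both ways. The builtin sum iterates the rows in a Python-level loop, allocating an intermediate array per addition, while np.sum performs one vectorized reduction; both produce the same result.

```python
import numpy as np

# Toy context: 10 context words, 100-dimensional vectors (illustrative only).
context_vectors = np.random.rand(10, 100)

slow = sum(context_vectors)               # builtin sum: Python loop over rows
fast = np.sum(context_vectors, axis=0)    # single vectorized reduction in C

# Same values either way; only the speed differs.
assert np.allclose(slow, fast)
```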