[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899
Conversation
```diff
 total_time = 0
 for _ in range(queries):
-    rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv.vocab))]
+    rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv))]
```
Since `vectors_norm` is no longer a persistent cache of all unit-normed vectors, but rather (for backward compatibility) re-calculated every time it's requested, this approach will now involve a lot of redundant recalculation of the whole array. Options: (a) grab it once before the loop; (b) adapt to use `get_vector(random_index, use_norm=True)`.
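A minimal sketch of the two options, not code from this PR. It assumes a trained `model`, a `queries` count as in the benchmark snippet above, and the 4.0-dev `index_to_key` attribute for mapping integer slots back to keys (whether `get_vector()` accepts a raw integer is questioned later in this thread, so option (b) goes through a key here):

```python
import numpy as np

# (a) grab the unit-normed array once, before the loop
normed = model.wv.vectors_norm  # recomputed only this one time
for _ in range(queries):
    rand_vec = normed[np.random.randint(0, len(model.wv))]

# (b) norm only the single requested vector, per lookup
for _ in range(queries):
    key = model.wv.index_to_key[np.random.randint(0, len(model.wv))]
    rand_vec = model.wv.get_vector(key, use_norm=True)
```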
Oh that's sneaky. Isn't it better to fail outright (breaking API), rather than do an expensive operation silently, where existing code expects a cheap dict lookup?
Or, if we want to keep backward compatibility, dynamically norm just the one requested vector… I haven't checked the code, but why would we recalculate the whole array for a single `__getitem__`?
We're not recalculating for any specific request of a single vector - that does the minimum, and would be a quickie option ((b) above) to improve this code.

But if a caller asks for `vectors_norm`, it's now calculated & returned (without being cached). Arguably, `vectors_norm` was barely even part of the public API, as it wasn't guaranteed to even exist if a caller hadn't triggered `init_sims()` yet. I wouldn't expect the problematic pattern here, requesting `vectors_norm` repeatedly, to be very common, especially when callers who just need single vectors have the `get_vector(..., use_norm=True)` option.

I suppose we could (a) cache it once requested, to be more like the old behavior, but then we have to add more logic for properly handling other invalidations of this cache elsewhere, and explicitly discarding it at times (like serialization); or (b) eliminate the backward-compatible `@property` getter, either leaving it absent (to generate a cryptic error) or stubbing in an explanatory error message. In that latter case, a new distinctly-named utility method could be added, `calculate_unit_normed_vectors()`, that avoids creating the impression this is a simple access of an already-existing property.
I'm in favour of removing `vectors_norm` = hard fail with a clear message (again, noob-friendly = a prescription for "what to do next").

-1 on pretending we're a simple dict when there's a substantial calculation happening behind the scenes.

Interesting that I didn't notice this during the #2698 review; I wonder how many other subtle points around deprecations / compatibility I missed. The migration guide + testing will be essential.
Sure, `.vectors_norm` should be a hard fail with a helpful error message, similar to the one for `init_sims()`. I can add these for both as part of other touch-up work, such as that in #2892.
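A minimal sketch of what such a hard-fail property could look like; the message wording here is illustrative, not the text that ultimately landed in gensim:

```python
class KeyedVectors:
    @property
    def vectors_norm(self):
        # Hard fail with a prescription for "what to do next" (illustrative wording)
        raise AttributeError(
            "vectors_norm was removed in Gensim 4.0.0. "
            "Use get_vector(key, use_norm=True) for one unit-normed vector, "
            "or get_normed_vectors() for the full normed array."
        )
```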
Alright. I'll add some tentative fix here (so I don't merge "known to be bad" code), merge this PR, and then let's continue with this in another PR, such as #2892.
@gojomo pushed now. Let me know if OK and I'll merge this PR, to be continued elsewhere.
The reported
(Simulating backward-compatible behavior would be very hard & inefficient, & after the bewildering mess of different things all called 'vocab' in the prior refactoring, I preferred as little reuse of that general term as possible.)
Per my source-line-targeted-comments, I think
The numpy performance regression in the legacy PRNG led me to update the code to use their new `default_rng()` Generator API. I personally feel it's OK (and even beneficial) for fresh gensim releases to require fairly-recent releases of those libraries that are more 'central'/large/many-contributor, like numpy & scipy.
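For context, a small sketch of the legacy module-level PRNG versus the newer Generator API (available since numpy 1.17); the exact call sites changed in this PR aren't shown here:

```python
import numpy as np

# legacy global PRNG (module-level functions backed by RandomState)
legacy_draw = np.random.randint(0, 10)

# newer Generator API, with better performance characteristics
rng = np.random.default_rng(seed=42)
new_draw = rng.integers(0, 10)
```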
An aside on approximate-nearest-neighbor libraries: there's a recent new option from Google Research – ScaNN – which they show as far outperforming ANNOY & FB Faiss.
Yeah, thanks. I think I created a Gensim ticket for ScaNN when it came out, although I cannot seem to find it now.
Yes, upgrading numpy is not a problem. I'd expect our tests to fail, though. I'm pretty sure we were testing against the oldest supported libs explicitly, but with all the recent changes to the testing framework (Azure etc.) it's possible that fell through the cracks. We should put it back if possible, @mpenkov WDYT?
One more idea: once we have the migration guide in place, use it on the other tutorials (beyond `run_annoy.py`). The expectation is that all should run smoothly, with no errors or deprecation warnings: a stress-test for our migration guide. The added value is that we refresh the auto-generated notebooks & rst, because I expect many of the outputs & timings there will change with 4.0.0.
(Force-pushed from 05282f5 to 0ee9008.)
@mpenkov I'll go ahead and merge this once #2899 (comment) and #2899 (comment) are resolved, because I'm worried about getting out of sync again. But please still review when you have a moment.
```python
)

def get_normed_vectors(self):
    # TODO: what's the way for users to get from a matrix index (integer) to the
```
@gojomo FYI
`get_vector(key_or_int, use_norm=True)` returns an individual normed vector, by key or raw integer slot. If for some reason a user wants a KeyedVectors that only has the normed vectors, the method `unit_normalize_all()` (with a suitable warning about the destructiveness of this operation) is available below. (That thus matches the previous `init_sims(replace=True)` behavior, for those that really need it, but I believe most users of that call were just trying to save memory in a way that's no longer necessary.)

We could add a convenience method for creating a separate new KV instance with unit-normed vectors, but as that goes above-and-beyond prior behavior, & I don't offhand know anyplace that's necessary, I'd rather not bloat this class with more "maybe someone somewhere needs it" functions. (I'd much rather move functions out, first, as per concerns described in #2873.) And if this were a real need, the right way to support it would likely be a generic deep-copy of the KV followed by a `unit_normalize_all()` on the new KV.
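A two-line sketch of that last suggestion, assuming a loaded `model` and the `unit_normalize_all()` method this PR adds:

```python
import copy

normed_kv = copy.deepcopy(model.wv)  # leave the original raw vectors intact
normed_kv.unit_normalize_all()       # destructive, but only on the copy
```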
Yes. The idea here was to fuse `unit_normalize_all()` with `get_normed_vectors()`: have a single function which returns a new KV instance, normalized. I think that's cleaner than several in-situ modifiers/getters (more functional, less state), but it's true the use-case is quite hypothetical.
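A hypothetical sketch of that fused shape, written as a standalone function for clarity; names and internals are illustrative only, not the code in this PR, and it assumes the KV object stores its raw array in `.vectors`:

```python
import copy
import numpy as np

def get_normed_vectors(kv):
    """Return a new KeyedVectors-like object with unit-normed vectors; kv is untouched."""
    new_kv = copy.deepcopy(kv)
    norms = np.linalg.norm(new_kv.vectors, axis=1, keepdims=True)
    new_kv.vectors = new_kv.vectors / norms
    return new_kv
```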
> get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot

Hmm, I don't see that in the code. How does the "raw integer slot" option work?
I removed the `int` option from the docstring, because I don't think it exists.
```diff
@@ -1454,7 +1464,7 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None,
         if not (i == val):
             break
         index_id_count += 1
-    keys_to_write = chain(range(0, index_id_count), store_order_vocab_keys)
+    keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys)
```
Elsewhere, I've followed a growing convention of treating `itertools` as a package with a 'well known standard abbreviation' of `it`, so `import itertools as it`. I'd prefer it if gensim's project guide specifically recommended this for a small set of packages: `np`, `pd`, `it`, and maybe `pkl`, as it matches wider practices & simplifies import management. (That's especially true for numpy & itertools, where the prior practice of naming exactly which functions were used meant a jump-around typing burden of changing the imports every time some single function was used or removed from use.) But whatever project style can be unambiguously declared somewhere authoritative is OK.
No preference from me. This one I changed merely because I saw two side-by-side imports, `import itertools; from itertools import chain`, which looked weird, so I merged them into a single `import itertools`.
```diff
@@ -1264,14 +1263,14 @@ def save(self, *args, **kwargs):
         kwargs['ignore'] = kwargs.get('ignore', ignore_attrs)
         super(FastTextKeyedVectors, self).save(*args, **kwargs)

-    def get_vector(self, word, use_norm=False):
+    def get_vector(self, word, norm=False):
```
`use_norm` was an unfortunate choice of parameter-name which I believe dated back to #1777, & I considered changing it, but didn't want to have to change all uses elsewhere & add another compatibility burden on users. But if you're entertaining breaking changes:

(1) not sure `norm` is a good name either - it makes it sound like you're requesting the vector's (scalar) norm/length;

(2) even supporting this via an optional extra parameter on a dual-use method strikes me as silly/opaque, too. Instead, have `__getitem__` be the implementation of the default behavior, and add a separate explicitly-named method for the unit-normed vector, like perhaps `get_unit_vector(key_or_int)`, with no optional parameters;

(3) more aggressively, maybe let users toggle the model to give either unit-vecs or raw-vecs as the default `__getitem__` behavior, but have both `get_unit_vector()` and `get_raw_vector()` as alt options for when a caller needs to specify. Then the desire you've expressed elsewhere to have unit-normed access via plain dict-bracket accesses is available when toggled on, but the raw data is never destroyed, preventing a class of errors/regrets. I'd put this thought aside as a somewhat complicated "maybe in the future" idea, but if you're OK to break the `get_vector()` signature & want to prioritize unit-normed access, maybe it's worth making that change now.
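A hypothetical sketch of option (3); every name here is illustrative, not an API this PR adds:

```python
import numpy as np

class ToggleableKV:
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)  # raw data, never destroyed
        self.unit_default = False  # the proposed toggle for __getitem__ behavior

    def get_raw_vector(self, index):
        return self.vectors[index]

    def get_unit_vector(self, index):
        vec = self.vectors[index]
        return vec / np.linalg.norm(vec)

    def __getitem__(self, index):
        # plain bracket access follows the toggle; explicit methods always available
        if self.unit_default:
            return self.get_unit_vector(index)
        return self.get_raw_vector(index)
```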
TBH I find the idea of producing a new standalone (normed) KV instance cleaner than state toggles.

I'll merge here lest we accrue a monster PR again; let's revisit this in one of your other PRs.
- failing with weird errors on py3.7+3.8, see https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862
@mpenkov some of the tests unrelated to this PR (I think) started failing: https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862 I disabled them for now and opened a new ticket to look into it: #2908. Also, how do you check the Circle CI results? Clicking on the "Details" link here brings up a page which asks me to sign in with GitHub, with read & write access to my repos, which is not happening. How do you view the actual CI results?
Yes.

No, that's not the case. When you
@piskvorky For me, I just click on "Details" and I can see the results for the relevant build.
Weird. Somebody else will have to inspect the Circle CI results then, I cannot access them.
Supersedes #2821.

- Fixed a bug where `similarities.nmslib`'s `most_similar()` returned correct neighbours but with incorrect similarities.
- Follow-up work on the `similarities.nmslib` module, from Prepare gensim 3.8.3 #2820 (comment).
- Added `similarities.nmslib` to our API documentation – it was missing before.
- Renamed `similarities.index` to the more appropriate `similarities.annoy`.
- Fixed the Annoy deprecation warning: "The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly."
- Removed `six` in places I touched (re. "removal of python2 constructs").
- Bumped version from `3.8.1` to `4.0.0.dev0`; fixes "Working from trunk branch 'develop', gensim.__version__ reports as 3.8.1" #2882.

This PR also presented the opportunity to test migrating existing code to post-#2698 code. Notes (CC @gojomo):
- `iter` => `epochs`
- `size` => `vector_size`
- `len(model.wv.vocab)` => `len(model.wv)` in Annoy.
- `len()`: return the total number of admissible keys (= infinity, because fastText supports OOV), or count with ngrams, or just count of "full words"…?
- `'Word2Vec' object has no attribute 'init_sims'`: no more `Word2Vec.init_sims()`. That method still exists in doc2vec, fasttext, keyedvectors etc – not sure why it's missing from word2vec? @gojomo is this an omission or done on purpose?
- `AttributeError: module 'numpy.random' has no attribute 'default_rng'`. `setup.py` says >=1.11.3 ought to be enough – CC @mpenkov is this expected? I thought we're testing against the oldest supported version of our dependencies.
- `No module named 'numpy.testing.decorators'`. Fixed by upgrading scipy too, to `1.5.2`.

We should cover (at least) those in our migration guide.
The list is likely not complete; it's only what I needed to successfully run our `run_annoy.py` tutorial & rebuild the sphinx docs.
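A hedged before/after sketch of the renames from the notes above; the 3.x calls are shown as comments, and `min_count=1` is only there so the toy corpus survives pruning:

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["migration", "guide"]]

# gensim 3.x:
#   model = Word2Vec(sentences, size=100, iter=5, min_count=1)
#   n_keys = len(model.wv.vocab)

# gensim 4.0.0.dev0:
model = Word2Vec(sentences, vector_size=100, epochs=5, min_count=1)
n_keys = len(model.wv)
```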