[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899
Conversation
```diff
 total_time = 0
 for _ in range(queries):
-    rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv.vocab))]
+    rand_vec = model.wv.vectors_norm[np.random.randint(0, len(model.wv))]
```
Since `vectors_norm` is no longer a persistent cache of all unit-normed vectors, but rather (for backward compatibility) re-calculated every time it's requested, this approach will now involve a lot of redundant recalculation of the whole array. Options: (a) grab it once before the loop; (b) adapt to use `get_vector(random_index, use_norm=True)`.
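A minimal sketch of the two options, not code from this PR. It assumes a trained `model`, a `queries` count as in the benchmark snippet above, and the 4.0-dev `index_to_key` attribute for mapping integer slots back to keys (whether `get_vector()` accepts a raw integer is questioned later in this thread, so option (b) goes through a key here):

```python
import numpy as np

# (a) grab the unit-normed array once, before the loop
normed = model.wv.vectors_norm  # recomputed only this one time
for _ in range(queries):
    rand_vec = normed[np.random.randint(0, len(model.wv))]

# (b) norm only the single requested vector, per lookup
for _ in range(queries):
    key = model.wv.index_to_key[np.random.randint(0, len(model.wv))]
    rand_vec = model.wv.get_vector(key, use_norm=True)
```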
Oh that's sneaky. Isn't it better to fail outright (breaking API), rather than do an expensive operation silently, where existing code expects a cheap dict lookup?
Or, if we want to keep backward compatibility, dynamically norm just the one requested vector… I haven't checked the code, but why would we recalculate the whole array for a single `__getitem__`?
We're not recalculating for any specific request of a single vector - that does the minimum, and would be a quickie option ((b) above) to improve this code.

But if a caller asks for `vectors_norm`, it's now calculated & returned (without being cached). Arguably, `vectors_norm` was barely even part of the public API, as it wasn't guaranteed to even exist if a caller hadn't triggered `init_sims()` yet. I wouldn't expect the problematic pattern here, requesting `vectors_norm` repeatedly, to be very common, especially when callers who just need single vectors have the `get_vector(..., use_norm=True)` option.

I suppose we could (a) cache it once requested, to be more like the old behavior, but then we have to add more logic for properly handling other invalidations of this cache elsewhere, and explicitly discarding it at times (like serialization); or (b) eliminate the backward-compatible `@property` getter, either leaving it absent (to generate a cryptic error) or stubbing in an explanatory error message. In that latter case, a new distinctly-named utility method could be added, `calculate_unit_normed_vectors()`, that avoids creating the impression this is a simple access of an already-existing property.
I'm in favour of removing `vectors_norm` = hard fail with a clear message (again, noob-friendly = a prescription for "what to do next").

-1 on pretending we're a simple dict when there's a substantial calculation happening behind the scenes.

Interesting that I didn't notice this during the #2698 review; I wonder how many other subtle points around deprecations / compatibility I missed. The migration guide + testing will be essential.
Sure, `.vectors_norm` should be a hard fail with a helpful error message, similar to the one for `init_sims()`. I can add these for both as part of other touch-up work, such as that in #2892.
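A minimal sketch of what such a hard-fail property could look like; the message wording here is illustrative, not the text that ultimately landed in gensim:

```python
class KeyedVectors:
    @property
    def vectors_norm(self):
        # Hard fail with a prescription for "what to do next" (illustrative wording)
        raise AttributeError(
            "vectors_norm was removed in Gensim 4.0.0. "
            "Use get_vector(key, use_norm=True) for one unit-normed vector, "
            "or get_normed_vectors() for the full normed array."
        )
```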
Alright. I'll add some tentative fix here (so I don't merge "known to be bad" code), merge this PR, and then let's continue with this in another PR, such as #2892.
@gojomo pushed now. Let me know if OK and I'll merge this PR, to be continued elsewhere.
The reported
(Simulating backward-compatible behavior would be very hard & inefficient, & after the bewildering mess of different things all called 'vocab' in the prior refactoring, I preferred as little reuse of that general term as possible.)
Per my source-line-targeted-comments, I think
The numpy performance regression in the legacy PRNG led me to update the code to use their new `default_rng()` Generator API. I personally feel it's OK (and even beneficial) for fresh gensim releases to require fairly-recent releases of those libraries that are more 'central'/large/many-contributor, like numpy & scipy.
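For context, a small sketch of the legacy module-level PRNG versus the newer Generator API (available since numpy 1.17); the exact call sites changed in this PR aren't shown here:

```python
import numpy as np

# legacy global PRNG (module-level functions backed by RandomState)
legacy_draw = np.random.randint(0, 10)

# newer Generator API, with better performance characteristics
rng = np.random.default_rng(seed=42)
new_draw = rng.integers(0, 10)
```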
An aside on approximate-nearest-neighbor libraries: there's a recent new option from Google Research – ScaNN – which they show as far outperforming ANNOY & FB Faiss.
Yeah, thanks. I think I created a Gensim ticket for ScaNN when it came out, although I cannot seem to find it now.
Yes, upgrading numpy is not a problem. I'd expect our tests to fail, though. I'm pretty sure we were testing against the oldest supported libs explicitly, but with all the recent changes to the testing framework (Azure etc.) it's possible that fell through the cracks. We should put it back if possible, @mpenkov WDYT?
One more idea: once we have the migration guide in place, use it on the other tutorials (beyond `run_annoy.py`). The expectation is that all should run smoothly, with no errors or deprecation warnings: a stress-test for our migration guide. The added value is that we refresh the auto-generated notebooks & rst, because I expect many of the outputs & timings there will change with 4.0.0.
(Force-pushed from 05282f5 to 0ee9008.)
@mpenkov I'll go ahead and merge this once #2899 (comment) and #2899 (comment) are resolved, because I'm worried about getting out of sync again. But please still review when you have a moment.
```python
)

def get_normed_vectors(self):
    # TODO: what's the way for users to get from a matrix index (integer) to the
```
@gojomo FYI
`get_vector(key_or_int, use_norm=True)` returns an individual normed vector, by key or raw integer slot. If for some reason a user wants a KeyedVectors that only has the normed vectors, the method `unit_normalize_all()` (with a suitable warning about the destructiveness of this operation) is available below. (That thus matches the previous `init_sims(replace=True)` behavior, for those that really need it, but I believe most users of that call were just trying to save memory in a way that's no longer necessary.)

We could add a convenience method for creating a separate new KV instance with unit-normed vectors, but as that goes above-and-beyond prior behavior, & I don't offhand know anyplace that's necessary, I'd rather not bloat this class with more "maybe someone somewhere needs it" functions. (I'd much rather move functions out, first, as per concerns described in #2873.) And if this were a real need, the right way to support it would likely be a generic deep-copy of the KV followed by a `unit_normalize_all()` on the new KV.
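A two-line sketch of that last suggestion, assuming a loaded `model` and the `unit_normalize_all()` method this PR adds:

```python
import copy

normed_kv = copy.deepcopy(model.wv)  # leave the original raw vectors intact
normed_kv.unit_normalize_all()       # destructive, but only on the copy
```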
Yes. The idea here was to fuse `unit_normalize_all()` with `get_normed_vectors()`: have a single function which returns a new KV instance, normalized. I think that's cleaner than several in-situ modifiers/getters (more functional, less state), but it's true the use-case is quite hypothetical.
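A hypothetical sketch of that fused shape, written as a standalone function for clarity; names and internals are illustrative only, not the code in this PR, and it assumes the KV object stores its raw array in `.vectors`:

```python
import copy
import numpy as np

def get_normed_vectors(kv):
    """Return a new KeyedVectors-like object with unit-normed vectors; kv is untouched."""
    new_kv = copy.deepcopy(kv)
    norms = np.linalg.norm(new_kv.vectors, axis=1, keepdims=True)
    new_kv.vectors = new_kv.vectors / norms
    return new_kv
```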
> get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot

Hmm, I don't see that in the code. How does the "raw integer slot" option work?
I removed the `int` option from the docstring, because I don't think it exists.
```diff
@@ -1454,7 +1464,7 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None,
         if not (i == val):
             break
         index_id_count += 1
-    keys_to_write = chain(range(0, index_id_count), store_order_vocab_keys)
+    keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys)
```
Elsewhere, I've followed a growing convention of treating `itertools` as a package with a 'well known standard abbreviation' of `it`, so `import itertools as it`. I'd prefer it if gensim's project guide specifically recommended this for a small set of packages: `np`, `pd`, `it`, and maybe `pkl`, as it matches wider practices & simplifies import management. (That's especially true for numpy & itertools, where the prior practice of naming exactly which functions were used meant a jump-around typing burden of changing the imports every time some single function was used or removed from use.) But whatever project style can be unambiguously declared somewhere authoritative is OK.
No preference from me. This one I changed merely because I saw two side-by-side imports, `import itertools; from itertools import chain`, which looked weird, so I merged them into a single `import itertools`.
```diff
@@ -1264,14 +1263,14 @@ def save(self, *args, **kwargs):
         kwargs['ignore'] = kwargs.get('ignore', ignore_attrs)
         super(FastTextKeyedVectors, self).save(*args, **kwargs)

-    def get_vector(self, word, use_norm=False):
+    def get_vector(self, word, norm=False):
```
`use_norm` was an unfortunate choice of parameter-name which I believe dated back to #1777, & I considered changing it, but didn't want to have to change all uses elsewhere & add another compatibility burden on users. But if you're entertaining breaking changes:

(1) not sure `norm` is a good name either - it makes it sound like you're requesting the vector's (scalar) norm/length;

(2) even supporting this via an optional extra parameter on a dual-use method strikes me as silly/opaque, too. Instead, have `__getitem__` be the implementation of the default behavior, and add a separate explicitly-named method for the unit-normed vector, like perhaps `get_unit_vector(key_or_int)`, with no optional parameters;

(3) more aggressively, maybe let users toggle the model to give either unit-vecs or raw-vecs as the default `__getitem__` behavior, but have both `get_unit_vector()` and `get_raw_vector()` as alt options for when a caller needs to specify. Then the desire you've expressed elsewhere to have unit-normed access via plain dict-bracket accesses is available when toggled on, but the raw data is never destroyed, preventing a class of errors/regrets. I'd put this thought aside as a somewhat complicated "maybe in the future" idea, but if you're OK to break the `get_vector()` signature & want to prioritize unit-normed access, maybe it's worth making that change now.
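A hypothetical sketch of option (3); every name here is illustrative, not an API this PR adds:

```python
import numpy as np

class ToggleableKV:
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)  # raw data, never destroyed
        self.unit_default = False  # the proposed toggle for __getitem__ behavior

    def get_raw_vector(self, index):
        return self.vectors[index]

    def get_unit_vector(self, index):
        vec = self.vectors[index]
        return vec / np.linalg.norm(vec)

    def __getitem__(self, index):
        # plain bracket access follows the toggle; explicit methods always available
        if self.unit_default:
            return self.get_unit_vector(index)
        return self.get_raw_vector(index)
```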
TBH I find the idea of producing a new standalone (normed) KV instance cleaner than state toggles.

I'll merge here lest we accrue a monster PR again; let's revisit this in one of your other PRs.
- failing with weird errors on py3.7+3.8, see https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862
@mpenkov some of the tests unrelated to this PR (I think) started failing: https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862 I disabled them for now and opened a new ticket to look into it: #2908. Also, how do you check the Circle CI results? Clicking on the "Details" link here brings up a page which asks me to sign in with GitHub, with read & write access to my repos, which is not happening. How do you view the actual CI results?
Yes.

No, that's not the case. When you
@piskvorky For me, I just click on "Details" and I can see the results for the relevant build.
Weird. Somebody else will have to inspect the Circle CI results then, I cannot access them.
Supersedes #2821.

- Fixed a bug where `similarities.nmslib`'s `most_similar()` returned correct neighbours but with incorrect similarities.
- Follow-up work on the `similarities.nmslib` module, from Prepare gensim 3.8.3 #2820 (comment).
- Added `similarities.nmslib` to our API documentation – it was missing before.
- Renamed `similarities.index` to the more appropriate `similarities.annoy`.
- Fixed the Annoy deprecation warning: "The default argument for metric will be removed in future version of Annoy. Please pass metric='angular' explicitly."
- Removed `six` in places I touched (re. "removal of python2 constructs").
- Bumped version from `3.8.1` to `4.0.0.dev0`; fixes "Working from trunk branch 'develop', gensim.__version__ reports as 3.8.1" #2882.

This PR also presented the opportunity to test migrating existing code to post-#2698 code. Notes (CC @gojomo):
- `iter` => `epochs`
- `size` => `vector_size`
- `len(model.wv.vocab)` => `len(model.wv)` in Annoy.
- `len()`: return the total number of admissible keys (= infinity, because fastText supports OOV), or count with ngrams, or just count of "full words"…?
- `'Word2Vec' object has no attribute 'init_sims'`: no more `Word2Vec.init_sims()`. That method still exists in doc2vec, fasttext, keyedvectors etc – not sure why it's missing from word2vec? @gojomo is this an omission or done on purpose?
- `AttributeError: module 'numpy.random' has no attribute 'default_rng'`. `setup.py` says >=1.11.3 ought to be enough – CC @mpenkov is this expected? I thought we're testing against the oldest supported version of our dependencies.
- `No module named 'numpy.testing.decorators'`. Fixed by upgrading scipy too, to `1.5.2`.

We should cover (at least) those in our migration guide.
The list is likely not complete; it's only what I needed to successfully run our `run_annoy.py` tutorial & rebuild the sphinx docs.
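A hedged before/after sketch of the renames from the notes above; the 3.x calls are shown as comments, and `min_count=1` is only there so the toy corpus survives pruning:

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["migration", "guide"]]

# gensim 3.x:
#   model = Word2Vec(sentences, size=100, iter=5, min_count=1)
#   n_keys = len(model.wv.vocab)

# gensim 4.0.0.dev0:
model = Word2Vec(sentences, vector_size=100, epochs=5, min_count=1)
n_keys = len(model.wv)
```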