Gensim doesn't allow changing negative sampling distribution parameter #2090

fernandocamargoai · 2018-06-13T20:46:47Z

Description

Like pointed out in the following article, the negative sampling distribution parameter, which is fixed as 0.75 in Gensim, is worth tuning, specially for other applications beyond NLP. So, I'd be very helpful to make it a parameter for the Word2Vec, instead of fixing it.

https://arxiv.org/abs/1804.04212

gojomo · 2018-06-13T22:19:02Z

Thanks for the pointer to that interesting paper! This sounds like an small & low-risk/high-benefit change, so a PR would be welcome.

fernandocamargoai · 2018-06-14T17:46:31Z

Hello, @gojomo. I'm preparing a PR, but I'm having some trouble with some tests falling. The following tests failed:

TestWord2VecModel.testLoadOldModel
TestDoc2VecModel.testLoadOldModel

Both had the same error:

gensim/models/word2vec.py:998: in load
    return load_old_word2vec(*args, **kwargs)
gensim/models/deprecated/word2vec.py:153: in load_old_word2vec
    old_model = Word2Vec.load(*args, **kwargs)
gensim/models/deprecated/word2vec.py:1617: in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
gensim/models/deprecated/old_saveload.py:87: in load
    obj = unpickle(fname)
E               AttributeError: 'module' object has no attribute 'Word2VecKeyedVectors'

I guess it's because of the new attribute I've placed in the Word2Vec. What am I supposed to do to make it compatible with old saved models?

Here is my branch:
https://github.com/fernandocamargoti/gensim/tree/feature/negative_sampling_distribution_parameter

gojomo · 2018-06-14T20:11:30Z

Feel free to create the PR from your branch to gensim/develop even before it's fully ready - that'll make it easier for others to see the changes, and to see the unit-test results in the project's own continuous-integration setup. You can mark it "[WIP]" to be clear it's a Work-In-Progress.

In general, if a class gets a new (necessary) parameter, older saved (python pickled) objects of that type will need to be patched-up, upon load(), to include a workable value (such as the old fixed default). See for example:

https://github.com/RaRe-Technologies/gensim/blob/e50a0574e70b4c8808b16b50f140f21c4bb6352e/gensim/models/base_any2vec.py#L628

The various "if" fixups there either rebuild things that intentionally weren't saved (because they can be fully rebuilt from other saved state), or path-up missing state that newer source requires and older source didn't save. I'm not sure why missing this parameter creates exactly the error you've reported – maybe an earlier error is being silently suppressed – but one of the load() methods like that one is the right place to ensure backward-compatibility for a new parameter.

Without a full review yet, a few other thoughts:

the cumulative-distribution table happens to be an implementation detail, while the exponentiation-adjustment is inherent to the negative-sampling algorithm - so the parameter name needn't refer to the 'cum_table' - I'd suggest it be called ns_exponentinstead
since negative-sampling is also a mode of other sibling models, and the 'negative' parameter controlling it is first defined on the shared superclass BaseWordEmbeddingsModel in base_any2vec.py, that's where this parameter should also be primarily defined/managed

fernandocamargoai · 2018-06-14T20:33:27Z

@gojomo
Thank you for your feedback. Is there a way for me to run just a specific unit test?

gojomo · 2018-06-15T02:17:03Z

There definitely is, I don't have the syntax handy, but IIRC it involves invoking the test_*.py file itself with command-line arguments naming the exact test(s)/test-suites to run.

fernandocamargoai · 2018-06-15T19:36:56Z

@gojomo, I fixed the problems and created the PR.

…tion for `*2vec` models. Fix #2090 (#2093) * Adding ns_exponent parameter to control the negative sampling distribution. * Fixed a code style problem. * Updated the documentation of the ns_exponent parameter.

gojomo added feature Issue described a new feature difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Jun 13, 2018

fernandocamargoai mentioned this issue Jun 15, 2018

Add ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090 #2093

Merged

menshikh-iv closed this as completed in #2093 Jun 22, 2018

gojomo mentioned this issue Oct 20, 2020

Faster evaluation metrics (baked into the library?) #2986

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gensim doesn't allow changing negative sampling distribution parameter #2090

Gensim doesn't allow changing negative sampling distribution parameter #2090

fernandocamargoai commented Jun 13, 2018

gojomo commented Jun 13, 2018

fernandocamargoai commented Jun 14, 2018 •

edited

Loading

gojomo commented Jun 14, 2018

fernandocamargoai commented Jun 14, 2018

gojomo commented Jun 15, 2018

fernandocamargoai commented Jun 15, 2018

Gensim doesn't allow changing negative sampling distribution parameter #2090

Gensim doesn't allow changing negative sampling distribution parameter #2090

Comments

fernandocamargoai commented Jun 13, 2018

Description

gojomo commented Jun 13, 2018

fernandocamargoai commented Jun 14, 2018 • edited Loading

gojomo commented Jun 14, 2018

fernandocamargoai commented Jun 14, 2018

gojomo commented Jun 15, 2018

fernandocamargoai commented Jun 15, 2018

fernandocamargoai commented Jun 14, 2018 •

edited

Loading