Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gensim doesn't allow changing negative sampling distribution parameter #2090

Closed
fernandocamargoai opened this issue Jun 13, 2018 · 6 comments · Fixed by #2093
Closed

Gensim doesn't allow changing negative sampling distribution parameter #2090

fernandocamargoai opened this issue Jun 13, 2018 · 6 comments · Fixed by #2093
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@fernandocamargoai
Copy link
Contributor

Description

Like pointed out in the following article, the negative sampling distribution parameter, which is fixed as 0.75 in Gensim, is worth tuning, specially for other applications beyond NLP. So, I'd be very helpful to make it a parameter for the Word2Vec, instead of fixing it.

https://arxiv.org/abs/1804.04212

@gojomo
Copy link
Collaborator

gojomo commented Jun 13, 2018

Thanks for the pointer to that interesting paper! This sounds like an small & low-risk/high-benefit change, so a PR would be welcome.

@gojomo gojomo added feature Issue described a new feature difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Jun 13, 2018
@fernandocamargoai
Copy link
Contributor Author

fernandocamargoai commented Jun 14, 2018

Hello, @gojomo. I'm preparing a PR, but I'm having some trouble with some tests falling. The following tests failed:

  • TestWord2VecModel.testLoadOldModel
  • TestDoc2VecModel.testLoadOldModel

Both had the same error:

gensim/models/word2vec.py:998: in load
    return load_old_word2vec(*args, **kwargs)
gensim/models/deprecated/word2vec.py:153: in load_old_word2vec
    old_model = Word2Vec.load(*args, **kwargs)
gensim/models/deprecated/word2vec.py:1617: in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
gensim/models/deprecated/old_saveload.py:87: in load
    obj = unpickle(fname)
E               AttributeError: 'module' object has no attribute 'Word2VecKeyedVectors'

I guess it's because of the new attribute I've placed in the Word2Vec. What am I supposed to do to make it compatible with old saved models?

Here is my branch:
https://github.com/fernandocamargoti/gensim/tree/feature/negative_sampling_distribution_parameter

@gojomo
Copy link
Collaborator

gojomo commented Jun 14, 2018

Feel free to create the PR from your branch to gensim/develop even before it's fully ready - that'll make it easier for others to see the changes, and to see the unit-test results in the project's own continuous-integration setup. You can mark it "[WIP]" to be clear it's a Work-In-Progress.

In general, if a class gets a new (necessary) parameter, older saved (python pickled) objects of that type will need to be patched-up, upon load(), to include a workable value (such as the old fixed default). See for example:

https://github.com/RaRe-Technologies/gensim/blob/e50a0574e70b4c8808b16b50f140f21c4bb6352e/gensim/models/base_any2vec.py#L628

The various "if" fixups there either rebuild things that intentionally weren't saved (because they can be fully rebuilt from other saved state), or path-up missing state that newer source requires and older source didn't save. I'm not sure why missing this parameter creates exactly the error you've reported – maybe an earlier error is being silently suppressed – but one of the load() methods like that one is the right place to ensure backward-compatibility for a new parameter.

Without a full review yet, a few other thoughts:

  • the cumulative-distribution table happens to be an implementation detail, while the exponentiation-adjustment is inherent to the negative-sampling algorithm - so the parameter name needn't refer to the 'cum_table' - I'd suggest it be called ns_exponentinstead

  • since negative-sampling is also a mode of other sibling models, and the 'negative' parameter controlling it is first defined on the shared superclass BaseWordEmbeddingsModel in base_any2vec.py, that's where this parameter should also be primarily defined/managed

@fernandocamargoai
Copy link
Contributor Author

@gojomo
Thank you for your feedback. Is there a way for me to run just a specific unit test?

@gojomo
Copy link
Collaborator

gojomo commented Jun 15, 2018

There definitely is, I don't have the syntax handy, but IIRC it involves invoking the test_*.py file itself with command-line arguments naming the exact test(s)/test-suites to run.

@fernandocamargoai
Copy link
Contributor Author

@gojomo, I fixed the problems and created the PR.

menshikh-iv pushed a commit that referenced this issue Jun 22, 2018
…tion for `*2vec` models. Fix #2090 (#2093)

* Adding ns_exponent parameter to control the negative sampling distribution.

* Fixed a code style problem.

* Updated the documentation of the ns_exponent parameter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
2 participants