Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where saved Phrases model did not load its connector_words #3116

Merged
merged 4 commits into from
May 8, 2021

Conversation

aloknayak29
Copy link
Contributor

Bug description: If we have saved the trained phrases model in this version i.e gensim version 4.0.0 or 4.0.1, while giving non empty connector_words. And If we try to load the saved model in this version itself. Then the connector_words of the loaded phrases model will be wrongly assigned empty frozen set. This is a functional bug and will result in some of the words not getting grouped in ngram

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't understand the motivation for this PR. What is the problem, why is this needed? Can you include a minimal reproducing example?

gensim/models/phrases.py Outdated Show resolved Hide resolved
@aloknayak29
Copy link
Contributor Author

aloknayak29 commented Apr 14, 2021

A reproducible example will include training data as well. Although an short example of bug will be, If you train a bigram and trigram model, and apply them on words like ["united, "states", "of", "america"], then this will output ['united_states_of_america']

trigram_phrases_model[bigram_phrases_model[['united', 'states', 'of', 'america']]]
['united_states_of_america']

But if you save these models and load them and then apply them on words like ["united, "states", "of", "america"], then this will output

['united_states', 'of', 'america']

@piskvorky
Copy link
Owner

Please add a unit test that fails before this PR, and succeeds after. That should demonstrate the issue clearly.

@aloknayak29
Copy link
Contributor Author

Do you need the unit test changes in the same commit or different commit over this?

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! One minor code style fix and we're good to go.

gensim/test/test_phrases.py Outdated Show resolved Hide resolved
@@ -321,9 +321,10 @@ def test_save_load_custom_scorer(self):
def test_save_load(self):
"""Test saving and loading a Phrases object."""
with temporary_file("test.pkl") as fpath:
bigram = Phrases(self.sentences, min_count=1, threshold=1)
bigram = Phrases(self.sentences, min_count=1, threshold=1, connector_words=self.connector_words)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing the behavior of an existing, passing test?

Should we not be adding a new test case here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to create a separate test, But after reading this whole file gensim/test/test_phrases.py, I found that this test function i.e "test_save_load" should ideally be capturing this bug. Give its name "test_save_load" I assumed that this test was designed to capture the saving & loading of the whole model and this test shouldn't pass if saved connector words are not loaded. Adding it separately will increase the number of lines of code and will increase the test execution time. Another thought can be to remove this test altogether i.e not testing loading & saving of connector words. Let me know I can change the PR.

Copy link
Collaborator

@mpenkov mpenkov Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that this test function i.e "test_save_load" should ideally be capturing this bug

Not necessarily. The bug is loading an old model. So the new test could be equally be called test_save_load_old_model.

Adding it separately will increase the number of lines of code and will increase the test execution time.

The test is less than 10 lines long, so duplicating them isn't a huge issue. That leaves execution time.

How long does the test take to run?

Another thought can be to remove this test altogether i.e not testing loading & saving of connector words.

What would be the benefit of removing this test? Isn't it testing a valid case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand your first point. After reading your first point, It seems that you have misunderstood that this bug is loading an old model. This bug occurs when we save and load the model in the same version itself. I mean to say that "test_save_load_old_model" would not describe this bug related test. string like "test_save_load_with_connector_words" can describe this bug
I agree with all of the rest of your points.
It will take the same time as test "test_save_load"
Yes its testing a valid case

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string like "test_save_load_with_connector_words" can describe this bug

Sure, if that's a more suitable name, please use that instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR with new test method and no behaviour changing of an existing, passing test

… model of version >= 4

Added tests for asserting persistence of phrases connector_words
@mpenkov mpenkov changed the title fixed functional bug of saved connector_words not getting loaded, while loading saved phrases model fix bug when loading saved Phrases model May 5, 2021
@@ -5,6 +5,7 @@ Changes

- LsiModel: Only log top words that actually exist in the dictionary (PR [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091), [@kmurphy4](https://github.com/kmurphy4))
- [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness))
- fix bug when loading saved Phrases model (PR [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116), [@aloknayak29](https://github.com/aloknayak29))
Copy link
Owner

@piskvorky piskvorky May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have a script that spits out a formatted list of all merged PRs automatically, during a release?

Copy link
Collaborator

@mpenkov mpenkov May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, but I've been keeping track of them as we merge them in smart_open, and I've found that way to be more manageable (less risk of missing PRs, and easier to see what has changed since the last release), so I'm trying out the same approach with gensim.

It's also easier to deal with this while the PR is still in my short term memory. When I'm doing the release, that memory is long gone.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, no problem. One advantage of the script is consistency in formatting – see for example the change here :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've got a script that handles these changes, but multiple versions of it were floating around my work/home machines. It's now committed to the repo, so once we nail down the format of the entries, the formatting should no longer be an issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Once the two scripts output stuff in the same format, the consistency problem will be solved, right?

Copy link
Owner

@piskvorky piskvorky May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding, but why do we need two scripts? generate_changelog.py seems to be doing what we need = generate formatted CHANGELOG from all new PRs since the last release. When / why would we use annotate_pr.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using annotate_pr.py for each PR (since a few week ago). I'll use generate_changelog.py for the next release. I'll keep whichever ends up being the most convenient - for now, it seems to be the former, but time will tell.

@piskvorky piskvorky changed the title fix bug when loading saved Phrases model Fix bug where saved Phrases model did not load its connector_words May 5, 2021
@mpenkov mpenkov merged commit 351456b into piskvorky:develop May 8, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented May 8, 2021

Thank you @aloknayak29 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants