Fix bug where saved Phrases model did not load its connector_words #3116
Conversation
Sorry I didn't understand the motivation for this PR. What is the problem, why is this needed? Can you include a minimal reproducing example?
A fully reproducible example would need training data as well, but here is a short illustration of the bug: if you train a bigram and a trigram model with connector words and apply them to the tokens ["united", "states", "of", "america"], the output is ['united_states_of_america'].
But if you save these models, load them back, and apply them to the same tokens, the full phrase is no longer produced, because the loaded models' connector_words comes back empty.
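The failure mode can be sketched without gensim at all. `PhrasesLike` below is a hypothetical stand-in for the real `Phrases` class: it models only the `connector_words` attribute and a pickle round trip, not the phrase-scoring logic.

```python
import os
import pickle
import tempfile

class PhrasesLike:
    """Hypothetical stand-in for gensim's Phrases; models only the
    connector_words attribute, not phrase scoring."""

    def __init__(self, connector_words=frozenset()):
        self.connector_words = frozenset(connector_words)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        obj = cls()
        with open(path, "rb") as f:
            obj.__dict__.update(pickle.load(f))
        # Buggy backward-compatibility shim: unconditionally resetting
        # the attribute discards the value that was just loaded.
        obj.connector_words = frozenset()
        return obj

model = PhrasesLike(connector_words={"of"})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phrases.pkl")
    model.save(path)
    loaded = PhrasesLike.load(path)

print(model.connector_words)   # frozenset({'of'})
print(loaded.connector_words)  # frozenset()
```

With `connector_words` empty after loading, phrase detection no longer treats "of" as a connector, so a phrase like `united_states_of_america` can no longer be formed.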
Please add a unit test that fails before this PR and succeeds after. That should demonstrate the issue clearly.
Do you need the unit test changes in the same commit, or in a separate commit on top of this one?
Thanks a lot! One minor code style fix and we're good to go.
gensim/test/test_phrases.py (outdated)

```diff
@@ -321,9 +321,10 @@ def test_save_load_custom_scorer(self):
     def test_save_load(self):
         """Test saving and loading a Phrases object."""
         with temporary_file("test.pkl") as fpath:
-            bigram = Phrases(self.sentences, min_count=1, threshold=1)
+            bigram = Phrases(self.sentences, min_count=1, threshold=1, connector_words=self.connector_words)
```
Why are we changing the behavior of an existing, passing test?
Should we not be adding a new test case here?
I wanted to create a separate test, but after reading the whole file gensim/test/test_phrases.py, I found that this existing test function, test_save_load, should ideally already be catching this bug. Given its name, I assumed it was designed to cover saving and loading of the whole model, and that it should not pass if saved connector words are not loaded. Adding a separate test would increase the line count and the test execution time. Another option would be to remove this test altogether, i.e. not test saving and loading of connector words at all. Let me know and I can change the PR.
> I found that this test function i.e "test_save_load" should ideally be capturing this bug

Not necessarily. The bug is loading an old model, so the new test could equally be called test_save_load_old_model.

> Adding it separately will increase the number of lines of code and will increase the test execution time.

The test is less than 10 lines long, so duplicating it isn't a huge issue. That leaves execution time: how long does the test take to run?

> Another thought can be to remove this test altogether i.e not testing loading & saving of connector words.

What would be the benefit of removing this test? Isn't it testing a valid case?
I think there is a misunderstanding on your first point: this bug is not about loading an old model. It occurs when we save and load the model within the same version. So test_save_load_old_model would not describe the test for this bug; a name like test_save_load_with_connector_words would.
I agree with the rest of your points.
It will take about the same time as test_save_load.
Yes, it's testing a valid case.
> a name like "test_save_load_with_connector_words" can describe this bug
Sure, if that's a more suitable name, please use that instead.
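Under that name, the regression test could be shaped like the sketch below. It exercises a toy pickle round trip with a hypothetical `load_state` helper rather than gensim's actual `Phrases.load`, but the structure (save with connector words, load, assert they survive) matches what is being requested.

```python
import os
import pickle
import tempfile
import unittest

def load_state(path):
    """Toy loader; the real logic lives in gensim's Phrases.load."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    # Default only when the key is genuinely absent (old saves),
    # never when a value was actually stored.
    state.setdefault("connector_words", frozenset())
    return state

class TestSaveLoadWithConnectorWords(unittest.TestCase):
    def test_save_load_with_connector_words(self):
        words = frozenset({"of", "the"})
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "test.pkl")
            with open(path, "wb") as f:
                pickle.dump({"connector_words": words}, f)
            state = load_state(path)
        # Fails under the buggy loader, which resets the value;
        # passes once the default is applied only when missing.
        self.assertEqual(state["connector_words"], words)

unittest.main(argv=["prog"], exit=False)
```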
Updated the PR with a new test method and no behaviour change to the existing, passing test.
… model of version >= 4 Added tests for asserting persistence of phrases connector_words
```diff
@@ -5,6 +5,7 @@ Changes

 - LsiModel: Only log top words that actually exist in the dictionary (PR [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091), [@kmurphy4](https://github.com/kmurphy4))
 - [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness))
+- fix bug when loading saved Phrases model (PR [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116), [@aloknayak29](https://github.com/aloknayak29))
```
Don't we have a script that spits out a formatted list of all merged PRs automatically, during a release?
We do, but I've been keeping track of them as we merge them in smart_open, and I've found that way to be more manageable (less risk of missing PRs, and easier to see what has changed since the last release), so I'm trying out the same approach with gensim.
It's also easier to deal with this while the PR is still in my short term memory. When I'm doing the release, that memory is long gone.
OK, no problem. One advantage of the script is consistency in formatting – see for example the change here :)
Yeah, I've got a script that handles these changes, but multiple versions of it were floating around my work/home machines. It's now committed to the repo, so once we nail down the format of the entries, the formatting should no longer be an issue.
I was talking about this script: https://github.com/RaRe-Technologies/gensim/blob/develop/release/generate_changelog.py
Yes. Once the two scripts output stuff in the same format, the consistency problem will be solved, right?
I may be misunderstanding, but why do we need two scripts? generate_changelog.py seems to be doing what we need: generate a formatted CHANGELOG from all new PRs since the last release. When / why would we use annotate_pr.py?
I'm using annotate_pr.py for each PR (since a few weeks ago). I'll use generate_changelog.py for the next release, and keep whichever ends up being the most convenient. For now it seems to be the former, but time will tell.
Thank you @aloknayak29!
Bug description: if you save a trained Phrases model in gensim 4.0.0 or 4.0.1 with non-empty connector_words, and then load the saved model in that same version, the loaded model's connector_words is wrongly assigned an empty frozenset. This is a functional bug: some words will not be grouped into n-grams.
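The class of fix is to apply the empty-frozenset default only when the attribute is genuinely absent (as with models saved before connector_words existed), never unconditionally. A minimal sketch, with hypothetical names (`Model`, `finish_load`) standing in for gensim's internals:

```python
class Model:
    """Hypothetical container; stands in for an unpickled Phrases object."""

def finish_load(model):
    # Correct backward-compatibility step: only saves that lack the
    # attribute entirely receive the empty-frozenset default, so a
    # value that was actually saved is never clobbered.
    if not hasattr(model, "connector_words"):
        model.connector_words = frozenset()
    return model

old_style = Model()                            # simulates a save with no connector_words attribute
new_style = Model()
new_style.connector_words = frozenset({"of"})  # simulates a save with connector words

print(finish_load(old_style).connector_words)  # frozenset()
print(finish_load(new_style).connector_words)  # frozenset({'of'})
```

The buggy version assigned `model.connector_words = frozenset()` without the `hasattr` guard, which is why a freshly saved 4.0.x model lost its connector words on load.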