-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Fix backward incompatibility due to random_state
#1327
[WIP] Fix backward incompatibility due to random_state
#1327
Conversation
random_state
random_state
Please add unit tests |
@tmylk Yes. Working on adding tests. |
@tmylk @menshikh-iv I have added unit tests for ensuring that backward compatibility is not broken due to In the following 2 commits, I also had to add checks And the last commit is failing for Python 3.5 because of the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic changes needed
gensim/models/ldamodel.py
Outdated
@@ -1004,6 +1004,11 @@ def save(self, fname, ignore=['state', 'dispatcher'], separately=None, *args, ** | |||
""" | |||
if self.state is not None: | |||
self.state.save(utils.smart_extension(fname, '.state'), *args, **kwargs) | |||
|
|||
# Save 'random_state' separately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of saving it separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk I think It's better for case "file don't exist" (and more flexible)
gensim/models/ldamodel.py
Outdated
else: | ||
result.id2word = None | ||
logging.warning("random_state not stored on disk so using default value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if random_state is not stored that means that it is a new version of the model and it is going to be loaded in the main pickle load. Please change the logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk Your suggestion is a bit different from the earlier discussion on the issue. Hence, I want to make sure I understand the desired solution before making the changes.
If I understand it correctly, what you are suggesting is :
- If there is indeed a file with the extension
.random_state
present on disk, this means that the model was saved using a pre-0.13.2 version of Gensim. So we use this file to setresult.random_state
at the time of loading. - However, if there is no such file present on disk, then this means that the model was saved using a post-0.13.2 version of Gensim and thus
result.random_state
got set at the time of the main pickle load. So in this case, we don't need to do anything else.
But for models saved using a pre-0.13.2 version of Gensim, there was no .random_state
file created at the time of saving the model. So while loading such a model from disk, where would the .random_state
file come from in this case? Is the user responsible for creating this file explicitly in such a case? If this is true, then I believe we don't need to make any changes in the save
function for LdaModel at all and we just need to change the load
function.
Please correct me if I am wrong or missing something here. Otherwise, if this is indeed what we need, I could go ahead and make the appropriate changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a check that this indeed happened: thus result.random_state got set at the time of the main pickle load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk I believe I have understood the solution that we want. However, I have a minor doubt about where would the .random_state
file come from? Is it that the user is responsible for creating it explicitly always and we (i.e. from within the save
function) need not create it ever?
If this is true, then in case we are loading a pre-0.13.2 model and no .random_state
file exists on disk, then should we set result.random_state
using a default value like get_random_state(None)
with a logger.warning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk Could you please respond to this query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it that the user is responsible for creating it explicitly always
I think the user should not think about additional files, he saves the whole model
If this is true, then in case we are loading a pre-0.13.2 model and no .random_state file exists on disk, then should we set result.random_state using a default value like get_random_state(None) with a logger.warning?
I think it's correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple way to do this is check that random_state was loaded, if not - you set result.random_state using a default value like get_random_state(None) with a logger.warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv Then there is no .random_state
file involved at all, correct? To summarize, the solution is :
- First, load the entire model.
- Check if
result.random_state
was set or not. For the newer (post 0.13.2) models, it would have been set through the main pickle load. For the older (pre 0.13.2) models,result.random_state
would not be set through the main pickle load so we setresult.random_state
toget_random_state(None)
.
And in this solution, we don't need to make any changes in the save
function, just the load
function.
gensim/test/test_ldamodel.py
Outdated
self.assertTrue(isinstance(i[1], six.string_types)) | ||
|
||
def testRandomStateBackwardCompatibility(self): | ||
# load a model saved using a pre-0.13.2 version of Gensim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is identical to the previous one. just one test is enough that checks all the fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I thought since these were two different issues so we'd want to put separate tests to verify both are resolved. I'll make the update according to your suggestion and remove the earlier test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to avoid code duplication
could you please test that this exact test fails in the version of the code prior to your changes? |
@tmylk The last commit now shows that all checks have passed. I guess it's sorted for now then. |
To validate this as a fix we need to see what changed. This is how Test Driven Development works for any issue:
Could you please add these 2 commits to this PR? |
@tmylk Got it. Thanks a lot for the comprehensive feedback. :) Edit : I have added the commit having only unit tests and test data, which (as it should) is failing the test which checks |
@tmylk @menshikh-iv I have made changes to the code as well. One of the tests is failing for Python 2.7 in the last commit because of Edit : The last commit now passes all the tests. |
…ents1 Added comments explaining logic for changes in PR #1327
This PR fixes issue #1082.