Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix backward incompatibility due to random_state #1327

Merged
merged 11 commits into from
May 25, 2017

Conversation

chinmayapancholi13
Copy link
Contributor

This PR fixes issue #1082.

@chinmayapancholi13 chinmayapancholi13 changed the title Fix backward incompatibility due to random_state [WIP] Fix backward incompatibility due to random_state May 16, 2017
@tmylk
Copy link
Contributor

tmylk commented May 16, 2017

Please add unit tests

@chinmayapancholi13
Copy link
Contributor Author

@tmylk Yes. Working on adding tests.

@chinmayapancholi13
Copy link
Contributor Author

chinmayapancholi13 commented May 16, 2017

@tmylk @menshikh-iv I have added unit tests for ensuring that backward compatibility is not broken due to id2word and random_state in the first 2 commits in this PR.

In the following 2 commits, I also had to add checks isinstance(self.alpha, np.ndarray) and isinstance(self.eta, np.ndarray) otherwise this was leading to an error when self.alpha or self.eta were of type float (since we are checking len(self.alpha.shape) and len(self.eta.shape) here). Particularly, this was happening when I was trying to again save the pre-0.13.2 model after loading.

And the last commit is failing for Python 3.5 because of the test testPasses in gensim.test.test_ldamodel.TestLdaMulticore due to an assertion-error while comparing two floating point numbers. I haven't updated any code related to this in the PR. Should I do something about this error?

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic changes needed

@@ -1004,6 +1004,11 @@ def save(self, fname, ignore=['state', 'dispatcher'], separately=None, *args, **
"""
if self.state is not None:
self.state.save(utils.smart_extension(fname, '.state'), *args, **kwargs)

# Save 'random_state' separately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of saving it separately?

Copy link
Contributor

@menshikh-iv menshikh-iv May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk I think It's better for case "file don't exist" (and more flexible)

else:
result.id2word = None
logging.warning("random_state not stored on disk so using default value")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if random_state is not stored that means that it is a new version of the model and it is going to be loaded in the main pickle load. Please change the logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk Your suggestion is a bit different from the earlier discussion on the issue. Hence, I want to make sure I understand the desired solution before making the changes.

If I understand it correctly, what you are suggesting is :

  • If there is indeed a file with the extension .random_state present on disk, this means that the model was saved using a pre-0.13.2 version of Gensim. So we use this file to set result.random_state at the time of loading.
  • However, if there is no such file present on disk, then this means that the model was saved using a post-0.13.2 version of Gensim and thus result.random_state got set at the time of the main pickle load. So in this case, we don't need to do anything else.

But for models saved using a pre-0.13.2 version of Gensim, there was no .random_state file created at the time of saving the model. So while loading such a model from disk, where would the .random_state file come from in this case? Is the user responsible for creating this file explicitly in such a case? If this is true, then I believe we don't need to make any changes in the save function for LdaModel at all and we just need to change the load function.

Please correct me if I am wrong or missing something here. Otherwise, if this is indeed what we need, I could go ahead and make the appropriate changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a check that this indeed happened: thus result.random_state got set at the time of the main pickle load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk I believe I have understood the solution that we want. However, I have a minor doubt about where would the .random_state file come from? Is it that the user is responsible for creating it explicitly always and we (i.e. from within the save function) need not create it ever?
If this is true, then in case we are loading a pre-0.13.2 model and no .random_state file exists on disk, then should we set result.random_state using a default value like get_random_state(None) with a logger.warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk Could you please respond to this query?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it that the user is responsible for creating it explicitly always

I think the user should not think about additional files, he saves the whole model

If this is true, then in case we are loading a pre-0.13.2 model and no .random_state file exists on disk, then should we set result.random_state using a default value like get_random_state(None) with a logger.warning?

I think it's correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simple way to do this is check that random_state was loaded, if not - you set result.random_state using a default value like get_random_state(None) with a logger.warning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv Then there is no .random_state file involved at all, correct? To summarize, the solution is :

  • First, load the entire model.
  • Check if result.random_state was set or not. For the newer (post 0.13.2) models, it would have been set through the main pickle load. For the older (pre 0.13.2) models, result.random_state would not be set through the main pickle load so we set result.random_state to get_random_state(None).

And in this solution, we don't need to make any changes in the save function, just the load function.

self.assertTrue(isinstance(i[1], six.string_types))

def testRandomStateBackwardCompatibility(self):
# load a model saved using a pre-0.13.2 version of Gensim
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is identical to the previous one. just one test is enough that checks all the fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I thought since these were two different issues so we'd want to put separate tests to verify both are resolved. I'll make the update according to your suggestion and remove the earlier test here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to avoid code duplication

@tmylk
Copy link
Contributor

tmylk commented May 17, 2017

could you please test that this exact test fails in the version of the code prior to your changes?
Please create a commit with only the test/test_data and zero changes.

@chinmayapancholi13
Copy link
Contributor Author

chinmayapancholi13 commented May 17, 2017

@tmylk The last commit now shows that all checks have passed. I guess it's sorted for now then.
I am making the remaining changes as per your suggestions above.

@tmylk
Copy link
Contributor

tmylk commented May 17, 2017

To validate this as a fix we need to see what changed. This is how Test Driven Development works for any issue:

  1. write new tests. See them fail. You have already written the tests, so please create a commit with only the test/test_data and zero code changes.

  2. Write new code. see tests pass. This is what you have done in the last commit.

Could you please add these 2 commits to this PR?

@chinmayapancholi13
Copy link
Contributor Author

chinmayapancholi13 commented May 17, 2017

@tmylk Got it. Thanks a lot for the comprehensive feedback. :)
I am adding these 2 commits to the PR.

Edit : I have added the commit having only unit tests and test data, which (as it should) is failing the test which checks random_state backward compatiblity. Could you please respond to my query for your review comments above before I make the remaining changes in the code?

@chinmayapancholi13
Copy link
Contributor Author

chinmayapancholi13 commented May 23, 2017

@tmylk @menshikh-iv I have made changes to the code as well. One of the tests is failing for Python 2.7 in the last commit because of Cannot allocate memory error.

Edit : The last commit now passes all the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants