
RandomState (Fix to issue #113) breaks backwards compatibility with old LDA models #1082

Closed
mattilyra opened this issue Jan 8, 2017 · 17 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@mattilyra
Contributor

mattilyra commented Jan 8, 2017

LDA models saved before version 0.13.2 cannot be used in version 0.13.2 and later because they do not contain the random_state attribute. This should be a pretty trivial fix in the load functionality, though.

The breaking PR is this one: 2e0ed26

@piskvorky piskvorky added the bug Issue described a bug label Jan 8, 2017
@tmylk tmylk added the difficulty easy Easy issue: required small fix label Jan 8, 2017
@tmylk tmylk changed the title Fix to issue #113 breaks backwards compatibility with old models RandomState (Fix to issue #113) breaks backwards compatibility with old LDA models Jan 8, 2017
@ibrahimsharaf
Contributor

I am interested in working on this bug. How can I start, and what does "breaking pull request" mean?

@tmylk
Contributor

tmylk commented Feb 12, 2017

@ibrahimsharaf The pull request that introduced this error is 2e0ed26

@ibrahimsharaf
Contributor

ibrahimsharaf commented Feb 13, 2017

@tmylk should we use state_fname = utils.smart_extension(fname, '.random_state') instead of state_fname = utils.smart_extension(fname, '.state') in the load function?

@tmylk
Contributor

tmylk commented Feb 14, 2017

@ibrahimsharaf That line is used to load the saved state, which is different from the random_state in this issue.
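
(For context, a minimal sketch of the two filenames involved, assuming utils.smart_extension simply derives a sibling filename from the model name: the '.state' file holds the pickled training state, while random_state is an attribute of the model object itself and would need its own '.random_state' sibling.)

from gensim import utils

# '.state' is where the pickled training state lives; the proposed
# '.random_state' file would be a separate sibling holding only the
# model's numpy RandomState.
fname = 'lda_model_saved1'
print(utils.smart_extension(fname, '.state'))          # lda_model_saved1.state
print(utils.smart_extension(fname, '.random_state'))   # lda_model_saved1.random_state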

@tmylk
Contributor

tmylk commented Mar 20, 2017

@ibrahimsharaf It is strange that it can't be reproduced - trying to use an unpickled object that doesn't have a required field should cause an error. What is your code to reproduce?

@tmylk
Contributor

tmylk commented May 2, 2017

Current status: awaiting a code fix

@chinmayapancholi13
Contributor

@tmylk @menshikh-iv On running the code below:

from gensim import corpora, models

#save with version <=0.13.1
# texts = [['human', 'interface', 'computer'],
#  ['survey', 'user', 'computer', 'system', 'response', 'time'],
#  ['eps', 'user', 'interface', 'system'],
#  ['system', 'human', 'system', 'eps'],
#  ['user', 'response', 'time'],
#  ['trees'],
#  ['graph', 'trees'],
#  ['graph', 'minors', 'trees'],
#  ['graph', 'minors', 'survey']]
# dictionary = corpora.Dictionary(texts)
# corpus = [dictionary.doc2bow(text) for text in texts]

# model = models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)
# print(model.print_topics(num_topics=3, num_words=3))
# model.save('lda_model_saved1')

#load with version >=0.13.2
load_model = models.LdaModel.load('lda_model_saved1')
print(load_model.print_topics(num_topics=2, num_words=3))

I get the following error:

Traceback (most recent call last):
  File "1082_reproduce1.py", line 22, in <module>
    print(load_model.print_topics(num_topics=2, num_words=3))
  File "/home/chinmaya13/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 773, in print_topics
    return self.show_topics(num_topics, num_words, log=True)
  File "/home/chinmaya13/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 797, in show_topics
    sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
AttributeError: 'LdaModel' object has no attribute 'random_state'

So I believe I am able to reproduce the issue. However, I wanted to verify the solution we want here: while saving, we write a separate file to disk for self.random_state (with the extension ".random_state"), and while loading, we check whether such a file is present on disk. If it is not, we set self.random_state to a default value like get_random_state(None).
Please correct me if I am wrong or missing something.

@menshikh-iv
Contributor

You are right, @chinmayapancholi13. Just add a logger.warning if the file is not present on disk.
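
(A minimal sketch of the load-side fallback discussed above, assuming it sits inside LdaModel.load and reuses the utils helpers mentioned in this thread - smart_extension, unpickle, get_random_state; the helper name _load_random_state and the warning texts are illustrative, not the merged patch.)

import logging
import os

from gensim import utils


def _load_random_state(result, fname):
    # Restore result.random_state from a sibling '.random_state' file if it
    # exists; models saved with gensim <= 0.13.1 never wrote this file, so
    # fall back to a default RandomState and warn instead of failing later.
    random_state_fname = utils.smart_extension(fname, '.random_state')
    if os.path.isfile(random_state_fname):
        try:
            result.random_state = utils.unpickle(random_state_fname)
        except Exception as e:
            logging.warning("failed to load random_state from %s: %s", random_state_fname, e)
            result.random_state = utils.get_random_state(None)
    else:
        logging.warning("random_state not stored on disk, falling back to a default seed")
        result.random_state = utils.get_random_state(None)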

@chinmayapancholi13
Contributor

@menshikh-iv However, when we save an LDA model, only 4 files currently get created: model_name, model_name.expElogbeta.npy, model_name.id2word and model_name.state. But we want to create another file for storing the random_state value, so we then need to modify both the save and load functions. Is this correct?

@menshikh-iv
Contributor

@chinmayapancholi13 yep
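
(A correspondingly small sketch of the save-side change, with _save_random_state as an illustrative name and utils.pickle assumed as the counterpart of the utils.unpickle call quoted later in the thread.)

from gensim import utils


def _save_random_state(model, fname):
    # Write model.random_state to a sibling '.random_state' file, mirroring
    # how the expElogbeta, id2word and state parts are already stored.
    random_state_fname = utils.smart_extension(fname, '.random_state')
    utils.pickle(model.random_state, random_state_fname)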

@chinmayapancholi13
Contributor

@menshikh-iv Great! I'll submit a PR for this shortly then.

@chinmayapancholi13
Contributor

@menshikh-iv For pre-0.13.2 versions, two files are created while saving the model: model_name and model_name.state. In the post-0.13.2 versions, at load time, the load function checks whether a .id2word file exists on disk. If it doesn't exist, then result.id2word is set to None. But after this, if one calls the show_topics function on the model, we get an error like:

Traceback (most recent call last):
  File "1082_reproduce1.py", line 28, in <module>
    print(load_model.print_topics(num_topics=2, num_words=3))
  File "/home/chinmaya13/GSOC/Gensim/gensim/gensim/models/basemodel.py", line 16, in print_topics
    return self.show_topics(num_topics=num_topics, num_words=num_words, log=True)
  File "/home/chinmaya13/GSOC/Gensim/gensim/gensim/models/ldamodel.py", line 792, in show_topics
    topic_ = [(self.id2word[id], topic_[id]) for id in bestn]
TypeError: 'NoneType' object has no attribute '__getitem__'

As can be inferred from the error log above, this is because self.id2word is None and we are accessing self.id2word[id], which then gives us an error. So I wanted to confirm whether this behavior is what we expect. If so, I could go ahead with the PR for fixing the random_state issue.

@tmylk
Contributor

tmylk commented May 15, 2017

Yes, id2word was moved to a separate file in #1039 to work around a Python 2/3 bug. Can we load it from the main pickle in pre-0.13.2?

@chinmayapancholi13
Contributor

@tmylk Yes, we are able to load from the main pickle for pre-0.13.2 versions. So result.id2word seems to be getting overwritten to None when a model saved with a pre-0.13.2 version is loaded with a post-0.13.2 version. Thus, we actually want to execute this code in the load function:

        id2word_fname = utils.smart_extension(fname, '.id2word')
        if (os.path.isfile(id2word_fname)):
            try:
                result.id2word = utils.unpickle(id2word_fname)
            except Exception as e:
                logging.warning("failed to load id2word dictionary from %s: %s", id2word_fname, e)
        else:
            result.id2word = None

only when loading a model that has been saved using a post-0.13.2 version.
Is this correct? If so, we could simply add a check like:

        if not result.id2word:
            id2word_fname = utils.smart_extension(fname, '.id2word')
            if (os.path.isfile(id2word_fname)):
                try:
                    result.id2word = utils.unpickle(id2word_fname)
                except Exception as e:
                    logging.warning("failed to load id2word dictionary from %s: %s", id2word_fname, e)
            else:
                result.id2word = None

This works for both cases, i.e. when the model has been saved using a pre-0.13.2 version (not result.id2word is False) and when the model has been saved using a post-0.13.2 version (not result.id2word is True).
Does this look OK to you?
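
(A quick way to sanity-check both fixes together, assuming 'lda_model_saved1' is the pre-0.13.2 model from the reproduction earlier in this thread; with the random_state fallback and the id2word check in place, the load should no longer raise.)

from gensim import models

# 'lda_model_saved1' is assumed to be the pre-0.13.2 model from the
# reproduction earlier in this thread.
load_model = models.LdaModel.load('lda_model_saved1')
assert load_model.id2word is not None       # restored from the main pickle
assert load_model.random_state is not None  # default fallback, no AttributeError
print(load_model.print_topics(num_topics=2, num_words=3))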

@tmylk
Contributor

tmylk commented May 15, 2017

Looks good in theory, but waiting for the unit tests :)

@chinmayapancholi13
Contributor

@tmylk Great! Then I'll try to address both of these issues (the id2word one and the random_state one) in the same PR. Thanks for your feedback. :)

@menshikh-iv
Contributor

Fixed in #1327
