
RandomState (Fix to issue #113) breaks backwards compatibility with old LDA models #1082

Closed
mattilyra opened this issue Jan 8, 2017 · 17 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@mattilyra
Contributor

mattilyra commented Jan 8, 2017

LDA models saved before version 0.13.2 cannot be used in version 0.13.2 and later because they do not contain the random_state attribute. This should be a pretty trivial fix in the load functionality, though.

The breaking PR is this one: 2e0ed26

@piskvorky piskvorky added the bug Issue described a bug label Jan 8, 2017
@tmylk tmylk added the difficulty easy Easy issue: required small fix label Jan 8, 2017
@tmylk tmylk changed the title Fix to issue #113 breaks backwards compatibility with old models RandomState (Fix to issue #113) breaks backwards compatibility with old LDA models Jan 8, 2017
@ibrahimsharaf
Contributor

I am interested in working on this bug. How can I start, and what does "breaking pull request" mean?

@tmylk
Contributor

tmylk commented Feb 12, 2017

@ibrahimsharaf The pull request that introduced this error is 2e0ed26

@ibrahimsharaf
Contributor

ibrahimsharaf commented Feb 13, 2017

@tmylk should we use state_fname = utils.smart_extension(fname, '.random_state') instead of state_fname = utils.smart_extension(fname, '.state') in the load function?

@tmylk
Contributor

tmylk commented Feb 14, 2017

@ibrahimsharaf That line is used to load the saved state, which is different from the random_state in this issue.
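
(For context, a minimal sketch of the two filenames involved, assuming utils.smart_extension simply derives a sibling filename from the model name: the '.state' file holds the pickled training state, while random_state is an attribute of the model object itself and would need its own '.random_state' sibling.)

from gensim import utils

# '.state' is where the pickled training state lives; the proposed
# '.random_state' file would be a separate sibling holding only the
# model's numpy RandomState.
fname = 'lda_model_saved1'
print(utils.smart_extension(fname, '.state'))          # lda_model_saved1.state
print(utils.smart_extension(fname, '.random_state'))   # lda_model_saved1.random_state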

@tmylk
Contributor

tmylk commented Mar 20, 2017

@ibrahimsharaf It is strange that it can't be reproduced - trying to use an unpickled object that doesn't have a required field should cause an error. What is your code to reproduce?

@tmylk
Contributor

tmylk commented May 2, 2017

Current status: awaiting a code fix

@chinmayapancholi13
Contributor

@tmylk @menshikh-iv On running the code below:

from gensim import corpora, models

#save with version <=0.13.1
# texts = [['human', 'interface', 'computer'],
#  ['survey', 'user', 'computer', 'system', 'response', 'time'],
#  ['eps', 'user', 'interface', 'system'],
#  ['system', 'human', 'system', 'eps'],
#  ['user', 'response', 'time'],
#  ['trees'],
#  ['graph', 'trees'],
#  ['graph', 'minors', 'trees'],
#  ['graph', 'minors', 'survey']]
# dictionary = corpora.Dictionary(texts)
# corpus = [dictionary.doc2bow(text) for text in texts]

# model = models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)
# print(model.print_topics(num_topics=3, num_words=3))
# model.save('lda_model_saved1')

#load with version >=0.13.2
load_model = models.LdaModel.load('lda_model_saved1')
print(load_model.print_topics(num_topics=2, num_words=3))

I get the following error:

Traceback (most recent call last):
  File "1082_reproduce1.py", line 22, in <module>
    print(load_model.print_topics(num_topics=2, num_words=3))
  File "/home/chinmaya13/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 773, in print_topics
    return self.show_topics(num_topics, num_words, log=True)
  File "/home/chinmaya13/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 797, in show_topics
    sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
AttributeError: 'LdaModel' object has no attribute 'random_state'

So I believe I am able to reproduce the issue. However, I wanted to verify the solution we want here: while saving, we write a separate file to disk for self.random_state (with the extension ".random_state"), and while loading, we check whether such a file is present on disk. If it is not, we set self.random_state to a default value like get_random_state(None).
Please correct me if I am wrong or missing something.

@menshikh-iv
Contributor

You are right, @chinmayapancholi13. Just add a logger.warning if the file is not present on disk.
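
(A minimal sketch of the load-side fallback discussed above, assuming it sits inside LdaModel.load and reuses the utils helpers mentioned in this thread - smart_extension, unpickle, get_random_state; the helper name _load_random_state and the warning texts are illustrative, not the merged patch.)

import logging
import os

from gensim import utils


def _load_random_state(result, fname):
    # Restore result.random_state from a sibling '.random_state' file if it
    # exists; models saved with gensim <= 0.13.1 never wrote this file, so
    # fall back to a default RandomState and warn instead of failing later.
    random_state_fname = utils.smart_extension(fname, '.random_state')
    if os.path.isfile(random_state_fname):
        try:
            result.random_state = utils.unpickle(random_state_fname)
        except Exception as e:
            logging.warning("failed to load random_state from %s: %s", random_state_fname, e)
            result.random_state = utils.get_random_state(None)
    else:
        logging.warning("random_state not stored on disk, falling back to a default seed")
        result.random_state = utils.get_random_state(None)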

@chinmayapancholi13
Contributor

@menshikh-iv However, when we save an LDA model, only 4 files currently get created: model_name, model_name.expElogbeta.npy, model_name.id2word and model_name.state. But we want to create another file for storing the random_state value, so we then need to modify both the save and load functions. Is this correct?

@menshikh-iv
Contributor

@chinmayapancholi13 yep
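
(A correspondingly small sketch of the save-side change, with _save_random_state as an illustrative name and utils.pickle assumed as the counterpart of the utils.unpickle call quoted later in the thread.)

from gensim import utils


def _save_random_state(model, fname):
    # Write model.random_state to a sibling '.random_state' file, mirroring
    # how the expElogbeta, id2word and state parts are already stored.
    random_state_fname = utils.smart_extension(fname, '.random_state')
    utils.pickle(model.random_state, random_state_fname)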

@chinmayapancholi13
Contributor

@menshikh-iv Great! I'll submit a PR for this shortly then.

@chinmayapancholi13
Contributor

@menshikh-iv For pre-0.13.2 versions, two files are created while saving the model: model_name and model_name.state. In the post-0.13.2 versions, at load time, the load function checks whether a .id2word file exists on disk. If it doesn't exist, then result.id2word is set to None. But after this, if one calls the show_topics function on the model, we get an error like:

Traceback (most recent call last):
  File "1082_reproduce1.py", line 28, in <module>
    print(load_model.print_topics(num_topics=2, num_words=3))
  File "/home/chinmaya13/GSOC/Gensim/gensim/gensim/models/basemodel.py", line 16, in print_topics
    return self.show_topics(num_topics=num_topics, num_words=num_words, log=True)
  File "/home/chinmaya13/GSOC/Gensim/gensim/gensim/models/ldamodel.py", line 792, in show_topics
    topic_ = [(self.id2word[id], topic_[id]) for id in bestn]
TypeError: 'NoneType' object has no attribute '__getitem__'

As can be inferred from the error log above, this is because self.id2word is None and we are accessing self.id2word[id], which then gives us an error. So I wanted to confirm whether this behavior is what we expect. If so, I could go ahead with the PR for fixing the random_state issue.

@tmylk
Contributor

tmylk commented May 15, 2017

Yes, id2word was moved to a separate file in #1039 to work around a Python 2/3 bug. Can we load it from the main pickle in pre-0.13.2?

@chinmayapancholi13
Contributor

@tmylk Yes, we are able to load from the main pickle for pre-0.13.2 versions. So result.id2word seems to be getting overwritten to None when a model saved with a pre-0.13.2 version is loaded with a post-0.13.2 version. Thus, we actually want to execute this code in the load function:

        id2word_fname = utils.smart_extension(fname, '.id2word')
        if (os.path.isfile(id2word_fname)):
            try:
                result.id2word = utils.unpickle(id2word_fname)
            except Exception as e:
                logging.warning("failed to load id2word dictionary from %s: %s", id2word_fname, e)
        else:
            result.id2word = None

only when loading a model that has been saved using a post-0.13.2 version.
Is this correct? If so, we could simply add a check like:

        if not result.id2word:
            id2word_fname = utils.smart_extension(fname, '.id2word')
            if (os.path.isfile(id2word_fname)):
                try:
                    result.id2word = utils.unpickle(id2word_fname)
                except Exception as e:
                    logging.warning("failed to load id2word dictionary from %s: %s", id2word_fname, e)
            else:
                result.id2word = None

This works for both cases, i.e. when the model has been saved using a pre-0.13.2 version (not result.id2word is False) and when the model has been saved using a post-0.13.2 version (not result.id2word is True).
Does this look OK to you?
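
(A quick way to sanity-check both fixes together, assuming 'lda_model_saved1' is the pre-0.13.2 model from the reproduction earlier in this thread; with the random_state fallback and the id2word check in place, the load should no longer raise.)

from gensim import models

# 'lda_model_saved1' is assumed to be the pre-0.13.2 model from the
# reproduction earlier in this thread.
load_model = models.LdaModel.load('lda_model_saved1')
assert load_model.id2word is not None       # restored from the main pickle
assert load_model.random_state is not None  # default fallback, no AttributeError
print(load_model.print_topics(num_topics=2, num_words=3))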

@tmylk
Contributor

tmylk commented May 15, 2017

Looks good in theory, but waiting for the unit tests :)

@chinmayapancholi13
Contributor

@tmylk Great! Then I'll try to address both of these issues (the id2word one and the random_state one) in the same PR. Thanks for your feedback. :)

@menshikh-iv
Contributor

Fixed in #1327
