
cPickle.UnpicklingError: unpickling stack underflow #1447

Open · loretoparisi opened this issue Jun 23, 2017 · 7 comments

Labels: bug (Issue described a bug) · difficulty medium (Medium issue: required good gensim understanding & python skills)
@loretoparisi (Contributor) commented Jun 23, 2017
I get this error while loading wiki.en.vec from the FastText pre-trained Word2Vec model. See here for this model.

2017-06-23 16:41:40,834 : INFO : loading Word2Vec object from /Volumes/Dataset/word2vec/wiki.en/wiki.en.vec
Traceback (most recent call last):
  File "loadlyricsmodel.py", line 45, in <module>
    model = Word2Vec.load( model_filepath )
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1382, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 271, in load
    obj = unpickle(fname)
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 935, in unpickle
    return _pickle.loads(f.read())
cPickle.UnpicklingError: unpickling stack underflow

loaded with

model = Word2Vec.load( model_filepath )

I'm using

gensim-2.2.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
@gojomo (Collaborator) commented Jun 23, 2017

Word2Vec.load() only loads models saved from gensim. (It uses Python pickling.)
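(This also explains the exact error message: a .vec file is plain text, and the pickle loader fails as soon as it interprets those bytes as pickle opcodes – a leading digit such as '2' happens to be the DUP opcode, which underflows the empty pickle stack. A minimal reproduction, using a made-up header line, not the real file contents:)

```python
import pickle

# A .vec file begins with a plain-text header "<vocab_size> <dimensions>",
# followed by word/vector lines. None of this is a pickle stream, so
# pickle.loads() rejects it with an UnpicklingError.
fake_vec_bytes = b"2519370 300\nthe 0.1 0.2 0.3\n"

try:
    pickle.loads(fake_vec_bytes)
except pickle.UnpicklingError as e:
    print("UnpicklingError:", e)
```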

I believe that .vec file is in the format used by the original Google word2vec.c (and now FastText) for its top-level vectors, so KeyedVectors.load_word2vec_format() may work, perhaps with a binary=False parameter.
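(For context, the text format that load_word2vec_format() parses is simple: one header line with the vocabulary size and vector dimensionality, then one whitespace-separated line per word. A stdlib-only sketch of a reader – a hypothetical helper for illustration, not gensim's actual implementation:)

```python
def read_word2vec_text(lines):
    """Parse the word2vec/fastText text (.vec) format:
    first line '<vocab_size> <dim>', then '<word> v1 v2 ... vdim' per word."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, "malformed row: wrong vector length"
        vectors[word] = values
    assert len(vectors) == vocab_size, "header/body mismatch"
    return vectors

# Example with a tiny fake vocabulary of 2 words, 3 dimensions:
vecs = read_word2vec_text(["2 3", "the 0.1 0.2 0.3", "of 0.4 0.5 0.6"])
print(vecs["of"])
```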

The method gensim.models.wrappers.fasttext.FastText.load_fasttext_format(), which also brings in ngrams for OOV word vector synthesis, may be of interest too... but I'm not sure it's yet doing the right thing in the released gensim, as compared to PR-in-progress #1341.

@menshikh-iv (Contributor) commented:

@jayantj @prakhar2b wdyt?

@prakhar2b (Contributor) commented:

@gojomo yes, KeyedVectors.load_word2vec_format() will definitely work here; binary=False is the default parameter anyway.

As for OOV word synthesis, what do you mean by "not sure if it's yet doing the right thing in the released gensim"? I think for OOV we need the n-gram information, which is provided in the .bin file.

As of now, gensim.models.wrappers.fasttext.FastText.load_fasttext_format() is used to load the complete model for this purpose, using both the .vec and .bin files. With PR #1341 we will need only the .bin file; all other functionality will remain the same, I believe.

cc @jayantj @menshikh-iv

@jayantj (Contributor) commented Jun 26, 2017

Yes, with the .bin AND the .vec file, you can load the complete model using -

from gensim.models.wrappers.fasttext import FastText
model = FastText.load_fasttext_format('/path/to/model')  # without the .bin/.vec extension

With the .vec file, you can load only the word vectors (and not the out-of-vocab word information) using -

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('/path/to/model.vec')  # with the .vec extension

@loretoparisi (Contributor, Author) commented:

@jayantj Thanks, let me try first with load_fasttext_format and the FastText wrapper.

@gojomo (Collaborator) commented Jun 28, 2017

@prakhar2b My "not sure" comment referred to some discussion I saw on another issue or PR in progress, perhaps the one that's also debating whether discarding untrained ngrams is a necessary optimization – I had the impression our calculation might be diverging from the original FB fastText on some (perhaps just OOV) words. (And even if that's defensible, because the untrained ngrams are still just random vectors, it might not be the 'right thing' overall: it may violate the user expectation that, whether a model is loaded into the original FT code or into gensim's FT code, OOV words get the same vectors from the same loaded model.)
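(For anyone following along, fastText synthesizes an OOV word's vector from the character n-grams – 3 to 6 characters by default – of the word wrapped in boundary markers. A rough sketch of the idea; real fastText hashes n-grams into buckets, and the exact combination/normalization is precisely what is under discussion above:)

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams of '<word>', with fastText-style boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams; whether gensim and
    the original fastText agree on exactly this computation is the question."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][k] for g in grams) / len(grams)
            for k in range(dim)]
```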

@piskvorky piskvorky added the bug Issue described a bug label Aug 31, 2017
@piskvorky (Owner) commented Aug 31, 2017

We definitely want to follow whatever the original FT does -- the path of least surprise for anyone migrating / trying both.

@menshikh-iv menshikh-iv added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 2, 2017