Bucket bug for FastText.load_fasttext_format #1779

Open
saroufimc1 opened this issue Dec 12, 2017 · 7 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills fasttext Issues related to the FastText model

saroufimc1 commented Dec 12, 2017

Hi,
I tried loading a pretrained model from Facebook's fastText into gensim using FastText.load_fasttext_format, and it looks like there is a bug in how bucket is handled here as well.

I noticed that for bucket = 2,000,000 we had model.wv.syn0_ngrams.shape[0] = 7,221,731 instead of 2,000,000.

Also, in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/fasttext.py

I don't see why we have to compute the n-grams from the word vocabulary again. Aren't these already imported from the fastText .bin file?

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText

First download the zip file from https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip.
Then unzip and put the file 'wiki.en.bin' in your working directory.

model = FastText.load_fasttext_format('wiki.en')

print(model.wv.syn0_ngrams.shape[0])

Expected Results

Expected value of 2,000,000 which is the default value of bucket.

Actual Results

7,221,731, which here equals len(model.wv.ngrams).
In other words, it looks like there were no collisions even though we had more n-grams than buckets.
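
To illustrate why collisions are expected here, below is a minimal sketch of hashing character n-grams into a fixed number of buckets. Everything in it is an assumption made for illustration: `char_ngrams` and `toy_hash` are hypothetical helpers, the hash is a plain 32-bit FNV-1a rather than fastText's exact function, and the bucket count is tiny so the effect is visible.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def toy_hash(ngram):
    """32-bit FNV-1a over UTF-8 bytes; a stand-in, not fastText's exact hash."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


bucket = 50  # tiny on purpose so collisions are easy to see
words = ["collision", "bucket", "fasttext", "gensim", "vector", "matrix"]
ngrams = {g for w in words for g in char_ngrams(w)}

# Each n-gram maps to one of `bucket` rows, so distinct rows can never exceed bucket.
rows = {toy_hash(g) % bucket for g in ngrams}

print(len(ngrams), "distinct n-grams ->", len(rows), "distinct rows")
```

With more distinct n-grams than buckets, several n-grams have to share a row, so a matrix with one row per used bucket cannot have more than `bucket` rows.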

Also, please note that it took 10 minutes to load the fasttext model. I wonder if some parts of the code (especially in load_vectors) are really needed.

Thanks,
Carl

Versions

import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
import sys; print("Python", sys.version)
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
import gensim; print("gensim", gensim.__version__)
gensim 3.1.0
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Dec 12, 2017
@menshikh-iv
Contributor

Thanks for the report @saroufimc1, we'll try to reproduce this and look into it with @manneshiva later (the file takes a very long time to download).

@jayantj
Contributor

jayantj commented Dec 12, 2017

Hi @saroufimc1 thanks for the detailed report and ideas.
syn0_ngrams can (and usually does) contain more rows than the number of buckets. The reason is that the C++ FastText model stores num_words_in_vocab + bucket vectors in its input_ matrix, which corresponds to our syn0_ngrams matrix.

Re: computing ngrams: no, the FastText .bin file doesn't include the ngrams themselves. But you are right, the loading time is quite high, and it might be possible to get away without computing the ngrams from the vocab ourselves. I will post about it in more detail in #1261.

> Also, please note that it took 10 minutes to load the fasttext model. I wonder if some parts of the code (especially in load_vectors) are really needed.

Is there anything specific you had in mind?

Thanks!
Jayant

@jayantj
Contributor

jayantj commented Dec 12, 2017

Edit: I'm mistaken, this indeed is a bug.

@manneshiva
Contributor

Discussed this with @jayantj and narrowed the bug down to an error while trimming ngrams in the function init_ngrams. Will fix this bug soon.
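
For readers following along, here is a small sketch of how a trimming step can end up with more rows than buckets, and one way to avoid it. This is a toy under stated assumptions, not the actual init_ngrams code: the array names, sizes, and the list of bucket indices are all made up for illustration.

```python
import numpy as np

bucket, dim = 8, 4
rng = np.random.default_rng(0)
full = rng.standard_normal((bucket, dim))  # stand-in for the bucket part of the matrix

# Hypothetical bucket index of each distinct n-gram; repeated values are collisions.
ngram_rows = np.array([3, 5, 3, 1, 5, 5, 7, 1, 3, 0])

# Taking one row per n-gram duplicates the rows of colliding n-grams,
# so the result has as many rows as there are n-grams.
per_ngram = full.take(ngram_rows, axis=0)

# Deduplicating variant: keep each used bucket row once and remap the n-grams to it.
used_rows, remapped = np.unique(ngram_rows, return_inverse=True)
trimmed = full.take(used_rows, axis=0)

print(per_ngram.shape, trimmed.shape)
```

In the deduplicated version, `trimmed[remapped[i]]` is still the vector for the i-th n-gram, and the number of rows is bounded by `bucket`.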

@saroufimc1
Author

Thanks @manneshiva !
@jayantj: If the .bin file does not contain the word n-grams, what does it contain, given that the size of the input_ matrix is num_words_in_vocab + bucket?

In other words, the num_words_in_vocab rows correspond to the embeddings of the vocabulary words.
What about the bucket part? What are those embeddings in the input_ matrix for?

@manneshiva
Contributor

manneshiva commented Dec 15, 2017

@saroufimc1 The .bin file does not contain information about which ngrams were used. fastText finds an ngram's vector by using the ngram's hash to compute its index in the matrix. This is why input_ stores the entire bucket-sized matrix (without keeping track of which of these ngram vectors are used/trained) along with the vectors for the vocab words.
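
The lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not gensim or fastText code: it assumes the word rows come first and the bucket rows follow, `ngram_row` is a hypothetical helper, and the sizes and hash value are arbitrary.

```python
import numpy as np

nwords, bucket, dim = 5, 8, 4

# Toy input_ matrix: nwords rows for vocab words, then bucket rows for n-grams.
input_matrix = np.arange((nwords + bucket) * dim, dtype=np.float32).reshape(-1, dim)


def ngram_row(ngram_hash, nwords, bucket):
    """Row of an n-gram vector: skip the word rows, then take the hash modulo bucket."""
    return nwords + ngram_hash % bucket


row = ngram_row(123456789, nwords, bucket)
vector = input_matrix[row]  # the n-gram string itself is never stored, only its row

print(input_matrix.shape, row)
```

Because the row is derived from the hash alone, the file has to carry all `bucket` rows, whether or not a given row was ever trained.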

@mpenkov
Collaborator

mpenkov commented Feb 3, 2019

Reproduced with 3.6.0:

(360.venv) mpenkov@hetrad2:~$ cat bug_wrapper.py 
import logging
import sys
import os
from gensim.models.wrappers import FastText

print('<bug.py>')
os.system('free --giga')
logging.basicConfig(level=logging.INFO)
m = FastText.load_fasttext_format('wiki.en.bin')
print('shape: %r' % (m.wv.syn0_ngrams.shape,))
os.system('free --giga')
print('</bug.py>')
(360.venv) mpenkov@hetrad2:~$ time python bug_wrapper.py
<bug.py>
              total        used        free      shared  buff/cache   available
Mem:             62          21          22           0          19          40
Swap:            31           6          25
WARNING:gensim.models.deprecated.word2vec:Slow version of gensim.models.deprecated.word2vec is being used
INFO:gensim.models.deprecated.fasttext_wrapper:loading 2519370 words for fastText model from wiki.en.bin
INFO:gensim.models.deprecated.fasttext_wrapper:loading weights for 2519370 words for fastText model from wiki.en.bin
INFO:gensim.models.deprecated.fasttext_wrapper:loaded (2519370, 300) weight matrix for fastText model from wiki.en.bin
shape: (7221731, 300)
              total        used        free      shared  buff/cache   available
Mem:             62          37           5           0          19          24
Swap:            31           6          25
</bug.py>

real    6m33.439s
user    5m51.836s
sys     0m41.828s
