Bucket bug for FastText.load_fasttext_format #1779

saroufimc1 · 2017-12-12T00:00:02Z

Hi,
I tried loading a pretrained model from Facebook's fastText into gensim using FastText.load_fasttext_format, and it looks like there is a bug in how bucket is handled here as well.

I noticed that for bucket = 2,000,000 we had model.wv.syn0_ngrams.shape[0] = 7,221,731 instead of 2,000,000.

Also, looking at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/fasttext.py

I don't see why we have to compute the n-grams from the word vocabulary again. Aren't these already imported from the fastText .bin file?

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText

First download the zip file from https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip.
Then unzip and put the file 'wiki.en.bin' in your working directory.

model = FastText.load_fasttext_format('wiki.en')

print(model.wv.syn0_ngrams.shape[0])

Expected Results

Expected value of 2,000,000 which is the default value of bucket.

Actual Results

7,221,731 which is here equal to len(model.wv.ngrams).
In other words, it looks like there were no collisions although we had more n-grams than buckets.
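
To make the expectation concrete, here is a minimal sketch (not gensim's code). It assumes each n-gram is mapped to a row by hashing modulo bucket; zlib.crc32 is only a stand-in for the real hash, and distinct_rows is a hypothetical helper. However many n-grams go in, the number of distinct rows should never exceed bucket.

```python
import zlib


def distinct_rows(ngrams, bucket):
    """Count the distinct row indices obtained by hashing each n-gram modulo `bucket`.

    zlib.crc32 is a stand-in hash for illustration only; it is not the hash
    that fastText or gensim actually use.
    """
    return len({zlib.crc32(ng.encode("utf-8")) % bucket for ng in ngrams})


# 1000 distinct (made-up) n-grams hashed into only 100 buckets.
ngrams = ["ng%d" % i for i in range(1000)]
rows = distinct_rows(ngrams, bucket=100)
print(rows <= 100)
```

With 7,221,731 n-grams and bucket = 2,000,000, the same argument says the matrix should have at most 2,000,000 n-gram rows.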

Also, please note that it took 10 minutes to load the fasttext model. I wonder if some parts of the code (especially in load_vectors) are really needed.

Thanks,
Carl

Versions

import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
import sys; print("Python", sys.version)
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
import gensim; print("gensim", gensim.__version__)
gensim 3.1.0
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

menshikh-iv · 2017-12-12T07:44:26Z

Thanks for the report @saroufimc1, we'll try to reproduce it and look into it with @manneshiva later (the file takes a very long time to download).

jayantj · 2017-12-12T08:01:27Z

Hi @saroufimc1 thanks for the detailed report and ideas.
syn0_ngrams can (and usually does) contain more rows than the number of buckets. The reason for this is that the C++ FastText model stores num_words_in_vocab + bucket vectors in its input_ matrix, which corresponds to our syn0_ngrams matrix.

Re: computing ngrams, no, the FastText .bin file doesn't include the ngrams themselves. But you are right, the loading time is quite high and it might be possible to get away without computing the ngrams from the vocab ourselves. I will post about it in more detail in #1261.

> Also, please note that it took 10 minutes to load the fasttext model. I wonder if some parts of the code (especially in load_vectors) are really needed.

Is there anything specific you had in mind?

Thanks!
Jayant

jayantj · 2017-12-12T10:29:55Z

Edit: I'm mistaken, this indeed is a bug.

manneshiva · 2017-12-12T11:13:29Z

Discussed this with @jayantj and narrowed the bug down to an error while trimming ngrams in the function init_ngrams. Will fix this bug soon.

saroufimc1 · 2017-12-12T17:55:40Z

Thanks @manneshiva !
@jayantj : Since the .bin file does not contain the word n-grams, then what does it contain since the size of the input_ matrix is num_words_in_vocab + bucket ?

In other words, the num_words_in_vocab part corresponds to the embeddings of the vocabulary words.
What about the bucket part? What are those embeddings for in input_ matrix?

manneshiva · 2017-12-15T13:27:27Z

@saroufimc1 The .bin file does not contain information about the ngrams used. It accesses an ngram's vector by using the ngram's hash to find its index in the matrix. This is the reason it also stores the entire bucket-sized matrix (it does not keep track of which of these ngram vectors are used/trained) in input_ along with the vectors for vocab words.
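
A rough Python sketch of that lookup, as I understand it: an FNV-1a style hash over the ngram's UTF-8 bytes, taken modulo bucket and offset by the number of vocab words. The names ft_hash and ngram_row are made up for illustration, and the hash details are an assumption that should be checked against the fastText C++ source.

```python
def ft_hash(ngram):
    """32-bit FNV-1a style hash over the UTF-8 bytes of `ngram`.

    Assumption: each byte is treated as a signed 8-bit value before the XOR,
    which only matters for non-ASCII bytes.
    """
    h = 2166136261
    for b in ngram.encode("utf-8"):
        if b >= 128:
            b -= 256  # reinterpret the byte as a signed int8
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ngram_row(ngram, num_words, bucket):
    """Row of the ngram's vector in a (num_words + bucket)-row matrix."""
    return num_words + ft_hash(ngram) % bucket


# Sizes taken from this thread: 2519370 vocab words, bucket = 2000000.
num_words, bucket = 2519370, 2000000
row = ngram_row("<wh", num_words, bucket)
print(num_words <= row < num_words + bucket)
```

Because the row comes straight from the hash, two different ngrams can share a row, and the file never needs to list which ngrams were seen during training.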

mpenkov · 2019-02-03T10:46:49Z

Reproduced with 3.6.0:

(360.venv) mpenkov@hetrad2:~$ cat bug_wrapper.py 
import logging
import sys
import os
from gensim.models.wrappers import FastText

print('<bug.py>')
os.system('free --giga')
logging.basicConfig(level=logging.INFO)
m = FastText.load_fasttext_format('wiki.en.bin')
print('shape: %r' % (m.wv.syn0_ngrams.shape,))
os.system('free --giga')
print('</bug.py>')
(360.venv) mpenkov@hetrad2:~$ time python bug_wrapper.py
<bug.py>
              total        used        free      shared  buff/cache   available
Mem:             62          21          22           0          19          40
Swap:            31           6          25
WARNING:gensim.models.deprecated.word2vec:Slow version of gensim.models.deprecated.word2vec is being used
INFO:gensim.models.deprecated.fasttext_wrapper:loading 2519370 words for fastText model from wiki.en.bin
INFO:gensim.models.deprecated.fasttext_wrapper:loading weights for 2519370 words for fastText model from wiki.en.bin
INFO:gensim.models.deprecated.fasttext_wrapper:loaded (2519370, 300) weight matrix for fastText model from wiki.en.bin
shape: (7221731, 300)
              total        used        free      shared  buff/cache   available
Mem:             62          37           5           0          19          24
Swap:            31           6          25
</bug.py>

real    6m33.439s
user    5m51.836s
sys     0m41.828s

saroufimc1 mentioned this issue Dec 12, 2017

Improve FastText loading times #1261

Closed

menshikh-iv added the bug and difficulty medium labels Dec 12, 2017

manneshiva mentioned this issue Dec 15, 2017

Bug Fix 1779 #1787

Closed

saroufimc1 mentioned this issue Jan 17, 2018

Fix 1779 #1843

Closed

jbaiter mentioned this issue Feb 28, 2018

Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 #1916

Merged

mpenkov added the fasttext label Feb 3, 2019
