
support both old and new fastText model #1319

Merged
merged 19 commits into piskvorky:develop on May 25, 2017

Conversation

@prakhar2b (Contributor) commented May 13, 2017

Fix #1301. Support both old and new fastText models.

@prakhar2b (Contributor, Author) commented May 13, 2017

Tests are failing because the test data (lee_fasttext) was trained with the older version of Facebook's fastText; we need to update it too. Also, I'll handle the whitespace errors asap.

Note: as mentioned in this comment, Facebook made changes last week that introduce two new variables, magic and version, as well as dictionary pruning. These changes cause errors for all newly trained models when loading them with the gensim fastText wrapper. This PR modifies the wrapper accordingly.
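
To illustrate the idea (a rough standalone sketch, not the exact code in this PR): the wrapper can peek at the first two int32 values of the .bin file; new-format files start with the magic number and a version, while old-format files start directly with the model parameters.

import struct

FASTTEXT_FILEFORMAT_MAGIC = 793712314  # value checked in the diff below; the constant name is illustrative

def detect_new_format(file_handle):
    """Return (is_new_format, version) for an opened .bin file. Rough sketch only."""
    magic, version = struct.unpack('@2i', file_handle.read(8))
    if magic == FASTTEXT_FILEFORMAT_MAGIC:
        return True, version   # new format: the magic/version header has been consumed
    file_handle.seek(0)        # old format: rewind; those two ints were really dim and ws
    return False, None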

@piskvorky (Owner) left a comment

Minor code style nitpick; thanks for the PR!

@@ -256,7 +256,7 @@ def load_binary_data(self, model_binary_file):
self.load_vectors(f)

def load_model_params(self, file_handle):
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
(_,_,dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@14i1d')
Owner:

PEP8: space after comma. Outer brackets not needed.

@@ -275,7 +275,7 @@ def load_dict(self, file_handle):
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
ntokens, = self.struct_unpack(file_handle, '@q')
ntokens,pruneidx_size = self.struct_unpack(file_handle, '@2q')
Owner:

Space after comma.

@@ -289,6 +289,11 @@ def load_dict(self, file_handle):
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

for j in range(pruneidx_size):
_,_ = self.struct_unpack(file_handle,'@2i')
Owner:

Spaces after commas.

for j in range(pruneidx_size):
_,_ = self.struct_unpack(file_handle,'@2i')

_ = self.struct_unpack(file_handle,'@?')
Owner:

Ditto.

@piskvorky (Owner) commented May 14, 2017

And do we continue to support the old format too? Or will everyone's code (and models) break after this change?

Are the FastText datasets that people commonly use still loadable? (FB released models in many languages some time ago, IIRC.)

@prakhar2b (Contributor, Author) commented May 15, 2017

@piskvorky the fastText datasets (pretrained vectors) have been updated according to the recent changes (see this update from 12 days ago). Issue #1301 (about the Hebrew pretrained vectors) was caused by these changes. Issue #1236 (about the French pretrained vectors) is a glitch on Facebook's side, not in gensim's code.
cc @jayantj

@jayantj (Contributor) commented May 15, 2017

@prakhar2b I think we should definitely support loading both the old and the new FastText models with the gensim fasttext wrapper. FastText seems to have updated the links on their page to the newer models; however, there may still be a number of people using the old models.

Also, adding tests for loading both older and newer format models would be good.

@prakhar2b changed the title from "[WIP] mismatch in bin and vec file resolved." to "[WIP] support both old and new fastText model" on May 19, 2017
@prakhar2b (Contributor, Author) commented:
@tmylk @menshikh-iv please look into the branch conflict
cc @jayantj

#Test model successfully loaded from fastText (new format) .vec and .bin files
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file)
vocab_size, model_size = 1763, 10
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size))
Contributor:

Shouldn't this and the subsequent assert statements make use of new_model?

Contributor (Author):

self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size))

new_model is used here, after the assert statements:

self.model_sanity(new_model)

Contributor:

Yes, so why not for the other assert statements? If I understand correctly, the whole point of the test is to make sure the model we loaded (new_model) has been loaded correctly.

Contributor (Author):

test_new_model is defined here on line 38. Only the new model is being checked in this function, testLoadFastTextNewFormat. Please compare it with the function testLoadFastTextFormat to see the context.


def load_vectors(self, file_handle):
if self.new_format:
_ = self.struct_unpack(file_handle,'@?')
Contributor:

Can you please add a comment clarifying what this is for?

logger.warnings("Please report to gensim or fastText.")
ntokens= self.struct_unpack(file_handle, '@1q')
if self.new_format:
pruneidx_size = self.struct_unpack(file_handle, '@q')
Contributor:

pruneidx_size seems to be used in the original FastText code to load dictionaries -

void Dictionary::load(std::istream& in) {
  words_.clear();
  std::fill(word2int_.begin(), word2int_.end(), -1);
  in.read((char*) &size_, sizeof(int32_t));
  in.read((char*) &nwords_, sizeof(int32_t));
  in.read((char*) &nlabels_, sizeof(int32_t));
  in.read((char*) &ntokens_, sizeof(int64_t));
  in.read((char*) &pruneidx_size_, sizeof(int64_t));
  for (int32_t i = 0; i < size_; i++) {
    char c;
    entry e;
    while ((c = in.get()) != 0) {
      e.word.push_back(c);
    }
    in.read((char*) &e.count, sizeof(int64_t));
    in.read((char*) &e.type, sizeof(entry_type));
    words_.push_back(e);
    word2int_[find(e.word)] = i;
  }
  pruneidx_.clear();

  for (int32_t i = 0; i < pruneidx_size_; i++) {
    int32_t first;
    int32_t second;
    in.read((char*) &first, sizeof(int32_t));
    in.read((char*) &second, sizeof(int32_t));
    pruneidx_[first] = second;
  }

  initTableDiscard();
  initNgrams();
}

I'm not sure whether it's present in the models we're loading or not, though.

Contributor (Author):

for j in range(pruneidx_size):  
            _,_ = self.struct_unpack(file_handle,'@2i')

Presence (or absence) of these two lines doesn't affect the code.

@jayantj (Contributor) commented May 22, 2017:

I don't see those two lines in the code. Do you mean pruneidx_size is always zero, and it doesn't matter whether we keep those lines or not?
If yes, we should add an assert statement, since if pruneidx_size isn't zero and we don't read the corresponding bytes, the subsequent bytes are not going to be parsed correctly.
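
A rough sketch of what I mean (untested; this would sit inside load_dict, right after pruneidx_size is read):

if self.new_format:
    # a pruned model stores pruneidx_size (first, second) int32 pairs at this point;
    # if they are neither read nor ruled out, every subsequent read is misaligned
    assert pruneidx_size <= 0, 'pruned fastText models are not supported by the wrapper'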

Contributor (Author):

It's -1, actually. I added these two lines in the new commit. I think we should not use an assert statement here, as some models might have non-negative values, so adding these two lines should be sufficient.

def load_dict(self, file_handle):
(vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
def load_dict(self, file_handle, encoding='utf8'):
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
Contributor:

I'd prefer keeping the changes related to the issue with the French wiki in a separate PR. We don't want those changes to block this PR from being merged.
In general, I think it's good practice to keep fixes for different issues in different branches/PRs.

Contributor (Author):

Separated the PRs. Thanks :)

if magic == 793712314: # newer format
self.new_format = True
dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t = self.struct_unpack(file_handle, '@12i1d')
else: # older format
Contributor:

Better to set self.new_format = False inside this else, rather than inside initialize_word_vectors
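
That is, something like this (sketch only; it assumes the two ints already read are reused as dim and ws for the old format):

else:  # older format
    self.new_format = False
    dim, ws = magic, v  # the old header starts directly with dim and ws
    epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t = self.struct_unpack(file_handle, '@10i1d')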

@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file):
self.load_vectors(f)

def load_model_params(self, file_handle):
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
magic, v= self.struct_unpack(file_handle, '@2i')
if magic == 793712314: # newer format
Contributor:

I think it'd be good to store this value in a global variable. FASTTEXT_MAGIC_HEADER or something similar.
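
For example (sketch):

# at module level, near the imports:
FASTTEXT_MAGIC_HEADER = 793712314  # identifies the newer .bin file format

# then, inside load_model_params:
if magic == FASTTEXT_MAGIC_HEADER:  # newer format
    self.new_format = True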

@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file):
self.load_vectors(f)

def load_model_params(self, file_handle):
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d')
magic, v= self.struct_unpack(file_handle, '@2i')
@jayantj (Contributor) commented May 22, 2017:

Code style - keep 1 space before and after =
Also, version would be preferable to v.

@jayantj (Contributor) commented May 22, 2017

Left some comments, but apart from that, looks good!
Thanks for the fix.

What issues are you facing with the branch conflict?

@prakhar2b (Contributor, Author) commented:
@jayantj I updated the PR according to the review. The check errors are about the uploaded .vec file.
cc @tmylk

@tmylk (Contributor) commented May 22, 2017

The old fastText tests should still pass - strange to see them broken.
Also, please add .vec to the flake8 ignore list.

@xiaobaozi34 commented:
I pulled the 'mismatch' branch, re-installed gensim, and re-ran FastText_Tutorial.ipynb, but still got the same error:

AssertionError                            Traceback (most recent call last)
<ipython-input-3-993376a144dd> in <module>()
      7 lee_train_file = data_dir + 'lee_background.cor'
      8 
----> 9 model = FastText.train(ft_home, lee_train_file)
     10 
     11 print(model)

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
    209 
    210         output = utils.check_output(args=cmd)
--> 211         model = cls.load_fasttext_format(output_file)
    212         cls.delete_training_files(output_file)
    213         return model

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    237         model = cls()
    238         model.wv = cls.load_word2vec_format('%s.vec' % model_file, encoding=encoding)
--> 239         model.load_binary_data('%s.bin' % model_file, encoding=encoding)
    240         return model
    241 

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, model_binary_file, encoding)
    254         with utils.smart_open(model_binary_file, 'rb') as f:
    255             self.load_model_params(f)
--> 256             self.load_dict(f, encoding=encoding)
    257             self.load_vectors(f)
    258 

/Users/wqb921/anaconda2/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.pyc in load_dict(self, file_handle, encoding)
    275         (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
    276         # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
--> 277         assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
    278         assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
    279         ntokens, = self.struct_unpack(file_handle, '@q')

AssertionError: mismatch between vocab sizes

@prakhar2b (Contributor, Author) commented May 23, 2017

Ohh, the last commit was only a one-word change, so in excitement I entered True instead of False and committed without testing. The old fastText tests are not failing now. But the new fastText models should not have thrown an error anyway.

The only remaining error is about an assigned variable never being used. How should I handle it? It is required. @tmylk

@xiaobaozi34 I checked; it's running perfectly with this PR on my system. Could you please re-run it with logging enabled and show me the logs? Thank you.

# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
ntokens, = self.struct_unpack(file_handle, '@q')
ntokens = self.struct_unpack(file_handle, '@1q')
Contributor:

Could change this to simply `self.struct_unpack(file_handle, '@1q')  # number of tokens - unused` to remove the unused variable.

assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

if self.new_format:
for j in range(pruneidx_size):
_, _ = self.struct_unpack(file_handle, '@2i')
Contributor:

Same as above.

def load_vectors(self, file_handle):
if self.new_format:
_ = self.struct_unpack(file_handle, '@?') # bool quant_input in fasttext.cc
Contributor:

Same as above.

@jayantj (Contributor) commented May 23, 2017

@prakhar2b Left a couple of minor comments re failing build, we should be good to go after that. Thanks for taking care of this.

A couple of tips for the future -

  1. For issues like these, it's good to follow TDD - commit a failing test first, then commit the changes required to fix it; see this comment: [WIP] Fix backward incompatibility due to random_state #1327 (comment).
    It ensures the changes correctly fix the issue, and it also speeds up the development process for you personally.

  2. It's good practice to ensure commit messages actually reflect all the changes made in the commit - if a commit makes more than one unrelated update, it's usually better to separate it out into multiple commits.

@@ -286,7 +286,7 @@ def load_dict(self, file_handle, encoding='utf8'):
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
ntokens = self.struct_unpack(file_handle, '@1q')
self.struct_unpack(file_handle, '@1q') # number of tokens
Contributor:

PEP8: Inline comments should be separated by at least two spaces from the statement

@prakhar2b (Contributor, Author) commented:
Finally! 😄 @jayantj thanks for the review and the tips. It now feels like I'm doing some real work; it will only get better from here.
cc @tmylk

@prakhar2b changed the title from "[WIP] support both old and new fastText model" to "support both old and new fastText model" on May 23, 2017
@menshikh-iv (Contributor) commented:
Nice work @prakhar2b 🥇

@menshikh-iv merged commit 7b6afc0 into piskvorky:develop on May 25, 2017
@matanox commented May 25, 2017

Looking at the recent release cadence, would this go into a new release within a month or so?

@menshikh-iv (Contributor) commented:
@matanster yes, we plan the release for next week.
