Proposal: Replace word2vec-specific implementation w/ constrained subclass of FastText #2879

gojomo · 2020-07-12T02:14:23Z

I believe that everything Word2Vec does can also already be done equivalently via FastText, with constrained options (like turning off char-ngrams). So we could potentially eliminate a lot of duplicated algorithm code & future maintenance overhead by recharacterizing FastText as the root of our 2Vec hierarchy, instead of the original Word2Vec code.

We'd want to ensure FastText gracefully handles char-ngram-disablement (by not making allocations/class-choices only required when ngrams are enabled), and make Word2Vec a subclass of FastText which hides some options, rather than the other way around with FastText as a subclass that adds new options.
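To make the proposed inversion concrete, here is a minimal sketch of the class relationship. The class names `FastTextSketch` and `Word2VecSketch`, the `uses_ngrams` property, and the `ngram_table` attribute are all hypothetical illustrations, not gensim's actual classes or attributes. The default values are assumptions as well.

```python
class FastTextSketch:
    """Hypothetical root class: whole-word vectors plus optional char n-grams."""

    def __init__(self, vector_size=100, min_n=3, max_n=6, bucket=2_000_000):
        self.vector_size = vector_size
        self.min_n = min_n
        self.max_n = max_n
        self.bucket = bucket
        # Allocate n-gram storage only when n-grams are enabled, so the
        # disabled case pays no memory cost for an unused feature.
        self.ngram_table = {} if self.uses_ngrams else None

    @property
    def uses_ngrams(self):
        # Assumed rule: n-grams need at least one bucket and a non-empty
        # length range. The thread has not settled on the canonical switch.
        return self.bucket > 0 and 0 < self.min_n <= self.max_n


class Word2VecSketch(FastTextSketch):
    """Hypothetical constrained subclass that hides the n-gram options."""

    def __init__(self, vector_size=100):
        # The subclass exposes no n-gram parameters and pins them to "off".
        super().__init__(vector_size=vector_size, min_n=0, max_n=0, bucket=0)
```

In this arrangement the subclass narrows the constructor signature instead of widening it, which is the "hides some options" direction described above.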

User-visible APIs might not change at all. Some new ngram-enabled Doc2Vec options might become possible if Doc2Vec derives from FastText (like an inference that works at least a little, in some modes, even with all OOV words).

(Relates to: #2852)


piskvorky · 2020-07-12T08:09:02Z

Makes sense to me. I also can't think of anything – beyond performance and branding – that word2vec has over fasttext.

gojomo · 2020-07-13T23:17:57Z

A quick check of Word2Vec (all defaults) & FastText (all defaults except n-grams disabled) has already shown a few issues that will need addressing:

- it's a bit muddled how ngrams should be turned off - different comments/code imply either (1) max_n less than min_n; (2) char_ngrams=0; (3) char_ngrams==0 and max_n==0 (in FastText.__init__()).
- the train() step that takes 55s in plain Word2Vec takes 84s in FastText - a lot of overhead for a disabled feature
- an evaluate_word_analogies() w/ question_words.txt fails everything on the FT-without-ngrams model, rather than roughly matching the should-be-equivalent Word2Vec - unclear why
- even without ngrams, the FastText model still maintains its two separate sets of per-word vectors: the raw whole-word vector, and the whole-word-plus-ngram-enrichment calculation – so RAM usage & on-disk storage is more expansive than would be best
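As an illustration of why option (1), max_n less than min_n, disables the feature, here is a simplified sketch of character n-gram extraction. The function name `char_ngrams` is hypothetical and this is not gensim's implementation. It only assumes the usual FastText convention of wrapping a word in `<` and `>` before slicing.

```python
def char_ngrams(word, min_n, max_n):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    # An empty or inverted length range yields no n-grams at all.
    if max_n <= 0 or max_n < min_n:
        return []
    wrapped = f"<{word}>"
    grams = []
    for n in range(max(min_n, 1), max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams
```

With these definitions, `char_ngrams("cat", 3, 4)` returns `['<ca', 'cat', 'at>', '<cat', 'cat>']`, while `char_ngrams("cat", 3, 2)` and `char_ngrams("cat", 3, 0)` both return an empty list, leaving only the whole-word vector to train.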

Most of these probably have straightforward fixes, but the existing gensim FastText implementation isn't yet a simple drop-in replacement for Word2Vec as one might have hoped. (Potentially, rough parity in ngrams-disabled performance and quality-evaluations could have been part of the FT code's original review/testing.)

piskvorky · 2020-07-26T15:59:13Z

Thanks for that check, that's a great start. Are the conclusions above still true after #2891?

Bullet points 2) and 3) are especially worrying. Unless it's something trivial, we're probably not aiming to resolve this for 4.0.0, although 4.0.0 would be the ideal place for a change like this.

gojomo · 2020-07-27T05:00:31Z

The speed gap & problems with analogies are fixed by (ready-for-merge) #2891; the duplication bloat in RAM & serialization is fixed by (probably OK for merge but still intended as a place to fix a few other load/save issues) #2892. Other as-yet-undiagnosed bloat (both W2V & FT's 'main' pickle file are larger since #2698) probably has a quick fix when I get around to looking closer.

The exact recommended/best way to run without ngrams still needs a little more investigation & doc-improvement. (It's probably max_n=0 but bucket=0 might be as good and maybe should be equally supported.)
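A rough sketch of why bucket=0 could serve as an alternative off-switch: if n-grams are stored by hashing each one into a fixed number of buckets, then zero buckets leaves nowhere to put them. The helper names `fnv1a_32` and `ngram_slot` are hypothetical, and the hash shown is the standard 32-bit FNV-1a. Real implementations may differ in detail, so treat this as an assumption-laden illustration only.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash of a byte string."""
    h = 2166136261  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by FNV prime, keep 32 bits
    return h


def ngram_slot(ngram, bucket):
    """Map an n-gram to a bucket index, or None when there are no buckets."""
    if bucket <= 0:
        # With no buckets there is no n-gram storage, so the n-gram is dropped.
        return None
    return fnv1a_32(ngram.encode("utf-8")) % bucket
```

Under this sketch, max_n=0 prevents n-grams from being generated in the first place, while bucket=0 lets them be generated but gives them no storage. Supporting both, as suggested above, would mean treating either condition as "ngrams off".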

After that, the biggest issue would likely be ensuring Doc2Vec could live as a FastText subclass (instead of Word2Vec), & ironing out any little oddities from Word2Vec & Doc2Vec relying on FT, but possibly hiding/suppressing some of its unneeded aspects. I don't think that'd be too big of a problem, but there may be gotchas yet to be revealed.

piskvorky · 2020-07-27T07:44:51Z

Alright. I'll tentatively mark this as "4.0.0" but not really blocking.

This was referenced Jul 16, 2020

KeyedVectors & *2Vec API streamlining, consistency #2698

Merged

[MRG] Fix FastText word-vectors w/ ngrams/buckets off #2891

Merged

piskvorky mentioned this issue Jul 26, 2020

save_facebook_model() - AssertionError #2853

Closed

piskvorky added this to the 4.0.0 milestone Jul 27, 2020

piskvorky added the housekeeping internal tasks and processes label Jul 27, 2020

piskvorky removed this from the 4.0.0 milestone Sep 13, 2020
