Proposal: Replace word2vec-specific implementation w/ constrained subclass of FastText #2879
Comments
Makes sense to me. I also can't think of anything – beyond performance and branding – that word2vec has over fasttext.
A quick check of
Most of these probably have straightforward fixes, but the existing gensim
Thanks for that check, that's a great start. Are the conclusions above still true after #2891? Bullet points 2) and 3) are especially worrying. Unless it's something trivial, we're probably not aiming to resolve this for 4.0.0, although 4.0.0 would be the ideal place for a change like this.
The speed gap & problems with analogies are fixed by (ready-for-merge) #2891; the duplication bloat in RAM & serialization is fixed by (probably OK for merge but still intended as a place to fix a few other load/save issues) #2892. Other as-yet-undiagnosed bloat (both W2V & FT's 'main' pickle file are larger since #2698) probably has a quick fix when I get around to looking closer. The exact recommended/best way to run without ngrams still needs a little more investigation & doc-improvement. (It's probably After that, the biggest issue would likely be ensuring |
Alright. I'll tentatively mark this as "4.0.0" but not really blocking.
I believe that everything `Word2Vec` does can also already be done equivalently via `FastText`, with constrained options (like turning off char-ngrams). So we could potentially eliminate a lot of duplicated algorithm code & future maintenance overhead by recharacterizing `FastText` as the root of our 2Vec hierarchy, instead of the original `Word2Vec` code.

We'd want to ensure `FastText` gracefully handles char-ngram-disablement (by not making allocations/class-choices only required when ngrams are enabled), and make `Word2Vec` a subclass of `FastText` which hides some options, rather than the other way around with `FastText` as a subclass that adds new options.

User-visible APIs might not change at all. Some new ngram-enabled `Doc2Vec` options might become possible if `Doc2Vec` derives from `FastText` (like an inference that works at least a little, in some modes, even with all OOV words).

(Relates to: #2852)
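
To make the proposed inversion concrete, here is a rough, dependency-free sketch of the shape it might take. It is not gensim code: `SubwordModel` and `WordOnlyModel` are hypothetical stand-ins for `FastText` and `Word2Vec`, and treating `max_n < min_n` (or a zero `bucket`) as "ngrams off" is an assumption made for the example, not a settled API decision.

```python
class SubwordModel:
    """Hypothetical stand-in for FastText as the root of the hierarchy."""

    def __init__(self, vector_size=100, min_n=3, max_n=6, bucket=2_000_000):
        self.vector_size = vector_size
        self.min_n = min_n
        self.max_n = max_n
        # Assumption: ngrams count as disabled when max_n < min_n or bucket == 0.
        self.ngrams_enabled = max_n >= min_n and bucket > 0
        # Graceful disablement: reserve no ngram buckets when they are unused.
        self.bucket = bucket if self.ngrams_enabled else 0

    def char_ngrams(self, word):
        """Return the char-ngrams of `word`, or [] when ngrams are disabled."""
        if not self.ngrams_enabled:
            return []
        padded = f"<{word}>"  # fastText-style word boundary markers
        return [
            padded[i:i + n]
            for n in range(self.min_n, self.max_n + 1)
            for i in range(len(padded) - n + 1)
        ]


class WordOnlyModel(SubwordModel):
    """Hypothetical stand-in for Word2Vec: same code path, ngram options hidden."""

    def __init__(self, vector_size=100):
        # The subclass only constrains options; it adds no algorithm code.
        super().__init__(vector_size=vector_size, min_n=1, max_n=0, bucket=0)


if __name__ == "__main__":
    print(SubwordModel(min_n=3, max_n=3).char_ngrams("cat"))
    print(WordOnlyModel().char_ngrams("cat"), WordOnlyModel().bucket)
```

The point of the sketch is only the direction of inheritance: the constrained class passes fixed values up to the general one, so the user-facing constructor of the word-only model can stay as small as it is today.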