FastText native VS original, different outputs #1940
Comments
This behavior is expected, as pointed out by this comment in the unit test file. The vector for an OOV word in gensim is likely to be slightly different from the vector obtained for the same OOV word from the original FastText implementation. This is because the gensim code discards unused ngram vectors (to save memory), while the original implementation keeps all the buckets (and hence all ngrams). So it is possible that a new OOV word contains a few ngrams whose vectors are missing after the discarding. Such a case is highly unlikely (depending on the bucket size and vocab size) after PR #1916 (merged after the creation of this issue).
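To make the mechanism concrete, here is a minimal sketch of how FastText composes an OOV vector from hashed character ngrams. The boundary markers and the FNV-1a style hash follow the original implementation's scheme; the function names (`char_ngrams`, `ngram_bucket`, `oov_vector`) and the dict-of-buckets representation are illustrative, not gensim's actual API. A bucket missing from the dict models an ngram vector that was discarded at load time.

```python
import numpy as np

def char_ngrams(word, minn=3, maxn=6):
    # FastText wraps the word in boundary markers before extracting ngrams.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def ngram_bucket(ngram, num_buckets=2_000_000):
    # FNV-1a style hash into a fixed number of buckets
    # (illustrative reimplementation of FastText's hashing scheme).
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % num_buckets

def oov_vector(word, bucket_vectors):
    # bucket_vectors: dict bucket_id -> vector; missing entries model
    # the buckets a loader discarded as "unused".
    rows = [bucket_vectors[b]
            for b in map(ngram_bucket, char_ngrams(word))
            if b in bucket_vectors]
    return np.mean(rows, axis=0) if rows else None
```

If every bucket a word's ngrams hash to is present, the result matches the full model; if even one bucket was discarded, the mean is taken over fewer rows and the vector shifts.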
Thanks for the detailed explanation @manneshiva 👍
That we might throw out never-encountered (and thus never-trained) n-grams might be an acceptable optimization for the case where a model is fully trained inside gensim. (After all, if arbitrary random untrained vectors for those n-grams are needed later, they can be created later.) But to discard them from an externally-trained, loaded, static model, and thus get different vectors than the original FastText in what should be a completely deterministic process, strikes me as a deviation from expected behavior, and thus a bug, despite any explanation. Your thoughts, @piskvorky?
In general we want to stick to the original, yes. @manneshiva what does "unused" mean for externally loaded models? How much do we gain by modifying the static model's default behaviour there? I don't have enough intuition about the trade-offs, the pros and cons.
I guess the expected behavior is that, if I load a model, all the vectors in that model are actually used in the end. I think deleting parts of a loaded binary model is not ideal, especially because it doesn't seem to be clearly pointed out anywhere. So the cons seem relatively clear: you are loading a model that may have been evaluated to have a certain quality, but under gensim you are no longer guaranteed that the quality will hold up. The pros are hard for me to judge; I'm not sure how hard it would be to adapt the code so this doesn't happen. And like @piskvorky, I'm also interested in where the determination of "unused" subword vectors is actually made, since there's no training corpus in this setting. Are there currently plans to change this?
|
Ah, if it's just untrained garbage, we should definitely discard that. If someone relies on exact reproducibility of random/untrained vectors, their app is broken anyway. That's not the kind of compatibility we care about. @manneshiva I assume the determination of used/unused is straightforward? Or is there any guessing/heuristics involved, any room for error?
I would think that if some other FastText implementation (like the original) saves out 'untrained garbage' in its serialized model, and loads-and-uses such noise in its subsequent OOV calculations after re-loading, so that it affects (reproducible) evaluations of the frozen model, then we're not really format compatible if we make the independent decision to discard that noise. And, we'll get a continuing tail of "what's up with this?" questions/bug-reports from that decision to be arbitrarily different in how we load a (frozen, completed, original-tool) model. |
Along with the reproducibility issues, another point of discussion was the memory/speed trade-off involved during loading. Quoting from a previous comment:

> For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is (for 300-d vectors). This is at the cost of significantly increased loading time. Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times (as @gojomo rightly pointed out). In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones.

For this, Shiva proposed this solution here, which IMO gives us the best of both worlds, while also providing exactly the same behaviour by default as the original FastText, which should reduce the follow-up questions/bug reports we get about this. The solution as @manneshiva described it:

> I feel we should give the user an option -- `discard_unused_ngrams` to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to a fewer number of total used ngrams), this should be helpful for a user trying to load a small vocab model with limited RAM.

@manneshiva @piskvorky @gojomo @menshikh-iv do you see any potential issues with this approach? If not, I think we should go ahead with it. @gojomo has raised a valid concern that, in theory, our heuristic for determining unused vectors could fail if the serialized model has had its vocabulary trimmed after training and before serializing. However, FastText doesn't do anything of the sort (and neither do any other models in my experience), so I think it's a reasonable assumption to make. A note/info-level log in the code could be useful.
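The proposal above can be sketched in a few lines. This is a hypothetical illustration, not gensim's actual loading code: the function names (`used_buckets`, `load_ngram_matrix`) and the `ngrams_fn`/`hash_fn` parameters are made up for the example, but the flag semantics follow the quoted proposal, with the default preserving every bucket exactly as the original FastText does.

```python
import numpy as np

def used_buckets(vocab, ngrams_fn, hash_fn):
    # A bucket is "used" iff some in-vocabulary word hashes one of its
    # ngrams into it; all other buckets only matter for OOV lookups.
    return {hash_fn(g) for word in vocab for g in ngrams_fn(word)}

def load_ngram_matrix(full_matrix, vocab, ngrams_fn, hash_fn,
                      discard_unused_ngrams=False):
    # Default (False): keep every bucket, matching the original FastText,
    # so OOV vectors stay bit-for-bit reproducible.
    if not discard_unused_ngrams:
        return full_matrix
    # Opt-in: zero out rows that no vocabulary word touches, so they can
    # be dropped/stored sparsely to save memory on small-vocab models.
    keep = sorted(used_buckets(vocab, ngrams_fn, hash_fn))
    trimmed = np.zeros_like(full_matrix)
    trimmed[keep] = full_matrix[keep]
    return trimmed
```

The design choice here is that memory saving becomes an explicit, documented opt-in rather than a silent default, which addresses both the reproducibility complaint and the small-vocab RAM use case.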
I agree with #1261 (comment) and @jayantj (looks like the best variant). We also definitely need to update the documentation here (to avoid user confusion).
@manneshiva @gojomo Hi, thanks for your answers. I am encountering a similar problem to the author's. I am confused about where these 'never-encountered (and thus never-trained) n-grams' come from. Is a random vector initialized for each bucket at the beginning? Thanks!
@Alice-Ke yes, all ngram vectors are initialized randomly before training. The gensim implementation throws away the ones that are not used by any of the known words. However, if you then look up the vector for a previously unknown word which contains some new ngram, gensim will return wrong results, as it threw away the respective ngram vector. A further difference likely comes from possibly wrong handling of Unicode characters: #2059
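A toy model makes the divergence easy to see. Everything here is illustrative (the bucket count, dimensionality, the simplified hash, and the `vector` helper are all made up for the demo, not gensim's real code): with the full bucket table, in-vocabulary and OOV lookups both work; after gensim-style trimming, in-vocabulary words are unchanged, but an OOV word whose ngrams hash into a discarded bucket gets a different vector.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BUCKETS, DIM = 100, 4

def bucket(gram):
    # Toy stand-in for FastText's ngram hash (the real one is FNV-1a based).
    return sum(map(ord, gram)) % NUM_BUCKETS

def trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

def vector(word, table):
    # OOV vector = mean of the bucket vectors of the word's char ngrams;
    # ngrams whose bucket was discarded simply contribute nothing.
    rows = [table[bucket(g)] for g in trigrams(word) if bucket(g) in table]
    return np.mean(rows, axis=0) if rows else np.zeros(DIM)

# "Original FastText": every bucket kept, including never-trained ones.
full = {b: rng.normal(size=DIM) for b in range(NUM_BUCKETS)}
# "gensim": only buckets touched by the known vocabulary survive loading.
vocab_buckets = {bucket(g) for w in ["hello"] for g in trigrams(w)}
trimmed = {b: v for b, v in full.items() if b in vocab_buckets}
```

With these tables, `vector("hello", full)` and `vector("hello", trimmed)` agree, while a word like `"zzz"` (whose only trigram hashes to a discarded bucket) differs between the two, which is exactly the "hello" vs "someundefinedword" behaviour reported in this issue.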
Intro
As a person mentioned on the mailing list, he gets different results from a pre-trained model with the gensim code and the original Facebook code.
How to reproduce
```shell
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
```
bin+text (link)

Expected Results
Vectors for "hello" and "someundefinedword" are exactly the same (from gensim & Facebook)
Actual result
Exactly the same vectors for "hello", but different vectors for "someundefinedword"
CC: @manneshiva