Is this a bug in the CBOW code or my misunderstanding? #1873
Comments
No bug. If you've already averaged all the vectors together […] See also the discussion/example at https://groups.google.com/forum/#!searchin/gensim/cbow_mean|sort:date/gensim/BtN7uB1vgpc/tlvkLXqzJwAJ and prior issue #697.
Worth a code comment if it keeps tripping people up? It is rather non-obvious.
Sure, both the Python and Cython lines where this conditional division is done could include an extra clarification that this is intentional, and even an excerpt from, or a link to, these longer explanations.
Yeah, that would be great!
I also think this is wrong, and I'm not convinced by @gojomo's logic. Backprop is basically the chain rule, nothing more. So if `neu1e` is the derivative with respect to `l1`, and `l1` is an average of some weights, then the derivative with respect to each of those weights picks up a factor of 1/(number of words), so I also think it's a bug. If you want to convince me, you have to convince me with an application of the chain rule, not the argument you mentioned, because I guess what you said is also wrong.
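For reference, here is the chain-rule step being invoked above, written out for a hidden layer that is the mean of $n$ context vectors (this only states the derivative; it does not by itself settle which update rule the code should apply):

$$
h = \frac{1}{n}\sum_{i=1}^{n} v_i, \qquad
\frac{\partial h}{\partial v_i} = \frac{1}{n}\,I, \qquad
\frac{\partial L}{\partial v_i} = \frac{1}{n}\,\frac{\partial L}{\partial h}
$$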
@thegrimreaper1992 The current code works well, and matches the behavior (in both code & rough results) of the original word2vec.c implementation from the authors of the original Word2Vec paper. If neither my explanation nor that concordance convinces you, go ahead and try it your way, and see if anything improves.
Thanks for your response @gojomo. In other words, maybe what I'm really trying to understand is why the original code was implemented that way, because by applying the chain rule I'd expect it to be the other way around. Do you get my point here?
Sure, but note it's not just simple backprop, but backprop through this algorithm-specific "input-prep operation", which (in this particular CBOW-average mode) is composed from the average of all context-window vectors. So at this level we've essentially left the standard NN model, from which we have an error at the "input" layer. But that input layer was composed in a certain way from all the other word-vectors, and that error then needs to be meaningfully applied back to the constituent word-vectors. I think my forum post – https://groups.google.com/forum/#!searchin/gensim/cbow_mean%7Csort:date/gensim/BtN7uB1vgpc/tlvkLXqzJwAJ – outlines why this form of error-divvying makes sense for the average-forward, fan-out-backward case. If we were summing all the context words, then yes, the correction would be divided equally over all sum-constituents. But since we're averaging, the same correction goes back to each, because on the next forward-prop it'll again be divided by the number of participating words.
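To make the "average-forward, fan-out-backward" description concrete, here is a stripped-down NumPy sketch of a single CBOW-mean update against one logistic output unit. It is a hypothetical toy, not gensim's actual code; names like `context_vecs` and `output_vec` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Illustrative context-word vectors and one output-layer vector (not gensim's data structures).
context_vecs = rng.normal(size=(3, dim))   # the CBOW "input" word-vectors
output_vec = rng.normal(size=dim)          # one output unit (e.g. a negative-sampling target)
label, alpha = 1.0, 0.025                  # target label and learning rate

# Forward: average the context vectors (cbow_mean=1 behaviour).
h = context_vecs.mean(axis=0)

# One logistic output unit, as in negative sampling.
pred = 1.0 / (1.0 + np.exp(-h @ output_vec))
g = (label - pred) * alpha                  # scaled error at the output

# Error arriving at the averaged "input layer".
neu1e = g * output_vec

# Output-side update.
output_vec += g * h

# Fan the full correction back to every constituent vector (no division by 3):
# on the next forward pass the average divides by 3 again, so the averaged
# input moves by exactly neu1e, which is the behaviour described above.
context_vecs += neu1e
```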
Thanks for your response. Also, the authors haven't mentioned anything about a special form of backprop anywhere in their paper. And more specifically, what I'm saying is backed by Rong's "word2vec Parameter Learning Explained" paper and explained there. Does this make any sense? Thanks
My analogy of using a backprop-like adjustment to correct an average to "10" doesn't involve any sort of hidden layer. It's just mathematical reasoning that a correction-to-an-average of magnitude N requires a full correction-to-each-constituent-value of magnitude *N*, not a correction divided by the number of summands, like N/count.

It's not a special form of backprop; it's the CBOW-with-average algorithm they describe, which uses an NN as part of its process, but not its entire process. The preparation of the "input layer" is either via picking one context word-vector (skip-gram) or an average of context word-vectors (CBOW). After the NN has run its course and corrected the input layer, that correction then has to be applied to the constituent word-vectors. That's trivial in the skip-gram case (apply the whole correction to the one input word), but varies in the CBOW case. In the rarer CBOW-with-summation mode, the error needs to be divided among all input words. In the more common CBOW-with-average mode, the error can be applied in full to all words; it will still have the desired effect on the next forward-prop because of the averaging that occurs.

I trust the source code from the original word2vec authors more than any reasoning in the 3rd-party Rong "word2vec Parameter Learning Explained" paper. Rong could easily have made the same error here that's confused others. I can't make it any more clear. Trust me that this matches the authors' original intent, works well, and (further) is more consistent in its internal operation across multiple modes.
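A tiny numeric illustration of the "correct an average to 10" analogy, with made-up numbers chosen only to contrast the two update rules:

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0])        # constituent values, mean = 4
target = 10.0
correction = target - values.mean()        # +6: the error measured at the average

# Apply the full correction to every constituent (CBOW-mean style fan-out):
full = values + correction
print(full.mean())                         # 10.0 -- the average lands on target

# Divide the correction by the number of constituents (sum-style divvying):
divided = values + correction / len(values)
print(divided.mean())                      # 6.0 -- the average undershoots
```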
This paper suggests that the original implementation, and subsequently Gensim's, is wrong: https://arxiv.org/pdf/2012.15332.pdf
That is interesting, thanks for the link @spinkney! Any appetite for trying this out in Gensim & benchmarking? I'm less interested in what is "right / wrong" and more in what works better.
Yeah, maybe "wrong" is not the right word, as it clearly works, just not as well.
@piskvorky @spinkney The preprint looks interesting, but it makes some dubious claims:
> Gensim does normalize the gradient by the number of averaged input word and n-gram vectors.
That's not what I see in the code: Gensim normalizes the gradient only in the non-default sum mode (when `cbow_mean` is unset), not in the default mean mode.
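To make the point of contention concrete, here is a rough paraphrase of the two conditionals under discussion, based on the code quoted in the issue text below and the comments above. It is not a verbatim excerpt of `word2vec.py`; variable names are simplified and the hidden-to-output training is elided.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(10, 4))    # toy vocabulary of 10 words, dim 4
context_indices = [1, 3, 7]                # indices of the context words
cbow_mean = True                           # gensim's default, cbow_mean=1
neu1e = rng.normal(size=4) * 0.01          # stand-in for the input-layer error

# Forward pass: build the "input layer" from the context word-vectors.
l1 = np.sum(word_vectors[context_indices], axis=0)
if cbow_mean and len(context_indices):
    l1 /= len(context_indices)             # average in the default mean mode

# ... hidden->output training would produce the error neu1e here ...

# Backward fan-out: the division happens only in the (non-default) sum mode,
# which is the conditional the issue title asks about.
if not cbow_mean and len(context_indices):
    neu1e /= len(context_indices)
for i in context_indices:
    word_vectors[i] += neu1e               # full correction to each word in mean mode
```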
I am not yet convinced by the accuracy part of the preprint: due to the issues I discuss above, I suspect the author used the non-default `cbow_mean=0` (sum) mode.

If the speed measurements are to be trusted, then this is a big issue with large corpora (am I done training in a week or in two months?) and one of the reasons why #2905 has been moving at a snail's pace. I have no idea what the cause could be, but since SG is unaffected, it seems that the issue is in the low-level Cython CBOW functions.
I've given the paper & their code a quick look, after seeing it on HN. You can see some of my comments there. I'm not yet convinced their analysis is correct or that their change offers any net benefits. Their own benchmarks show mixed results, with small differences either way. It'd be interesting if @tmikolov had any comments on their interpretation.

I'm not sure about their method of testing Gensim always with an […] Given that they observed that the (seldom-used & I believe even removed from […]) […] Certainly, a […] Looking at their code, it also appears they've completely eliminated the […]

Their use of an 'alias method' for negative sampling looks like a separate, functionally-neutral but performance-enhancing change. It could be a nice separate optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but it remains O(1) (ignoring cache-locality effects) for arbitrarily large vocabularies. (The original […])

(There are likely other still-low-hanging-fruit performance optimizations around some of the negative-sampling, reduced-windows, & frequent-word-downsampling code, such as using a faster RNG or simply reusing a smaller number of random draws repeatedly. E.g.: just permute the values {1..window} once per epoch, then cycle through them for each center-word; it's not as random, but probably good enough. Something similar might work for negative sampling or downsampling.)

Their graph showing Gensim performance gains for more threads in SG but not CBOW doesn't quite match my expectations: I thought they both petered out in gains (& started getting worse) past about a dozen threads, unless using […]
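For readers unfamiliar with the 'alias method' mentioned above, here is a minimal sketch (Vose's variant) using a made-up count array. It is an illustration of the technique, not the paper's or Gensim's code: table construction is O(n) and each draw is O(1), at the cost of holding two extra arrays.

```python
import numpy as np

def build_alias_table(weights):
    """Vose's alias method: O(n) table build, O(1) per draw."""
    p = np.asarray(weights, dtype=np.float64)
    p = p / p.sum()
    n = len(p)
    scaled = p * n
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=np.int64)
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # bucket s keeps scaled[s] of its own mass...
        scaled[l] += scaled[s] - 1.0       # ...and borrows the remainder from l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                # leftovers hold (numerically) full mass
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    i = rng.integers(len(prob))            # uniform bucket choice
    return int(i) if rng.random() < prob[i] else int(alias[i])

# Example with a made-up unigram count array, raised to the 0.75 power as in
# word2vec's negative-sampling noise distribution.
rng = np.random.default_rng(0)
counts = np.array([100.0, 40.0, 10.0, 5.0, 1.0])
prob, alias = build_alias_table(counts ** 0.75)
samples = [alias_draw(prob, alias, rng) for _ in range(10)]
print(samples)
```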
A possibly-updated (later 2021?) version of the "Corrected CBOW..." paper is at https://aclanthology.org/2021.insights-1.1.pdf. After another quick read, I remain unconvinced that this is either a 'correction' or even a 'consistent improvement'. Firstly, as far as I'm concerned, the Mikolov (et al) implementations in […]

As above, & per the tables in the paper, note that their results still show Gensim performing better than their approach on a couple of their evaluations! Their improvements on other evaluations, with the same training data, aren't that big – & (as is sadly rampant in papers) it's not crystal-clear that they've given preexisting implementations (in this case Gensim) the same breadth of meta-parameter optimization as their method. In particular, before claiming their approach is better, they should be sure to search the already-supported parameter options in Gensim for values that may closely approximate the benefits of their changes, in particular: […]
Their speed claims are also clouded by the inclusion of another, independent optimization of the negative sampling which doesn't change the distribution. As per my prior comment, it'd be easy enough to prepare a patch with an optional […]
In `gensim/gensim/models/word2vec.py`, lines 394 and 401: shouldn't this be

`if model.cbow_mean and input_word_indices`

rather than

`if not model.cbow_mean and input_word_indices`

?