
Fixed bug in loss computation for Word2Vec with hierarchical softmax #3397

Merged: 2 commits merged into piskvorky:develop from TalIfargan:word2vec_loss_fix on Nov 3, 2022

Conversation

@TalIfargan (Contributor) commented Oct 31, 2022

I found an issue with loss computation for Word2Vec when using Skip-Gram (sg=1) with Hierarchical Softmax (hs=1).
While training different models and trying to understand their loss, I encountered the same problem described in the Stack Overflow question "Why does the loss of Word2Vec model trained by gensim at first increase for a few epochs and then decrease?". If this PR should be tied to an open issue, I believe the relevant one is #2617.
I decided not to elaborate too much here because the author of the SO question has given all the information needed to understand and reproduce the results, and I assume @gojomo is already well aware of the details. In addition, I am actually changing only 3 characters in the code, and they affect only a very specific and limited component: the loss calculation when using Word2Vec Skip-Gram (sg=1) with Hierarchical Softmax (hs=1) and compute_loss=True.
In short, the loss behaves in a weird and unpredictable way (it spikes somewhere near the beginning of training and does not converge as expected for different vector sizes), even though the word vectors keep improving over time according to the evaluation metrics I tested. I've plotted the loss against epoch for different vector sizes to show what I mean.
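(For reference, the per-epoch loss tracking described here can be sketched roughly as follows; the corpus, vector size, and callback below are illustrative assumptions, not my exact experiment script.)

```python
# Minimal sketch of per-epoch loss tracking with gensim's Word2Vec.
# The corpus, sizes and epoch count are illustrative assumptions only.
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import common_texts  # tiny bundled toy corpus

class EpochLossLogger(CallbackAny2Vec):
    """Turn the cumulative running loss into one value per epoch."""
    def __init__(self):
        self.previous_total = 0.0
        self.per_epoch = []

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # cumulative since training start
        self.per_epoch.append(total - self.previous_total)
        self.previous_total = total

logger = EpochLossLogger()
Word2Vec(
    sentences=common_texts,
    vector_size=128, sg=1, hs=1, negative=0,  # Skip-Gram + Hierarchical Softmax
    min_count=1,                              # keep the toy vocabulary non-empty
    compute_loss=True, epochs=20,
    callbacks=[logger],
)
print(logger.per_epoch)
```

Each entry of `per_epoch` is the delta of `get_latest_training_loss()` across one epoch, which is roughly what the plots below show.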

[plot: loss vs. epoch for several vector sizes, before the fix]

After some investigation of the code, I found a redundant -1 in the loss calculation on line 129 of the word2vec_inner.pyx file:
https://github.com/RaRe-Technologies/gensim/blob/a435f24fe25e17f473e71af3468660512c2606cb/gensim/models/word2vec_inner.pyx#L127-L133
This -1 does not align with the loss computation described in "word2vec Parameter Learning Explained".
After removing this -1, I got the expected loss behavior when re-running the previous experiments under exactly the same conditions.
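(To make the change concrete, here is a rough pure-Python/NumPy rendering of the per-pair hierarchical-softmax loss as given in that paper; the function and variable names are hypothetical, and this is not the actual Cython code.)

```python
# Illustrative NumPy sketch of the hierarchical-softmax loss for one
# (input word, predicted word) pair, following "word2vec Parameter Learning
# Explained". Names are hypothetical; this is not the gensim Cython code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_pair_loss(h, inner_vectors, code):
    """h: input vector; inner_vectors: the syn1 rows along the Huffman path;
    code: the Huffman code bits (0/1) of the predicted word."""
    loss = 0.0
    for v_inner, bit in zip(inner_vectors, code):
        f_dot = np.dot(h, v_inner)
        sgn = (-1.0) ** bit  # +1 for code bit 0, -1 for code bit 1
        # Correct term: -log sigma(sgn * f_dot), the negative log-probability
        # of the branch actually taken. The buggy line effectively computed
        # -log sigma(-1 * sgn * f_dot), i.e. the branch NOT taken.
        loss += -np.log(sigmoid(sgn * f_dot))
    return loss
```

With the -1 removed, each term is the negative log-probability of the branch actually taken along the Huffman path, which is what the paper's formulation implies.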

[plot: loss vs. epoch for several vector sizes, after the fix]

@TalIfargan changed the title from "fixed loss computation for sg, hs" to "bug in loss computation for Word2Vec with skip-gram and hierarchical softmax" on Oct 31, 2022
@gojomo (Collaborator) commented Nov 1, 2022

Those are much more realistic loss-over-time curves, so on that basis alone I suspect this is a good fix for that code path. Thanks!

But we'd certainly want to fix anywhere the same issue recurs. It looks like the cbow_hs path has the same problem, while the sg_neg and cbow_neg paths may never have had a similar problem.

Can you run a similar set of 'before' tests using the alternate parameter sets (sg=0, hs=1, negative=0), (sg=1, hs=0, negative=5), and (sg=0, hs=0, negative=5), just to see whether, in fact, (1) CBOW+HS shows the same initial oddness, and (2) neither SG+NEG nor CBOW+NEG shows the problem?

Then, try your fix also on the w2v_fast_sentence_cbow_hs() path (~line 330), and see if it similarly improves behavior over epochs for CBOW+HS. Then we'd have a 2-line patch, but be sure the easy cases have been fixed.

(Running at just a smaller number of more-typical vector_size values, say {32, 128, 512}, should be enough to see the problem & resolution if you're pressed for running time. And similarly, no more than 20 epochs needed, as long as that shows the general curve.)
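(Roughly, something like the following sweep would do; this is only a sketch that reuses the EpochLossLogger callback idea from the snippet above, and the corpus and sizes are placeholder assumptions.)

```python
# Rough sketch of the requested before/after sweep; this is only a placeholder
# script, reusing the EpochLossLogger callback sketched earlier in the thread.
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # stand-in corpus, assumption only

configs = [
    dict(sg=1, hs=1, negative=0),   # SG + HS   (path already patched)
    dict(sg=0, hs=1, negative=0),   # CBOW + HS (suspected to share the bug)
    dict(sg=1, hs=0, negative=5),   # SG + NEG  (expected to be fine)
    dict(sg=0, hs=0, negative=5),   # CBOW + NEG (expected to be fine)
]

for cfg in configs:
    for size in (32, 128, 512):
        logger = EpochLossLogger()
        Word2Vec(sentences=common_texts, vector_size=size, min_count=1,
                 epochs=20, compute_loss=True, callbacks=[logger], **cfg)
        print(cfg, size, logger.per_epoch)
```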

@TalIfargan (Contributor, Author) commented

I have some good news: it looks like the same problem also existed in w2v_fast_sentence_cbow_hs(), and it was solved by changing the calculation the same way I did for w2v_fast_sentence_sg_hs(). In addition, the loss calculation for SG+NEG and CBOW+NEG seems to be fine as is. Relevant plots:

CBOW + HS before change: [plot: sg_0_hs_0_negative_0]
CBOW + HS after change: [plot: sg_0_hs_1_negative_0]
SG + NEG: [plot: sg_1_hs_0_negative_5]
CBOW + NEG: [plot: sg_0_hs_0_negative_5]

I committed the additional change for CBOW + HS.

@gojomo (Collaborator) commented Nov 2, 2022

Thank you! Assuming automated tests-in-progress pass (& on the off chance they fail, it's probably due to something other than your tiny changes), these fixes look good for integration to me.

@piskvorky (Owner) commented Nov 2, 2022

Many thanks! One small change for code, a giant leap for loss correctness.

I do wonder how the -1 got there though. It definitely looks like a deliberate choice, not a typo.

@TalIfargan changed the title from "bug in loss computation for Word2Vec with skip-gram and hierarchical softmax" to "Fixed bug in loss computation for Word2Vec with hierarchical softmax" on Nov 2, 2022
@mpenkov merged commit c93eb0b into piskvorky:develop on Nov 3, 2022
@mpenkov (Collaborator) commented Nov 3, 2022

Looks good to me! Thank you for your work and the well-crafted PR @TalIfargan.

@TalIfargan deleted the word2vec_loss_fix branch on November 3, 2022, 16:36