Fixed bug in loss computation for Word2Vec with hierarchical softmax #3397
Conversation
Those are much more realistic loss-over-time curves, so on that basis alone I suspect this is a good fix for that code branch. Thanks! But we'd certainly want to fix anywhere the same issue recurs, & it looks like the analogous CBOW + hierarchical-softmax path has the same pattern. Can you run a similar set of 'before' tests using alternate params? Then, try your fix also at the corresponding spot. (Running at just a smaller number of more-typical settings would be fine.)
I have some good news: it looks like the same problem was also in CBOW + HS.

CBOW + HS before change: [loss plot]
CBOW + HS after change: [loss plot]
SG + NEG: [loss plot]
CBOW + NEG: [loss plot]

I committed the additional change for the CBOW + HS.
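For reference, a minimal sketch of the alternate configurations compared above (CBOW + HS, SG + NEG, CBOW + NEG), assuming gensim's standard `Word2Vec` constructor parameters; the corpus and hyperparameter values here are placeholders, not the actual benchmark setup:

```python
from gensim.models import Word2Vec

# Placeholder corpus; the real tests followed the setup from the linked SO question.
toy_corpus = [["a", "toy", "corpus"], ["replace", "with", "real", "sentences"]]

# sg: 1 = Skip-Gram, 0 = CBOW; hs: 1 = hierarchical softmax; negative > 0 = negative sampling
configs = {
    "CBOW + HS":  dict(sg=0, hs=1, negative=0),
    "SG + NEG":   dict(sg=1, hs=0, negative=5),
    "CBOW + NEG": dict(sg=0, hs=0, negative=5),
}

for name, params in configs.items():
    model = Word2Vec(toy_corpus, vector_size=50, min_count=1, epochs=5,
                     compute_loss=True, **params)
    print(name, model.get_latest_training_loss())
```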
Thank you! Assuming the automated tests-in-progress pass (& on the off chance they fail, it's probably due to something other than your tiny changes), these fixes look good for integration to me.
Many thanks! One small change for code, a giant leap for loss correctness. I do wonder how the …
Looks good to me! Thank you for your work and the well-crafted PR @TalIfargan. |
I found an issue with loss computation for Word2Vec when using Skip-Gram (`sg=1`) with Hierarchical Softmax (`hs=1`).

While training different models and trying to understand their loss, I encountered the same problem described in this SO question: Why does the loss of Word2Vec model trained by gensim at first increase for a few epochs and then decrease? If I had to assign this PR to an open issue, it would be #2617.
I decided not to elaborate too much here because the author of the SO question has given all the information needed to understand and reproduce the results, and I guess @gojomo is already well aware of the details. In addition, I am actually changing only 3 characters in the code, and they only affect a very specific and limited component: the loss calculation when using Word2Vec Skip-Gram (`sg=1`) with Hierarchical Softmax (`hs=1`) and setting `compute_loss=True`.

In short, the loss behaves in a weird and unpredictable way (essentially spiking up near the beginning of training and failing to converge as expected, across different vector sizes), but the word vectors do improve over time according to the different score metrics I have tested. I've plotted the loss per epoch for different vector sizes to show what I mean; a sketch of how such curves can be produced follows below.
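As an illustration, here is a rough sketch of how loss-per-epoch curves like these can be produced with gensim's callback API; the corpus, vector size, and epoch count below are placeholders rather than my exact setup:

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    """Record per-epoch loss from gensim's cumulative running-loss counter."""
    def __init__(self):
        self.previous = 0.0
        self.per_epoch = []

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()
        self.per_epoch.append(cumulative - self.previous)  # loss of this epoch only
        self.previous = cumulative

sentences = [["a", "toy", "corpus"], ["replace", "with", "real", "text"]]  # placeholder
logger = EpochLossLogger()
Word2Vec(
    sentences,
    vector_size=100,   # repeat for several sizes to reproduce the per-size curves
    sg=1,              # Skip-Gram
    hs=1,              # hierarchical softmax
    negative=0,
    min_count=1,
    epochs=20,
    compute_loss=True,
    callbacks=[logger],
)
print(logger.per_epoch)  # with the bug, these values spike early and fail to converge
```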
After some investigation of the code, I found that in the loss calculation there was a redundant `-1` on line 129 of the `word2vec_inner.pyx` file:

https://github.com/RaRe-Technologies/gensim/blob/a435f24fe25e17f473e71af3468660512c2606cb/gensim/models/word2vec_inner.pyx#L127-L133
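To illustrate the effect, here is a pure-Python sketch of the per-node loss term (with my own variable names, not the actual Cython source); the redundant `-1` flips the sign of the argument passed to the log-sigmoid:

```python
import numpy as np

def log_sigmoid(x):
    # plain log(sigma(x)); the Cython code uses a precomputed LOG_TABLE instead
    return -np.log1p(np.exp(-x))

def hs_node_loss(dot, code_bit, buggy=False):
    """Loss contribution of one inner node on the word's Huffman-tree path.

    dot      -- dot product of the input vector with the node's output vector
    code_bit -- the word's Huffman code bit (0 or 1) at this node
    """
    sgn = (-1) ** code_bit                          # code 0 -> +1, code 1 -> -1
    arg = -1 * sgn * dot if buggy else sgn * dot    # the fix drops the "-1 *"
    return -log_sigmoid(arg)                        # -log sigma(arg)
```

With the extra `-1`, a node the model predicts well (large `sgn * dot`) is charged a large loss instead of a small one, so the reported loss grows as training improves, which is consistent with the spiking curves above.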
This `-1` does not align with the loss computation described in word2vec Parameter Learning Explained.
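For reference, the hierarchical-softmax loss in that paper (Rong, 2014) for an output word $w_O$ reached via the tree path $n(w,1), \dots, n(w,L(w))$ takes, up to notation, the form:

```latex
E = -\log p(w_O \mid w_I)
  = -\sum_{j=1}^{L(w)-1} \log \sigma\!\Big( \llbracket n(w,j+1) = \operatorname{ch}(n(w,j)) \rrbracket \cdot {v'_{n(w,j)}}^{\top} h \Big)
```

where $\llbracket x \rrbracket$ is $+1$ if $x$ is true and $-1$ otherwise, i.e. exactly the `sgn` factor in the code, with no extra negation inside the sigmoid.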
After getting rid of this `-1`, I got the expected loss behavior under the exact same conditions when re-running the previous experiments that had produced the problematic results.