Track training loss while using Doc2Vec #2983
Comments
Loss-tallying has never yet been implemented for Gensim's Doc2Vec. So, there are not yet reliable hooks for early stopping in any of the *2Vec models.
Then how do I choose the best model? Should I just blindly train the model for many epochs, rather than the standard 20 epochs / 5 iterations? Will that give any better results? Do you happen to know of *2Vec models in libraries other than Gensim that can do this?
The internal loss can't tell you which model is best for a downstream purpose, only that the model isn't benefitting, on its internal goals, from further training. (A model settling at a lower internal loss may be worse, for some outside purpose, than one settling at a higher internal loss.) So a lot of trial and error, though perhaps assisted with automated parameter search, is involved in picking the best model.

(When you say "standard 20 epochs 5 iterations", I suspect you might be making a common training mistake, since those usually shouldn't be separate values. But your code excerpt doesn't show your call(s) to train().)

I don't know of any library offering loss-reporting from a Doc2Vec implementation.
Is there a solution here? I am getting a training loss of 0 for every epoch. After the first epoch the results are pretty nice, but after the second they are terrible. Yet it's a black box and I have no ability to monitor the loss. Thoughts? Is there another implementation of Word2Vec outside of gensim?
@griff4692 - There are other word2vec options, but I'm not familiar with an alternate Python implementation of the "Paragraph Vectors" algorithm (aka Doc2Vec). If results after one epoch are good, but after more epochs are bad, there are probably other serious errors in your code which would need to be reviewed to be discovered. (That is: improvising an early stop via loss-monitoring is probably the wrong fix.) See for example this SO answer about some really-misguided code that's unfortunately very common in oft-mimicked low-quality online examples. Real improvement to the loss-tracking in Gensim's *2Vec models is still pending.
Problem description
I am trying to track training loss using the doc2vec algorithm, and it failed. Is there a way to track training loss in doc2vec?
Also, I didn't find any documentation related to performing early stopping during the doc2vec training phase.
The similarity score varies a lot based on epochs, and I want to stop training, via callbacks, when the model has reached optimal capacity. I have used Keras, which has an EarlyStopping feature, but I'm not sure how to do this with gensim models.
Any response is appreciated. Thank you!
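For context on the early-stopping request: Gensim has no built-in equivalent of Keras's EarlyStopping, but the same patience logic is easy to drive from any external per-epoch score (e.g. a similarity metric computed between epochs). A minimal sketch, with all names and the score sequence invented for illustration:

```python
class EarlyStopper:
    """Keras-style patience logic driven by an external validation score
    (higher is better). Illustrative only -- not a Gensim API."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
scores = [0.50, 0.62, 0.61, 0.60, 0.59]  # made-up per-epoch scores
stopped_at = None
for epoch, s in enumerate(scores):
    if stopper.update(s):
        stopped_at = epoch  # stops once 2 epochs fail to beat the best
        break
```

In practice the `update()` call would sit inside a per-epoch callback or between separate evaluation passes, feeding in whatever external metric matters for your task.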
Steps/code/corpus to reproduce
Versions
Please provide the output of: