
Getting iterations in Checkpoint is wrong for global_step_transform #1148

Closed
amatsukawa opened this issue Jun 22, 2020 · 9 comments · Fixed by #1155

Comments

amatsukawa (Contributor) commented Jun 22, 2020

There seem to be two bugs in Checkpoint related to global_step_transform when it is attached to the valid engine rather than the train engine.

First, global_step_transform does a lookup based on the event fired. This causes issues when the handler is not attached to an {EPOCH/ITERATION}_COMPLETED event, e.g. when it is attached to COMPLETED on the valid engine, as the docs suggest.

Second, global_step_transform is intended to give the "true" count (iteration, epoch, whatever it may be). As such, it should not only be used in the filename, but also as the priority. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the valid engine.

A third point, which isn't really a bug but more of a usability issue: Checkpoint silently drops checkpoints if it has checkpointed the same filename before. I think such occurrences are likely user error (or in my case, framework error, since the iteration count of my valid engine is always the same at COMPLETED). Perhaps a warning log is warranted. Alternatively, if the checkpoint is truly the same, writing it again is idempotent, so perhaps this check should be removed entirely.
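To make the first two points concrete, here is a rough sketch of the lookup I mean (paraphrased from my reading of global_step_from_engine, not the actual library source; the two engines are placeholders):

from ignite.engine import Engine, Events
from ignite.handlers import global_step_from_engine

trainer = Engine(lambda e, b: None)    # placeholder engines, for illustration only
evaluator = Engine(lambda e, b: None)

step_fn = global_step_from_engine(trainer)

# When a Checkpoint attached to the evaluator fires, something along these lines happens:
#   global_step = step_fn(evaluator, evaluator.last_event_name)
# The step that ends up in the filename is whatever attribute the fired event maps to
# (not necessarily the trainer's iteration count), while the priority is taken from the
# evaluator's own counters.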

vfdev-5 (Collaborator) commented Jun 22, 2020

@amatsukawa thanks for the report! Let me check and reproduce the first 2 points.

About the third point, it was raised in #847 and can lead to an inconsistent list of saved files inside Checkpoint._saved.

vfdev-5 (Collaborator) commented Jun 22, 2020

@amatsukawa I'm trying to reproduce the 1st point you mentioned.
I followed the Checkpoint docs and attached the handler to COMPLETED: evaluator.add_event_handler(Events.COMPLETED, handler), and it works as expected, I'd say.

Let me show you my code and let's discuss where the bug is:

import ignite
print(ignite.__version__)

from unittest.mock import MagicMock

import torch
import torch.nn as nn

from ignite.handlers import DiskSaver, Checkpoint, global_step_from_engine
from ignite.handlers.checkpoint import BaseSaveHandler  # needed for the MagicMock spec below
from ignite.engine import Engine, State, Events


save_handler = MagicMock(spec=BaseSaveHandler)

trainer = Engine(lambda e, b: None)
evaluator = Engine(lambda e, b: None)

acc_list = [0.3, 0.4, 0.5, 0.6, 0.5, 0.55, 0.61, 0.66, 0.7, 0.8]
acc_iter = iter(acc_list)


@evaluator.on(Events.EPOCH_COMPLETED)
def setup_result():
    evaluator.state.metrics["accuracy"] = next(acc_iter)


@trainer.on(Events.EPOCH_COMPLETED)
def run_eval():
    evaluator.run([0, 1, 2])


def score_function(engine):
    return engine.state.metrics['accuracy']


model = nn.Linear(1, 1)
to_save = {'model': model}
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True), n_saved=2,
                     filename_prefix='best', score_function=score_function, score_name="val_acc",
                     global_step_transform=global_step_from_engine(trainer))

evaluator.add_event_handler(Events.COMPLETED, handler)

trainer.run([0, 1, 2], max_epochs=10)

This gives:

ls /tmp/models
> 'best_model_10_val_acc=0.8000.pt'  'best_model_9_val_acc=0.7000.pt'

vfdev-5 (Collaborator) commented Jun 22, 2020

About the 2nd point, whether global_step_transform should define the priority: I was thinking about that recently, and I'd say Checkpoint's API docs are not clear enough and should be reworked, IMO.

As you know, the priority is defined either 1) by the attribute of the attached event (Events.ITERATION_COMPLETED -> engine.state.iteration) or 2) by score_function.

First case "attribute of attached event" is training checkpointing use-case. In this case, yes, we can override the priority by the output of global_step_transform. For instance, it is engine's iteration. Overriding in some sense is redundant, at least for me.

Second case "score function" is about to store the best model. In most of the cases, checkpoint instance is attached to evaluation engine. The priority is defined by a score, but it is up to user to define what is the "score". It can be a metric value or a trainer's epoch/iteration etc. So, in this sense, global_step_transform just plays a role of adaptor to write correct trainer's progress number in the filename...

What do you think?

amatsukawa (Contributor, Author) commented Jun 22, 2020

Hi @vfdev-5, thanks for looking into this.

Re: the second point, my use case is to attach a checkpointer, without a score function, to the valid engine. The motivation is #966: to guarantee the checkpointer runs last. I suppose I could pass a score function that actually returns the trainer's iteration, but IMO it seems natural for global_step to serve this function; it's a little strange that it is used for the filename but not for the priority.

Re: the first point, perhaps I am wrong, and I just wasn't seeing the iteration count change because of the point above. But isn't engine.last_event_name here COMPLETED? So how does the lookup for COMPLETED on the train engine here return the epoch count?

EDIT: ah, it's because COMPLETED maps to "epoch":

Events.COMPLETED: "epoch",

This doesn't work if I run evaluation on ITERATION_COMPLETED(every=N) and hook the checkpoint onto validation's COMPLETED, I think?

EDIT2: perhaps that's expected behavior and your definition of global_step. Maybe what I'm supposed to do is use the custom_event_name for my example, in which case, sorry for the red herring. However, I think it could be clarified in the docs what exactly this returns, especially given the wide use of this term in TensorFlow to mean the iteration count.

vfdev-5 (Collaborator) commented Jun 22, 2020

So how does the lookup for COMPLETED on the train engine here return the epoch count?

Yes, the mapping is done via

event_to_attr = {
    Events.GET_BATCH_STARTED: "iteration",
    Events.GET_BATCH_COMPLETED: "iteration",
    Events.ITERATION_STARTED: "iteration",
    Events.ITERATION_COMPLETED: "iteration",
    Events.EPOCH_STARTED: "epoch",
    Events.EPOCH_COMPLETED: "epoch",
    Events.STARTED: "epoch",
    Events.COMPLETED: "epoch",
}
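For illustration, this is how that mapping is consulted at runtime via state.get_event_attrib_value (a minimal sketch, assuming the helper behaves as described above):

from ignite.engine import Engine, Events

engine = Engine(lambda e, b: None)
engine.run([0, 1, 2], max_epochs=2)

# COMPLETED maps to "epoch" and ITERATION_COMPLETED maps to "iteration",
# so for this run the two lookups return 2 and 6 respectively.
print(engine.state.get_event_attrib_value(Events.COMPLETED))            # engine.state.epoch
print(engine.state.get_event_attrib_value(Events.ITERATION_COMPLETED))  # engine.state.iteration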

EDIT2: perhaps that's expected behavior and your definition of global_step, in which case, sorry for the red herring. However, I think it could be clarified in the docs what exactly this returns, especially given the wide use of this term in TensorFlow to mean the iteration count.

Yes, we need to improve the docs and clarify our terms, how and where to use each argument...

Re: the first point, perhaps I am wrong, and I just wasn't seeing the iteration count change because of the point above. But isn't engine.last_event_name here COMPLETED? So how does the lookup for COMPLETED on the train engine here return the epoch count?

Well, in the following code

global_step = self.global_step_transform(engine, engine.last_event_name)

we pass the current engine (the evaluator) and the event under which it was called (e.g. COMPLETED) as arguments, for context. However, the user can return anything, unrelated to those args. For example, take a look at global_step_from_engine.
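For example, a user-defined global_step_transform that ignores the fired event and always returns the trainer's iteration (what TF users would call the global step) could look like this (a sketch only, not an existing ignite helper):

import torch.nn as nn

from ignite.engine import Engine
from ignite.handlers import Checkpoint, DiskSaver


def global_step_from_trainer_iteration(trainer):
    # Hypothetical helper: unlike global_step_from_engine, it does not consult
    # event_to_attr at all and always reports the trainer's iteration count.
    def wrapper(engine, event_name):
        return trainer.state.iteration
    return wrapper


trainer = Engine(lambda e, b: None)  # placeholder, as in the snippets above
handler = Checkpoint({'model': nn.Linear(1, 1)},
                     DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_trainer_iteration(trainer))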

This doesn't work if I run evaluation on ITERATION_COMPLETED(every=N) and hook the checkpoint onto validation's COMPLETED, I think?

Normally, in master and 0.4.0, there is no longer any difference between ITERATION_COMPLETED and ITERATION_COMPLETED(every=N)...

EDIT:

Re: second point, my use case is to attach a checkpointer, without a score function, onto the valid engine. The motivation is #966, to guarantee the checkpointer runs last. I suppose I could pass a score function that actually returns the trainer's iteration, but IMO it seems natural for global_step to serve this function; it's a little strange that it does that for filenames but not for priority.

If we could work out the use case a little better, maybe we can improve the API and remove the current ambiguity between score_function and global_step_transform. I'd be happy to discuss this more. Could you please give a small snippet of what you are trying to achieve and what kind of checkpoint files you would finally like to obtain? Thanks

amatsukawa (Contributor, Author) commented Jun 22, 2020

I want to:

  • Attach to valid engine without a score fn.
  • Save every N iterations, rather than epochs.

Naively, I might just change the event on run_eval:

import ignite
print(ignite.__version__)

import os
import shutil
from unittest.mock import MagicMock

import torch
import torch.nn as nn

from ignite.handlers import DiskSaver, Checkpoint, global_step_from_engine
from ignite.handlers.checkpoint import BaseSaveHandler
from ignite.engine import Engine, State, Events


save_handler = MagicMock(spec=BaseSaveHandler)

trainer = Engine(lambda e, b: None)
evaluator = Engine(lambda e, b: None)


@evaluator.on(Events.EPOCH_COMPLETED)
def setup_result():
    evaluator.state.metrics["accuracy"] = 0.


@trainer.on(Events.ITERATION_COMPLETED(every=2))
def run_eval():
    evaluator.run([0, 1, 2])


def score_function(engine):
    return engine.state.metrics['accuracy']

shutil.rmtree('/tmp/models', ignore_errors=True)  # clear results from previous runs

model = nn.Linear(1, 1)
to_save = {'trainer': trainer, 'model': model}
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer))

evaluator.add_event_handler(Events.COMPLETED, handler)

trainer.run([0, 1, 2], max_epochs=10)

print(os.listdir('/tmp/models'))

with open(os.path.join('/tmp/models', handler.last_checkpoint), "rb") as f:
    ckpt = torch.load(f)
    print(ckpt["trainer"])

The above example results in this, i.e. checkpoints are dropped due to point #2 of my first post.

0.4rc.0.post1
['ckpt_checkpoint_1.pt']
OrderedDict([('epoch_length', 3), ('max_epochs', 10), ('iteration', 2)])

After our discussion, it seems that, at the moment, I need to do this for the handler instead:

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     score_name="iteration", 
                     score_function=lambda _: trainer.state.iteration, 
                     global_step_transform=global_step_from_engine(trainer))

This gives:

['ckpt_checkpoint_10_iteration=30.pt']
OrderedDict([('epoch_length', 3), ('max_epochs', 10), ('iteration', 30)])

Note the checkpoint_10 in the filename, which is the epoch count. This relates to the discussion above around COMPLETED and the epoch count.

I can further change this to get the correct checkpoints and checkpoint names:

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     score_name="iteration",
                     score_function=lambda _: trainer.state.iteration,
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))

IMHO what should happen is that:

  • I should not have to pass score_name and score_function above; the global_step should be used instead. This can be solved by using the global step as the priority (see the sketch below).
  • What exactly global_step_from_engine does should be clarified, i.e. that it, in effect, swaps out the engine in question but is still event dependent. From the name, I assumed it would just give the iteration count. I may have had this misconception because TF calls the iteration count global_step.
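Schematically, the priority rule the first bullet implies would be something like this (pseudo-logic for the proposal, not the current Checkpoint implementation):

def resolve_priority(engine, score_function=None, global_step_transform=None):
    # Hypothetical selection order if the proposal were adopted (illustration only):
    if score_function is not None:
        return score_function(engine)
    if global_step_transform is not None:
        return global_step_transform(engine, engine.last_event_name)
    return engine.state.get_event_attrib_value(engine.last_event_name)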

vfdev-5 (Collaborator) commented Jun 22, 2020

Thanks for the explicit example @amatsukawa!

So, ideally we would like to have the following code

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))

evaluator.add_event_handler(Events.COMPLETED, handler)

produce the output

[ ckpt_checkpoint_30, ]

I may have had this misconception because TF calls the iteration count global_step.

Yes, you are right about this. And if you are using TensorBoard, the global step notion is more relaxed :)

amatsukawa (Contributor, Author) commented Jun 22, 2020

Thanks for the explicit example @amatsukawa!

Thanks for your attention on this!

So, ideally we would like to have the following code

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))

evaluator.add_event_handler(Events.COMPLETED, handler)

produce the output

[ ckpt_checkpoint_30, ]

Yup, that would be ideal. I think this would require that global_step be used as the priority when it exists.

I may have had this misconception because TF calls the iteration count global_step.

Yes, you are right about this. And if you are using TensorBoard, the global step notion is more relaxed :)

Perhaps something like iteration_from_engine, epoch_from_engine, or event_attr_from_engine would be clearer :)
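For instance, the first of those could be as simple as the following (hypothetical name, just to illustrate the suggestion):

def iteration_from_engine(engine):
    # Hypothetical helper mirroring global_step_from_engine, but always returning
    # the given engine's iteration count regardless of which event fired.
    def wrapper(_, event_name):
        return engine.state.iteration
    return wrapper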

vfdev-5 (Collaborator) commented Jun 22, 2020

We'll try to work on it. I put this issue into the 0.4.1 project kanban for now.

@amatsukawa, let me just say that if you would like to contribute to the project and send an initial PR that we could work out further, please do not hesitate :)
