Getting iterations in `Checkpoint` is wrong for `global_step_transform` (#1148)
@amatsukawa thanks for the report! Let me check and reproduce the first 2 points. About the third point, it was raised here: #847, and it can produce an incoherent list of saved files.
@amatsukawa I'm trying to reproduce the 1st point you mentioned. Let me show you my code and let's discuss where the bug is:

```python
import ignite
print(ignite.__version__)

from unittest.mock import MagicMock

import torch
import torch.nn as nn

from ignite.handlers import DiskSaver, Checkpoint, global_step_from_engine
from ignite.handlers.checkpoint import BaseSaveHandler  # needed for the MagicMock spec below
from ignite.engine import Engine, State, Events

save_handler = MagicMock(spec=BaseSaveHandler)

trainer = Engine(lambda e, b: None)
evaluator = Engine(lambda e, b: None)

acc_list = [0.3, 0.4, 0.5, 0.6, 0.5, 0.55, 0.61, 0.66, 0.7, 0.8]
acc_iter = iter(acc_list)

@evaluator.on(Events.EPOCH_COMPLETED)
def setup_result():
    evaluator.state.metrics["accuracy"] = next(acc_iter)

@trainer.on(Events.EPOCH_COMPLETED)
def run_eval():
    evaluator.run([0, 1, 2])

def score_function(engine):
    return engine.state.metrics['accuracy']

model = nn.Linear(1, 1)
to_save = {'model': model}
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True), n_saved=2,
                     filename_prefix='best', score_function=score_function, score_name="val_acc",
                     global_step_transform=global_step_from_engine(trainer))
evaluator.add_event_handler(Events.COMPLETED, handler)

trainer.run([0, 1, 2], max_epochs=10)
```

This gives:
About the 2nd point: as you know, the priority is defined by either 1) an attribute of the attached event (iteration/epoch), or 2) `score_function`.

The first case, "attribute of attached event", is the training checkpointing use-case. In this case, yes, we could override the priority by the output of `global_step_transform`.

The second case, "score function", is about storing the best model. In most cases, the checkpoint instance is attached to the evaluation engine. The priority is defined by a score, but it is up to the user to define what the "score" is. It can be a metric value, a trainer's epoch/iteration, etc. So, in this sense, the score already defines the priority.

What do you think?
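The two priority rules described above can be pictured with a minimal sketch (my own illustration of the idea, not ignite's actual code): without a `score_function` the priority falls back to the attached engine's event attribute (e.g. its iteration count); with one, the priority is whatever the user-defined score returns.

```python
# Sketch of the two priority rules (illustrative, not ignite's implementation).
def checkpoint_priority(state, score_function=None):
    if score_function is not None:
        return score_function(state)   # case 2: "score function" defines priority
    return state["iteration"]          # case 1: "attribute of attached event"

# Training use-case: priority follows the attached engine's iteration.
print(checkpoint_priority({"iteration": 500}))  # 500

# Best-model use-case: priority follows the user-defined score.
print(checkpoint_priority({"metrics": {"accuracy": 0.7}},
                          score_function=lambda s: s["metrics"]["accuracy"]))  # 0.7
```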
Hi @vfdev-5, thanks for looking into this.

Re: the second point, my use case is to attach a checkpointer, without a score function, onto the valid engine. The motivation is #966: to guarantee the checkpointer runs last. I suppose I could pass a score function that actually returns the trainer's iteration, but IMO it seems natural for the global step to serve this function; it's a little strange that it is used for filenames but not for priority.

Re: the first point, perhaps I am wrong and I just wasn't seeing the iteration count change because of the point above. But isn't…

EDIT: ah, it's because of this:

ignite/ignite/engine/events.py, line 307 in de4c80f

It works if I run validation on `ITERATION_COMPLETED(every=N)` and hook the checkpoint onto validation's `COMPLETED`, I think?

EDIT2: perhaps that's expected behavior, given your definition of the global step.
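The lookup mentioned in the EDIT above can be pictured like this. This is a hedged sketch of the idea behind the event-to-attribute mapping in `events.py` (an illustrative dict, not the actual source): each event name maps to the `engine.state` attribute used as the global step, and the terminal `COMPLETED` event maps to `epoch`, not `iteration`.

```python
# Illustrative mapping from fired event name to the state attribute that
# global_step_from_engine reads (sketch; see ignite's events.py for the real one).
event_to_attr = {
    "ITERATION_STARTED": "iteration",
    "ITERATION_COMPLETED": "iteration",
    "EPOCH_STARTED": "epoch",
    "EPOCH_COMPLETED": "epoch",
    "STARTED": "epoch",
    "COMPLETED": "epoch",
}

def get_event_attrib_value(state, event_name):
    # Read the state attribute that corresponds to the fired event.
    return state[event_to_attr[event_name]]

state = {"iteration": 120, "epoch": 4}
print(get_event_attrib_value(state, "COMPLETED"))            # 4
print(get_event_attrib_value(state, "ITERATION_COMPLETED"))  # 120
```

This is why a handler fired on `COMPLETED` reports epochs even when the user cares about iterations.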
Yes, the mapping is done via:

ignite/ignite/engine/events.py, lines 299 to 308 in de4c80f

Yes, we need to improve the docs and clarify our terms: how and where to use each argument…

Well, in the following code:

```python
global_step = self.global_step_transform(engine, engine.last_event_name)
```

we pass the current engine (the evaluator) and the event when it was called (e.g. `Events.COMPLETED`).

Normally, in master and 0.4.0, there is no more difference between… EDIT: …

If we could work the use-case out a little better, maybe we can improve the API without the current ambiguity around `global_step_transform`.
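Given the call signature quoted above, `global_step_transform(engine, engine.last_event_name)`, one workaround is a plain callable that ignores both arguments and always reads the trainer's iteration. A runnable sketch below uses hypothetical `_Engine`/`_State` stand-ins in place of the real `ignite.engine.Engine`, just to show the shape:

```python
# Stand-ins for ignite's Engine/State, only to make this sketch self-contained.
class _State:
    def __init__(self, iteration):
        self.iteration = iteration

class _Engine:
    def __init__(self, iteration):
        self.state = _State(iteration)

trainer = _Engine(iteration=120)
evaluator = _Engine(iteration=3)

# Custom transform: ignore the engine the handler is attached to (the
# evaluator) and the fired event name; always report the trainer's iteration.
def trainer_iteration(engine, event_name):
    return trainer.state.iteration

print(trainer_iteration(evaluator, "COMPLETED"))  # 120
```

Since `Checkpoint` only requires a callable here, such a function could be passed as `global_step_transform` directly instead of `global_step_from_engine`.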
I want to: …

Naively, I might just change the event on `run_eval`:

```python
import ignite
print(ignite.__version__)

import os
import shutil
from unittest.mock import MagicMock

import torch
import torch.nn as nn

from ignite.handlers import DiskSaver, Checkpoint, global_step_from_engine
from ignite.handlers.checkpoint import BaseSaveHandler
from ignite.engine import Engine, State, Events

save_handler = MagicMock(spec=BaseSaveHandler)

trainer = Engine(lambda e, b: None)
evaluator = Engine(lambda e, b: None)

@evaluator.on(Events.EPOCH_COMPLETED)
def setup_result():
    evaluator.state.metrics["accuracy"] = 0.

@trainer.on(Events.ITERATION_COMPLETED(every=2))
def run_eval():
    evaluator.run([0, 1, 2])

def score_function(engine):
    return engine.state.metrics['accuracy']

shutil.rmtree('/tmp/models', ignore_errors=True)

model = nn.Linear(1, 1)
to_save = {'trainer': trainer, 'model': model}
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer))
evaluator.add_event_handler(Events.COMPLETED, handler)

trainer.run([0, 1, 2], max_epochs=10)

print(os.listdir('/tmp/models'))
with open(os.path.join('/tmp/models', handler.last_checkpoint), "rb") as f:
    ckpt = torch.load(f)
print(ckpt["trainer"])
```

The above example results in this, i.e. checkpoints are dropped due to point #2 of my first post.
After our discussion, it seems I need to do this for the handler instead at the moment:

```python
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     score_name="iteration",
                     score_function=lambda _: trainer.state.iteration,
                     global_step_transform=global_step_from_engine(trainer))
```

This gives:
Note the … I can further change this to get the correct checkpoints and checkpoint names:

```python
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     score_name="iteration",
                     score_function=lambda _: trainer.state.iteration,
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))
```

IMHO, what should happen is that:
Thanks for the explicit example @amatsukawa! So, ideally we would like the following code:

```python
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))
evaluator.add_event_handler(Events.COMPLETED, handler)
```

to produce the output:
Yes, you are right about this. And if you are using Tensorboard, the global step notion is even more relaxed :)
Thanks for your attention on this!

Yup, that would be ideal. I think this would require that…

Perhaps something like: …
We'll try to work on it; I put this issue into the 0.4.1 project kanban for now. @amatsukawa, if you would like to contribute to the project and send an initial PR that we could work out further, please do not hesitate :)
There seem to be two bugs in `Checkpoint` related to `global_step_transform` when it's attached to the valid rather than the train engine.

First, `global_step_transform` does a lookup based on the event fired. This causes issues when the handler is not attached to an `{EPOCH/ITERATION}_COMPLETED` event, e.g. when it's attached to `COMPLETED` on the valid engine as the docs suggest.

Second, `global_step_transform` is intended to give the "true" count (iteration, epoch, whatever it may be). As such, it should not only be used in the filename, but also as the `priority`. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the valid engine.

A third point, which isn't really a bug but more a usability issue: `Checkpoint` silently drops checkpoints if it has checkpointed the same filename before. I think such occurrences are likely user error (or in my case, framework error, since my valid engine's iteration count is always the same at `COMPLETED`). Perhaps a warning log is warranted. Alternatively, if the checkpoint is truly the same, writing it again is idempotent, so perhaps this check should be removed entirely.
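The third point (silent dropping on a filename collision) can be pictured with a toy sketch. This is my own illustration of the described behaviour, not ignite's code, and the filename is hypothetical:

```python
# Toy model of the described behaviour: a checkpoint whose filename has
# already been saved is skipped with no warning.
saved = []

def save_checkpoint(filename):
    if filename in saved:
        return False   # silently dropped today; the post proposes a warning here
    saved.append(filename)
    return True

print(save_checkpoint("best_model_3_val_acc=0.5.pt"))  # True
print(save_checkpoint("best_model_3_val_acc=0.5.pt"))  # False (dropped)
```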