
Log epoch real time in LNNP #231

Open
wants to merge 6 commits into base: main
Conversation

RaulPPelaez
Collaborator

This PR changes the LNNP module so that the wall-clock time since the start of training is logged each epoch.
This makes it possible to track training time with the CSVLogger.
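The timing logic can be sketched roughly like this (a minimal, hypothetical standalone version; in the actual PR this lives inside the LNNP LightningModule and the elapsed value is passed to `self.log` each epoch):

```python
import time

class EpochTimer:
    """Sketch of the per-epoch wall-clock timing idea (hypothetical
    standalone class, not the PR's actual code)."""

    def __init__(self):
        # Record the wall-clock start once, when training begins.
        self._start = time.perf_counter()

    def elapsed(self):
        # Real time in seconds since training started; logging this every
        # epoch yields a monotonically increasing "time" column in
        # metrics.csv.
        return time.perf_counter() - self._start
```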

@RaulPPelaez
Collaborator Author

@AntonioMirarchi please review

@AntonioMirarchi
Contributor

It works fine. This is the time column from the metrics.csv:

0      4.779309
1      6.532793
2      8.925880
3     10.662247
4     13.062045
5     14.766587
6     17.175591
7     18.913141
8     21.361235
9     23.063034
10          NaN
Name: time, dtype: float64

The NaN in the last row is not clear to me.

@AntonioMirarchi
Contributor

I think the NaN appears because the trainer reaches its maximum number of epochs: it does not start a new epoch, but it still creates a row in the metrics file regardless.

@RaulPPelaez
Collaborator Author

I am only logging the time in `on_validation_epoch_end` and `on_test_epoch_end`. Perhaps there is some other place where data is logged?

@RaulPPelaez
Collaborator Author

Antonio, can you give me a YAML config to reproduce this NaN you are seeing? I cannot trigger it.

@RaulPPelaez
Collaborator Author

I think the NaN Antonio reports comes from the test pass at the end of training. The test pass is treated as a new "training" run, and for some reason its first time entry is NaN.
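The symptom can be reproduced in miniature: when a stage writes a metrics row without a "time" value, the empty cell parses as NaN when the CSV is read back. A hypothetical illustration using only the standard library:

```python
import csv
import io
import math

# Hypothetical metrics.csv fragment: the last row comes from a stage
# that never logged "time", so its cell is empty and parses as NaN.
csv_text = "epoch,time\n0,4.779309\n1,6.532793\n2,\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
times = [float(r["time"]) if r["time"] else math.nan for r in rows]
# times[-1] is NaN, mirroring the last row of the output above.
```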

@RaulPPelaez
Collaborator Author

I tried changing the epoch logging to an integer, but this produces a warning:

 You called `self.log('epoch', ...)` in your `on_validation_epoch_end` but the value needs to be floating to be reduced. Converting it to torch.float32. You can silence this warning by converting the value to floating point yourself. If you don't intend to reduce the value (for instance when logging the global step or epoch) then you can use `self.logger.log_metrics({'epoch': ...})` instead.

There is some (really unconvincing, imo) discussion of why one cannot log an integer: Lightning-AI/pytorch-lightning#18739
The alternative suggested in the warning itself does not work as one would expect: you get a new line in metrics.csv in which every column is empty except epoch.
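The warning can be silenced by doing the float conversion yourself before logging, at the cost of the epoch appearing as a float in metrics.csv. A minimal sketch (the helper name is hypothetical; only the quoted warning confirms the float requirement):

```python
def epoch_for_logging(current_epoch: int) -> float:
    # Lightning reduces logged scalars as floating point, so logging an
    # int triggers the warning quoted above. Converting it ourselves
    # avoids the warning; the value lands in metrics.csv as e.g. "3.0".
    return float(current_epoch)

# Inside the module one would then call something like:
#   self.log("epoch", epoch_for_logging(self.current_epoch))
```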
cc @peastman
