This repository has been archived by the owner on Dec 20, 2024. It is now read-only.
fix: remove saving of metadata for training ckpt #190
Merged
This PR removes the saving of the metadata from the training checkpoint (it is still present in the inference checkpoint). I am not sure whether there is a reason we need it in both checkpoints.
This PR came out of the AIFSv03 training, where we had quite a bit of pain restarting runs because of the zip bug that affects checkpoints over a certain size. We had to use the workaround
zip -d last.ckpt archive/anemoi-metadata/ai-models.json
many times, which this PR would make unnecessary. Very happy to be told we need this metadata in the training checkpoints! But if not, this would avoid quite a bit of pain when resuming runs. 🙏
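For anyone who still needs the workaround on existing checkpoints without the `zip` CLI, here is a minimal Python sketch of the same idea. It assumes the checkpoint is an ordinary zip archive and that the entry name matches the one in the command above; `strip_member` is a hypothetical helper, not part of this repository. Unlike `zip -d`, it writes a new file rather than editing in place, and it has not been checked that the rewritten file loads cleanly for resuming, so keep the original until that is confirmed.

```python
import shutil
import zipfile


def strip_member(src_path, dst_path,
                 member="archive/anemoi-metadata/ai-models.json"):
    """Copy the zip at src_path to dst_path, leaving out `member`.

    Hypothetical helper. Returns the number of entries that were skipped.
    """
    skipped = 0
    with zipfile.ZipFile(src_path, "r") as zin, \
            zipfile.ZipFile(dst_path, "w", allowZip64=True) as zout:
        for info in zin.infolist():
            if info.filename == member:
                skipped += 1
                continue
            # Build a fresh ZipInfo so the source archive's entry is not mutated.
            new_info = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            new_info.compress_type = info.compress_type
            new_info.external_attr = info.external_attr
            # Stream in chunks so large tensors are never held fully in memory;
            # force_zip64 allows entries larger than 2 GiB.
            with zin.open(info, "r") as src, \
                    zout.open(new_info, "w", force_zip64=True) as dst:
                shutil.copyfileobj(src, dst, length=1024 * 1024)
    return skipped
```

Usage would look like `strip_member("last.ckpt", "last.stripped.ckpt")`, followed by swapping the files once the stripped checkpoint is known to resume correctly.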