🚀 Feature

TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. It includes many optimizations to control memory usage and to speed up checkpoint writing for DDP-style workloads compared to torch.save/torch.load. For more information, please check out the readme: https://github.com/pytorch/torchsnapshot#why-torchsnapshot

This could be a nice addition to Ignite, similar to the existing Checkpoint handler.
ananthsub changed the title from "Support TorchSnapshot for efficient checkpoint saving and loading" to "Support for TorchSnapshot for efficient checkpoint saving and loading" on Oct 24, 2022.
@ananthsub thanks for suggesting this feature! Let us get a bit familiar with TorchSnapshot and see how it can be integrated into Ignite.

A question I have about the usage: in DDP, should Snapshot.take be called by all ranks? And how about the path specified in the argument, where should it be, node 0, rank 0?
> In DDP, should Snapshot.take be called by all ranks?
Yes, Snapshot.take should always be called on all ranks in a distributed setting. It acts as a collective.
> How about the path specified in the argument, where should it be, node 0, rank 0?
The path specified should be a directory, and it should be the same across all ranks. In a multi-node setting, this assumes you have a storage system visible to all nodes (e.g. a cloud object store).
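To make the two answers above concrete, here is a minimal sketch of how a DDP training script might take a snapshot. `Snapshot.take(path=..., app_state=...)` follows the usage shown in the TorchSnapshot readme; the `snapshot_path` helper and the per-epoch directory layout are made up for illustration.

```python
import os


def snapshot_path(base_dir: str, epoch: int) -> str:
    # Every rank must pass the same directory string; TorchSnapshot writes
    # per-rank files underneath it, so in a multi-node setting base_dir must
    # live on storage visible to all nodes (NFS, a cloud object store, etc.).
    return os.path.join(base_dir, f"epoch-{epoch}")


def take_snapshot(model, optimizer, base_dir: str, epoch: int):
    # Imported lazily so the path helper above works without torchsnapshot
    # installed.
    from torchsnapshot import Snapshot

    app_state = {"model": model, "optimizer": optimizer}
    # Snapshot.take acts as a collective: ALL ranks in the process group must
    # call it, otherwise the job will hang.
    return Snapshot.take(path=snapshot_path(base_dir, epoch), app_state=app_state)
```

Restoring would similarly go through `Snapshot(path=...).restore(app_state)` on all ranks, again pointing every rank at the same shared directory.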
cc @yifuwang