I converted my model training from DataParallel to FSDP based on the PyTorch FSDP documentation and the example at https://github.com/pytorch/examples/blob/main/distributed/FSDP. I changed the model wrapping, distributed data sampling, the training loop, and checkpoint saving/loading, but I did not change the loss calculation, which I believe is consistent with the documentation and the example. However, I found that FSDP training requires more memory than DataParallel did. Does anyone know how this could happen? I found thread #633, which mentions updating the PyTorch FSDP documentation on buffers, but I cannot find the updated documentation. Could anyone point me to it, or explain how to properly set up FSDP for a CNN so that memory usage is actually reduced compared to DataParallel? Thank you very much for your help!
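One common cause worth checking (a hypothesis, not a confirmed diagnosis of this issue): if `FSDP` is applied without an `auto_wrap_policy`, the whole model becomes a single FSDP unit, so all parameters are gathered at once during forward and backward, and peak memory can match or exceed DataParallel. A minimal sketch below shows how a size-based policy could split a small CNN into separate shardable units; the model and the 20k-parameter threshold are illustrative, not from the original report.

```python
# Sketch: configuring a size-based auto-wrap policy for FSDP.
# Assumption: without such a policy, the model is wrapped as one flat FSDP
# unit and sharding saves little peak memory. Names below (SmallCNN, the
# 20_000 threshold) are hypothetical examples, not from the issue.
import functools

import torch.nn as nn
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


class SmallCNN(nn.Module):
    """Illustrative CNN with two conv blocks and a linear head."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64, 10)


# Wrap any submodule with at least 20k parameters in its own FSDP unit,
# so only one unit's parameters are materialized at a time.
auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=20_000
)

# In the actual training script, after dist.init_process_group(), the
# wrapping would look roughly like:
#   model = FSDP(SmallCNN().to(rank), auto_wrap_policy=auto_wrap_policy)
```

With this policy the second conv block (about 18.5k parameters) and the rest of the model can land in different FSDP units, which is what lets FSDP free sharded parameters between layers instead of holding the full model resident.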