torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense #8227
Comments
Hi @shrinath-suresh thanks for reporting this. @Borda @awaelchli @tchaton Is this related to (or maybe fixed by) ongoing changes with torch metrics? |
Could be, yes. But one important note about your code: you would probably want to do this:
state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "model.pl")
cc @tchaton |
@awaelchli Still seeing the same problem with PTL 1.3.7 and metrics installed from the fix_metric branch. Furthermore, the system hangs if --gpus is set to a value > 1.
$ pip freeze | grep torch
pytorch-lightning==1.3.7 |
Yes, can reproduce. Your temporary workaround until we fix the issue in torchmetrics/Lightning, and as per my previous comment, better regardless of this issue report:
model.to(trainer.accelerator.root_device)  # temp fix for torchmetrics sync issue
state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "iris.pt")
|
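The pattern in the workaround above (build the state on every rank, write the file on rank 0 only) can be sketched without a GPU. This is a minimal illustration, not Lightning's implementation: `save_on_rank_zero`, `pickle_save` and `fake_state_dict` are hypothetical names, and `pickle_save` is only a stand-in for `torch.save`. The reason for calling the state getter on all ranks is an assumption drawn from the "torchmetrics sync issue" comment: if building the state triggers a cross-process sync, every rank presumably has to take part.

```python
import os
import pickle
import tempfile


def save_on_rank_zero(get_state, path, global_rank, save_fn):
    """Collect state on every rank, but write it to disk on rank 0 only.

    get_state: callable returning the object to save (model.state_dict in the thread)
    save_fn:   callable(obj, path) that writes it (torch.save in the thread)
    Returns True if this rank wrote the file.
    """
    # Every rank calls get_state(). If it involves a collective operation,
    # skipping it on some ranks could leave the others waiting.
    state = get_state()
    if global_rank != 0:
        return False
    save_fn(state, path)
    return True


def pickle_save(obj, path):
    # Hypothetical stand-in for torch.save so the sketch runs without PyTorch.
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


# Simulate two ranks that share one output path.
out_path = os.path.join(tempfile.mkdtemp(), "iris.pt")
calls = []


def fake_state_dict():
    # Hypothetical stand-in for model.state_dict(); records each call.
    calls.append(1)
    return {"fc.weight": [0.1, 0.2]}


results = [
    save_on_rank_zero(fake_state_dict, out_path, rank, pickle_save)
    for rank in (1, 0)
]
```

In a real script the call would look like `save_on_rank_zero(model.state_dict, "iris.pt", trainer.global_rank, torch.save)`, with each DDP process running it once.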
@tchaton we need to find a solution for |
Thanks @awaelchli The workaround works for 1 gpu, but training is still hanging if gpus > 1 |
Yes sorry, forgot to say one has to use PL master or PL 1.3.8 which will be released tomorrow. #8218 |
See Lightning-AI/torchmetrics#334 for the fix |
@awaelchli Same error for gpu > 2 even on the master branch. Training hangs and does not exit. NCCL_DEBUG logs attached.
|
@chauhang I'm realizing now that you are actually not the OP who reported the issue. My statements are all for the original author and I'm using their code and their command to verify that it works with ddp gpus>2. Please open a separate issue with your problem and code. Thank you for your understanding. |
@awaelchli We are working on the same project and using the same code; things are not working with gpus > 2 for DDP. @shrinath-suresh will work with @tchaton on the PyTorch Slack channel to get the issues resolved for our project |
Hey @chauhang @shrinath-suresh, there is a fix on this branch for TorchMetrics: Lightning-AI/torchmetrics#339. Mind giving it a try? Best, |
Tested with the fix branch. Both single gpu + ddp and multi gpu + ddp are working as expected. @tchaton any insights on this warning?
|
@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8. |
@tchaton @awaelchli Will this ddp fix be part of the PyTorch Lightning 1.3.8 release? |
The device_ids ddp fix (#8165) is in 1.3.8, yes, released today. Lightning-AI/torchmetrics#339 appears to be in the rc release: https://github.com/PyTorchLightning/metrics/releases |
Dear @shrinath-suresh, the TorchMetrics fix has been included in v0.4.1rc0. Best, |
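To check an environment against the versions named in this thread (pytorch-lightning 1.3.8 for the device_ids fix, torchmetrics v0.4.1rc0 for the metrics fix), a simplified comparison could look like the sketch below. `version_key`, `has_fix` and `MINIMUMS` are hypothetical helpers; the parser handles only plain releases and a/b/rc pre-releases and is not a full PEP 440 implementation.

```python
import re

# Matches "1.3.8" or "0.4.1rc0"; anything after that prefix is ignored.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?")

# Minimum versions taken from the comments in this thread.
MINIMUMS = {"pytorch-lightning": "1.3.8", "torchmetrics": "0.4.1rc0"}


def version_key(version):
    """Turn a version string into a sortable tuple (simplified parsing)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    release = [int(part) for part in match.group(1).split(".")]
    release += [0] * (3 - len(release))  # pad "1.3" to "1.3.0"
    if match.group(2) is None:
        # A final release sorts after any pre-release with the same number.
        return (tuple(release), 1, "", 0)
    return (tuple(release), 0, match.group(2), int(match.group(3)))


def has_fix(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_key(installed) >= version_key(minimum)


print(has_fix("1.3.7", MINIMUMS["pytorch-lightning"]))
print(has_fix("0.4.1rc0", MINIMUMS["torchmetrics"]))
```

The installed version strings could be obtained with `importlib.metadata.version("pytorch-lightning")` on Python 3.8 or later.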
sure, Thanks |
🐛 Bug
Model saving using torch.save is not working with the ddp accelerator.
To Reproduce
https://github.com/mlflow/mlflow-torchserve/blob/master/examples/IrisClassification/iris_classification.py
The above mentioned example trains the Iris Classification model.
Dependent packages:
Run the example using the following command
python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp
Produces the following error while saving the model
The same script was working for us up to 1.2.7. To reproduce:
Install PyTorch Lightning 1.2.7:
pip install pytorch-lightning==1.2.7
and run the same command again:
python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp
Now, the model trains and pt file is saved successfully.
Attaching both the logs with NCCL_DEBUG set to INFO for reference
ptl_model_save_success_1.2.7.txt
ptl_model_save_failure_1.3.7.txt
Expected behavior
Iris classification model trains successfully and the pt file is generated
Environment
Installed via (conda, pip, source): pip
torch.__config__.show():
Additional context
Also tried torch.save(trainer.get_model(), "iris.pt"). In PyTorch Lightning 1.3.7, the same error is shown.