
torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense #8227

Closed
shrinath-suresh opened this issue Jun 30, 2021 · 18 comments
Labels: bug (Something isn't working), priority: 0 (High priority task)

Comments

@shrinath-suresh

🐛 Bug

Saving the model with torch.save does not work with the ddp accelerator.

To Reproduce

https://github.com/mlflow/mlflow-torchserve/blob/master/examples/IrisClassification/iris_classification.py

The above-mentioned example trains the Iris classification model.
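
For reference, the part of the script that triggers the failure looks roughly like this (a minimal sketch; the full script is in the linked repository and the trainer/model setup is simplified here):

import pytorch_lightning as pl
import torch

# Sketch only: IrisClassification and the DataModule (dm) come from the linked example script.
model = IrisClassification()
trainer = pl.Trainer(max_epochs=30, gpus=1, accelerator="ddp")
trainer.fit(model, dm)
torch.save(model.state_dict(), "iris.pt")  # raises "Tensors must be CUDA and dense" on PL 1.3.7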

Dependent packages:

torch==1.9.0
torchvision==0.10.0
sklearn
pytorch-lightning==1.3.7

Run the example using the following command:

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

This produces the following error while saving the model:

--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/mlflow-torchserve/examples/IrisClassification/iris_classification.py", line 127, in <module>
    torch.save(model.state_dict(), "iris.pt")
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1259, in state_dict
    module.state_dict(destination, prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 421, in state_dict
    with self.sync_context(dist_sync_fn=self.dist_sync_fn):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 299, in sync_context
    cache = self.sync(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 272, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 213, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

The same script was working for us up to PyTorch Lightning 1.2.7. To reproduce:

Install PyTorch Lightning 1.2.7 (pip install pytorch-lightning==1.2.7) and run the same command again:

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

Now the model trains and the .pt file is saved successfully.

Attaching both logs with NCCL_DEBUG set to INFO for reference:
ptl_model_save_success_1.2.7.txt
ptl_model_save_failure_1.3.7.txt

Expected behavior

The Iris classification model trains successfully and the .pt file is generated.

Environment

  • CUDA:
    • GPU:
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.21.0
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0+cu102
    • pytorch-lightning: 1.3.7
    • tqdm: 4.61.1
  • System:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Also tried torch.save(trainer.get_model(), "iris.pt"); in PyTorch Lightning 1.3.7 the same error is shown.

@shrinath-suresh shrinath-suresh added bug Something isn't working help wanted Open to be worked on labels Jun 30, 2021
@ethanwharris ethanwharris added the priority: 0 High priority task label Jun 30, 2021
@ethanwharris
Member

ethanwharris commented Jun 30, 2021

Hi @shrinath-suresh thanks for reporting this. @Borda @awaelchli @tchaton Is this related to (or maybe fixed by) ongoing changes with torch metrics?

@awaelchli
Contributor

Could be, yes.
Let's try to run this example against the PR Lightning-AI/torchmetrics#334

But one important note about your code:
You have torch.save() at the end of the script. When using ddp here, you will ask each process to save the object to the same file on the filesystem. Chances are high that you will run into problems there because multiple processes can't save to the same file simultaneously.

You would probably want to do this:

state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "model.pt")

cc @tchaton

@chauhang

@awaelchli Still seeing the same problem with PTL 1.3.7 and metrics installed from the fix_metric branch. Further, the system hangs if --gpus is set to a value > 1.

$ pip freeze | grep torch
pytorch-lightning==1.3.7
torch==1.9.0
torchaudio==0.9.0a0+33b2469
-e git+https://github.com/PyTorchLightning/metrics.git@f720bb2a4f3a70ce6b4ec5dd0ce2ccf16c785a8b#egg=torchmetrics
torchvision==0.10.0

@awaelchli
Contributor

awaelchli commented Jun 30, 2021

Yes, I can reproduce it.
The issue comes from the newly introduced sync function in torchmetrics, which uses torch.distributed.all_gather. This does not work if the model is on CPU while the distributed backend is initialized for GPU.
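
For illustration, the constraint comes from the NCCL collective itself rather than from torchmetrics specifically. A minimal sketch (assuming a process group initialized with the NCCL backend, as ddp on GPU does):

import torch
import torch.distributed as dist

# A CPU tensor, e.g. a metric state after the module was moved off the GPU
result = torch.zeros(1)
gathered = [torch.zeros_like(result) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, result)  # RuntimeError: Tensors must be CUDA and dense

# Moving the tensor to this process's GPU first makes the collective valid
result = result.to(torch.device("cuda", torch.cuda.current_device()))
gathered = [torch.zeros_like(result) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, result)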

Your temporary workaround until we fix the issue in torchmetrics/Lightning:
torch.save(model.to(trainer.accelerator.root_device).state_dict(), "iris.pt")

As per my previous comment, the following is better (regardless of this issue report):

model.to(trainer.accelerator.root_device) # temp fix for torchmetrics sync issue
state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "iris.pt")

@awaelchli
Contributor

@tchaton we need to find a solution for all_gather in the sync function, as it will only work if the module is on the correct device.

@chauhang

Thanks @awaelchli. The workaround works for 1 GPU, but training still hangs if gpus > 1.

ddp-4gpu.txt

@awaelchli
Contributor

awaelchli commented Jun 30, 2021

Yes, sorry, I forgot to say that one has to use PL master or PL 1.3.8, which will be released tomorrow. #8218

@edenlightning
Contributor

See Lightning-AI/torchmetrics#334 for the fix

@edenlightning edenlightning added metrics and removed help wanted Open to be worked on labels Jun 30, 2021
@chauhang

chauhang commented Jul 1, 2021

@awaelchli Same error for gpus > 2 even on the master branch. Training hangs and does not exit. NCCL_DEBUG logs attached.

ddp-4gpu-nccl-debug.txt

$ pip freeze | grep torch
pytorch-lightning @ https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip
torch==1.9.0
torchaudio==0.9.0a0+33b2469
-e git+https://github.com/PyTorchLightning/metrics.git@4067585aba8416a2817f15aefeecc4332a8ef138#egg=torchmetrics
torchvision==0.10.0

@awaelchli
Contributor

@chauhang I'm realizing now that you are actually not the OP who reported the issue. My statements are all for the original author, and I'm using their code and their command to verify that it works with ddp and gpus > 2.

Please open a separate issue with your problem and code. Thank you for your understanding.

@chauhang

chauhang commented Jul 1, 2021

@awaelchli We are working on the same project and using the same code; things are not working on gpus > 2 for DDP. @shrinath-suresh will work with @tchaton on the PyTorch Slack channel to get the issues resolved for our project.

@tchaton
Contributor

tchaton commented Jul 1, 2021

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: Lightning-AI/torchmetrics#339.

Mind giving it a try?

Best,
T.C

@shrinath-suresh
Author

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: PyTorchLightning/metrics#339.

Mind giving it a try?

Best,
T.C

Tested with the fix branch.

Both single GPU + ddp

python iris_classification.py --max_epochs 10 --gpus 1 --accelerator ddp

and multi GPU + ddp

python iris_classification.py --max_epochs 10 --gpus 2 --accelerator ddp

are working as expected.
iris_classification_multi_gpu_ddp.txt
iris_classification_single_gpu_ddp.txt

@tchaton any insights on this warning?

[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

@awaelchli
Contributor

@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8.
Let me know if that helps.
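
For context, the warning refers to the optional device_ids argument of torch.distributed.barrier(), available with the NCCL backend since torch 1.8. A minimal sketch of what passing it looks like (local-rank handling simplified here; #8165 does this inside Lightning):

import torch
import torch.distributed as dist

# Assumes one GPU per process on a single node, as in a typical ddp run;
# using the global rank as the device index is a simplification.
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])  # pins the barrier to this process's GPU, avoiding the best-guess warning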

@shrinath-suresh
Author

@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8.
Let me know if that helps.

@awaelchli @tchaton

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: PyTorchLightning/metrics#339.

Mind giving it a try?

Best,
T.C

@tchaton @awaelchli Will this ddp fix be part of the PyTorch Lightning 1.3.8 release?

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@awaelchli
Contributor

The device_ids ddp fix (#8165) is in 1.3.8, yes, released today.

It looks like Lightning-AI/torchmetrics#339 is in the rc release: https://github.com/PyTorchLightning/metrics/releases

@tchaton
Contributor

tchaton commented Jul 5, 2021

Dear @shrinath-suresh,

And the TorchMetrics fix has been included in v0.4.1rc0.
Do you have any more questions, or can we close this issue?

Best,
T.C

@shrinath-suresh
Author

Dear @shrinath-suresh,

And the TorchMetrics fix has been included in v0.4.1rc0.
Do you have any more questions, or can we close this issue?

Best,
T.C

Sure, thanks.

@Borda Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021