
torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense #8227

Closed
shrinath-suresh opened this issue Jun 30, 2021 · 18 comments
Labels: bug (Something isn't working), priority: 0 (High priority task)

Comments

@shrinath-suresh

🐛 Bug

Saving the model with torch.save does not work with the ddp accelerator.

To Reproduce

https://github.com/mlflow/mlflow-torchserve/blob/master/examples/IrisClassification/iris_classification.py

The above-mentioned example trains the Iris classification model.
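
For reference, the part of the script that triggers the failure looks roughly like this (a minimal sketch; the full script is in the linked repository and the trainer/model setup is simplified here):

import pytorch_lightning as pl
import torch

# Sketch only: IrisClassification and the DataModule (dm) come from the linked example script.
model = IrisClassification()
trainer = pl.Trainer(max_epochs=30, gpus=1, accelerator="ddp")
trainer.fit(model, dm)
torch.save(model.state_dict(), "iris.pt")  # raises "Tensors must be CUDA and dense" on PL 1.3.7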

Dependent packages:

torch==1.9.0
torchvision==0.10.0
sklearn
pytorch-lightning==1.3.7

Run the example using the following command:

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

This produces the following error while saving the model:

--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/mlflow-torchserve/examples/IrisClassification/iris_classification.py", line 127, in <module>
    torch.save(model.state_dict(), "iris.pt")
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1259, in state_dict
    module.state_dict(destination, prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 421, in state_dict
    with self.sync_context(dist_sync_fn=self.dist_sync_fn):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 299, in sync_context
    cache = self.sync(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 272, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 213, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

The same script was working for us up to PyTorch Lightning 1.2.7. To reproduce:

Install PyTorch Lightning 1.2.7 (pip install pytorch-lightning==1.2.7) and run the same command again:

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

Now the model trains and the .pt file is saved successfully.

Attaching both logs with NCCL_DEBUG set to INFO for reference:
ptl_model_save_success_1.2.7.txt
ptl_model_save_failure_1.3.7.txt

Expected behavior

The Iris classification model trains successfully and the .pt file is generated.

Environment

  • CUDA:
    • GPU:
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.21.0
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0+cu102
    • pytorch-lightning: 1.3.7
    • tqdm: 4.61.1
  • System:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Also tried torch.save(trainer.get_model(), "iris.pt"); in PyTorch Lightning 1.3.7 the same error is shown.

@shrinath-suresh shrinath-suresh added bug Something isn't working help wanted Open to be worked on labels Jun 30, 2021
@ethanwharris ethanwharris added the priority: 0 High priority task label Jun 30, 2021
@ethanwharris
Member

ethanwharris commented Jun 30, 2021

Hi @shrinath-suresh thanks for reporting this. @Borda @awaelchli @tchaton Is this related to (or maybe fixed by) ongoing changes with torch metrics?

@awaelchli
Contributor

Could be, yes.
Let's try to run this example against the PR Lightning-AI/torchmetrics#334

But one important note about your code:
You have torch.save() at the end of the script. When using ddp here, you will ask each process to save the object to the same file on the filesystem. Chances are high that you will run into problems there because multiple processes can't save to the same file simultaneously.

You would probably want to do this:

state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "model.pt")

cc @tchaton

@chauhang

@awaelchli Still seeing the same problem with PTL 1.3.7 and metrics installed from the fix_metric branch. Further, the system hangs if --gpus is set to a value > 1.

$ pip freeze | grep torch
pytorch-lightning==1.3.7
torch==1.9.0
torchaudio==0.9.0a0+33b2469
-e git+https://github.com/PyTorchLightning/metrics.git@f720bb2a4f3a70ce6b4ec5dd0ce2ccf16c785a8b#egg=torchmetrics
torchvision==0.10.0

@awaelchli
Contributor

awaelchli commented Jun 30, 2021

Yes, I can reproduce it.
The issue comes from the newly introduced sync function in torchmetrics, which uses torch.distributed.all_gather. This does not work if the model is on CPU while the distributed backend is initialized for GPU.
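
For illustration, the constraint comes from the NCCL collective itself rather than from torchmetrics specifically. A minimal sketch (assuming a process group initialized with the NCCL backend, as ddp on GPU does):

import torch
import torch.distributed as dist

# A CPU tensor, e.g. a metric state after the module was moved off the GPU
result = torch.zeros(1)
gathered = [torch.zeros_like(result) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, result)  # RuntimeError: Tensors must be CUDA and dense

# Moving the tensor to this process's GPU first makes the collective valid
result = result.to(torch.device("cuda", torch.cuda.current_device()))
gathered = [torch.zeros_like(result) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, result)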

Your temporary workaround until we fix the issue in torchmetrics/Lightning:
torch.save(model.to(trainer.accelerator.root_device).state_dict(), "iris.pt")

As per my previous comment, the following is better (regardless of this issue report):

model.to(trainer.accelerator.root_device) # temp fix for torchmetrics sync issue
state_dict = model.state_dict()
if trainer.global_rank == 0:
    torch.save(state_dict, "iris.pt")

@awaelchli
Contributor

@tchaton we need to find a solution for all_gather in the sync function, as it will only work if the module is on the correct device.

@chauhang

Thanks @awaelchli. The workaround works for 1 GPU, but training still hangs if gpus > 1.

ddp-4gpu.txt

@awaelchli
Contributor

awaelchli commented Jun 30, 2021

Yes, sorry, I forgot to say that one has to use PL master or PL 1.3.8, which will be released tomorrow. #8218

@edenlightning
Contributor

See Lightning-AI/torchmetrics#334 for the fix

@edenlightning edenlightning added metrics and removed help wanted Open to be worked on labels Jun 30, 2021
@chauhang

chauhang commented Jul 1, 2021

@awaelchli Same error for gpus > 2 even on the master branch. Training hangs and does not exit. NCCL_DEBUG logs attached.

ddp-4gpu-nccl-debug.txt

$ pip freeze | grep torch
pytorch-lightning @ https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip
torch==1.9.0
torchaudio==0.9.0a0+33b2469
-e git+https://github.com/PyTorchLightning/metrics.git@4067585aba8416a2817f15aefeecc4332a8ef138#egg=torchmetrics
torchvision==0.10.0

@awaelchli
Contributor

@chauhang I'm realizing now that you are actually not the OP who reported the issue. My statements are all for the original author, and I'm using their code and their command to verify that it works with ddp and gpus > 2.

Please open a separate issue with your problem and code. Thank you for your understanding.

@chauhang

chauhang commented Jul 1, 2021

@awaelchli We are working on the same project and using the same code; things are not working on gpus > 2 for DDP. @shrinath-suresh will work with @tchaton on the PyTorch Slack channel to get the issues resolved for our project.

@tchaton
Contributor

tchaton commented Jul 1, 2021

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: Lightning-AI/torchmetrics#339.

Mind giving it a try?

Best,
T.C

@shrinath-suresh
Author

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: PyTorchLightning/metrics#339.

Mind giving it a try?

Best,
T.C

Tested with the fix branch.

Both single GPU + ddp

python iris_classification.py --max_epochs 10 --gpus 1 --accelerator ddp

and multi GPU + ddp

python iris_classification.py --max_epochs 10 --gpus 2 --accelerator ddp

are working as expected.
iris_classification_multi_gpu_ddp.txt
iris_classification_single_gpu_ddp.txt

@tchaton any insights on this warning?

[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

@awaelchli
Contributor

@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8.
Let me know if that helps.
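
For context, the warning refers to the optional device_ids argument of torch.distributed.barrier(), available with the NCCL backend since torch 1.8. A minimal sketch of what passing it looks like (local-rank handling simplified here; #8165 does this inside Lightning):

import torch
import torch.distributed as dist

# Assumes one GPU per process on a single node, as in a typical ddp run;
# using the global rank as the device index is a simplification.
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])  # pins the barrier to this process's GPU, avoiding the best-guess warning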

@shrinath-suresh
Author

@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8.
Let me know if that helps.

@awaelchli @tchaton

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: PyTorchLightning/metrics#339.

Mind giving it a try?

Best,
T.C

@tchaton @awaelchli Will this ddp fix be part of the PyTorch Lightning 1.3.8 release?

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@awaelchli
Contributor

The device_ids ddp fix (#8165) is in 1.3.8, yes, released today.

It looks like Lightning-AI/torchmetrics#339 is in the rc release: https://github.com/PyTorchLightning/metrics/releases

@tchaton
Contributor

tchaton commented Jul 5, 2021

Dear @shrinath-suresh,

And the TorchMetrics fix has been included in v0.4.1rc0.
Do you have any more questions, or can we close this issue?

Best,
T.C

@shrinath-suresh
Author

Dear @shrinath-suresh,

And the TorchMetrics fix has been included in v0.4.1rc0.
Do you have any more questions, or can we close this issue?

Best,
T.C

Sure, thanks.

@Borda Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021