ERROR when using multi-gpu training #18
I had a similar issue; make sure you're using PyTorch 1.12, as specified in the environment.yml file.
I tried this code. When I set --nproc_per_node=1 it works fine, but as soon as --nproc_per_node>1 (e.g., --nproc_per_node=2) it fails with the same error as in the picture. Is there a solution for this? My torch version is 2.1 because I'm using H800 GPUs.
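For reference, multi-GPU runs with torch 2.x are usually launched with `torchrun`, which hands each worker its identity through environment variables instead of a `--local_rank` argument. A minimal sketch (the script name `train.py` and flags are illustrative, not from this repo; the last line just simulates what one worker of a 2-process run would see):

```shell
# Illustrative launch for a 2-GPU run (assumes torch is installed):
#   torchrun --standalone --nproc_per_node=2 train.py <args>
# torchrun sets WORLD_SIZE / RANK / LOCAL_RANK per worker; simulate worker 1:
WORLD_SIZE=2 RANK=1 LOCAL_RANK=1 sh -c 'echo "world=$WORLD_SIZE rank=$RANK local=$LOCAL_RANK"'
```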
@RachelTeamo To run on torch 2.1, replace line 89. I couldn't find any info in the PyTorch docs warning about the change in the DDP API, but this solved the issue for me.
Thanks for your suggestion; I replaced the code as you suggested, but the issue still exists.
I solved this by commenting out lines 79–84:

```python
# if dist.get_rank() == 0:
#     with torch.no_grad():
#         images = torch.zeros([batch_gpu, net.img_channels, net.img_resolution, net.img_resolution], device=device)
#         sigma = torch.ones([batch_gpu], device=device)
#         labels = torch.zeros([batch_gpu, net.label_dim], device=device)
#         misc.print_module_summary(net, [images, sigma, labels], max_nesting=2)
```

And set
Hi, thanks for sharing your work. I can't train your model on my GPUs (two 4090s). Is there any solution?