Metric with ddp spawn causes script to hang after Trainer.fit() #331
Comments
Hi! Thanks for your contribution, great first issue!
I will try to git bisect against torchmetrics master to find the commit.
Thanks @awaelchli for bringing it up! I'm using Lightning + torchmetrics and encountered the same issue with ddp. cc: @tchaton, @maximsch2
@awaelchli could you try the fix in PR #338 and see if that fixes the problem?
It was fixed on the Lightning side in Lightning-AI/pytorch-lightning#8218 to account for the changes in torchmetrics state_dict saving and loading.
Btw, #339 is reworking the state_dict logic in parallel, which also resolves the problem here.
Closing as #339 solved this.
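For reference on the state_dict behavior discussed in the comments above: whether a torchmetrics metric's internal states end up in a module's state_dict is controlled by a per-state persistent flag, whose handling is what the referenced PRs changed. A minimal sketch for inspecting this on a given torchmetrics version (exact state names and defaults depend on the version):

```python
import torch
from torchmetrics import Accuracy

# Build a metric and feed it some predictions/targets.
# (torchmetrics >= 0.11 requires a task argument; older versions use Accuracy().)
metric = Accuracy(task="binary")
metric.update(torch.tensor([1, 0, 1]), torch.tensor([1, 1, 1]))

# Metric states are typically non-persistent by default, so they may be
# absent from the state_dict.
print(metric.state_dict())

# Mark the metric states as persistent so they are saved and loaded
# together with the model checkpoint.
metric.persistent(True)
print(metric.state_dict())  # now contains the metric's internal state tensors
```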
🐛 Bug
Adding a torchmetric as an attribute to the model causes processes to hang when launching with ddp spawn.
To Reproduce
Code sample
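A minimal sketch of the kind of script that reproduces the hang, assuming two GPUs and the Lightning/torchmetrics versions current at the time; the module and dataset names are illustrative, not the original code:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from torchmetrics import Accuracy


class RandomDataset(Dataset):
    """Random features with binary labels, just to make fit() run."""

    def __init__(self, size=64, dim=32):
        self.data = torch.randn(size, dim)
        self.labels = torch.randint(0, 2, (size,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # Assigning the metric is enough to trigger the hang; it is never
        # updated or computed. (Newer torchmetrics: Accuracy(task="binary").)
        self.metric = Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=2,
        accelerator="ddp_spawn",  # newer Lightning: strategy="ddp_spawn"
    )
    trainer.fit(Model(), DataLoader(RandomDataset(), batch_size=8))
    print("reached the end")  # with the bug, the script hangs before this line
```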
A few important notes:
If the model assigns a metric as an attribute, e.g. `self.metric = Accuracy()`, then it will break. You don't have to use the metric, just assign it.
Expected behavior
No hang.
Environment
How you installed (conda, pip, source): pip
Additional context
Trying to prepare the PL patch release: Lightning-AI/pytorch-lightning#8198