Issues with Confusion Matrix normalization and DDP computation #2724
Comments

Hi! Thanks for your contribution — great first issue!

Hi! UPD: As I understand,

Yeah, just a moment 👍

@SkafteNicki As I remember, #3465 does not solve issue 4
🐛 Bug
I started using the `ConfusionMatrix` metric to compute normalized confusion matrices within a multi-GPU DDP environment. However, I found four issues, one of which occurs when `limit_val_batches` is small (such as when debugging).

To Reproduce
As a means to reproduce these issues, I have attached two Python scripts. I had to change the extension to `.txt` so that they would upload to GitHub. The script `confusion_matrix.py` exposes the first two issues with normalization within a single process. The script `confusion_matrix_ddp.py` exposes the final two issues by computing the metric within a model trained using DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within the script and printed to standard out, along with the confusion matrix computed by PyTorch Lightning.

confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
Steps to reproduce the behavior:

1. Download the two attached scripts and remove the `.txt` extension.
2. Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
3. Run `./confusion_matrix_ddp.py` on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
4. Run `./confusion_matrix_ddp.py --normalize` on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.

Expected behavior
The computed confusion matrix will be identical to the true confusion matrix printed within the scripts provided above.
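For reference, the "true" confusion matrix is just a count matrix over (target, prediction) pairs. The sketch below (plain NumPy, not the attached scripts; the labels and misclassifications are hypothetical) shows such a matrix for a 4-class, 20-sample problem and its row normalization:

```python
import numpy as np

def confusion_matrix(target, pred, num_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target, pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: 20 samples unevenly divided among 4 classes.
target = np.array([0] * 8 + [1] * 6 + [2] * 4 + [3] * 2)
pred = target.copy()
pred[[0, 8, 14]] = [1, 2, 3]  # introduce a few misclassifications

cm = confusion_matrix(target, pred, num_classes=4)
row_norm = cm / cm.sum(axis=1, keepdims=True)  # each row sums to 1
print(cm)
print(row_norm)
```

Whatever the implementation, the computed counts should sum to the total number of samples, and every row of the normalized matrix should sum to 1.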
Environment

- PyTorch version: `1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0`
- How you installed PyTorch (`conda`, `pip`, source): `conda`
- CUDA: `cudatoolkit=10.2.89=hfd86e86_1`
Solution

I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.

The solution to the first two issues did not require any major changes. The third issue required adding a new argument to `ConfusionMatrix` for the number of classes; it is optional and the final argument, so the API remains backwards compatible. The fourth issue was the most involved, as it required modifying `ConfusionMatrix` to derive from `Metric` rather than `TensorMetric`. It then uses an internal class that derives from `TensorMetric`. In this manner, `forward` delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, performing the normalization itself after the DDP reduction is complete. I incrementally fixed the issues for easier understanding.
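The order of operations matters here: summing (or averaging) per-process normalized matrices is not the same as normalizing the summed count matrix. A minimal NumPy sketch, with plain addition standing in for the DDP all-reduce and hypothetical per-rank counts:

```python
import numpy as np

def row_normalize(cm):
    """Divide each row by its sum; rows with no samples stay zero."""
    sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, sums, out=np.zeros_like(cm, dtype=float), where=sums != 0)

# Hypothetical count matrices from two DDP processes (uneven class splits).
cm_rank0 = np.array([[3.0, 1.0],
                     [0.0, 2.0]])
cm_rank1 = np.array([[1.0, 0.0],
                     [4.0, 2.0]])

# Wrong: normalize on each rank, then reduce (average across ranks).
wrong = (row_normalize(cm_rank0) + row_normalize(cm_rank1)) / 2

# Right: reduce the raw counts first (what all_reduce does), then normalize.
right = row_normalize(cm_rank0 + cm_rank1)

print(wrong)  # rows weighted by rank, not by sample count
print(right)
```

Because the ranks hold different numbers of samples per class, the "wrong" result weights each rank equally instead of each sample, which is why the fix normalizes only after the DDP reduction completes.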
To validate the solution:

1. Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
2. Run `./confusion_matrix_ddp.py --set-num-classes` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
3. Run `./confusion_matrix_ddp.py --set-num-classes --normalize` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.

Note the addition of `--set-num-classes` to the DDP script.