Issues with Confusion Matrix normalization and DDP computation #2724

Closed
jpblackburn opened this issue Jul 27, 2020 · 7 comments · Fixed by #3465
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on)

jpblackburn commented Jul 27, 2020

🐛 Bug

I started using the ConfusionMatrix metric to compute normalized confusion matrices within a multi-GPU DDP environment. However, I found the following issues:

  1. The normalization divisor is correctly computed in a row-wise manner; however, the division is applied column-wise.
  2. The normalization does not protect against division by zero when a row contains no data. While this is unusual for a well-designed validation set, it can happen with a large number of unbalanced classes and a small limit_val_batches (such as when debugging). (Issues 1 and 2 are illustrated in the sketch after this list.)
  3. There is no way to specify the number of classes for the confusion matrix. This is critical when performing DDP reduction, because the automatic inference of the number of classes can produce different answers on each process. I hit this possibility with a large number of unbalanced classes: one of the DDP processes did not see any true data or declarations of the last class, so its inferred number of classes was one less than on the other processes. This bug sometimes causes the entire training run to hang, forcing me to manually kill each DDP process.
  4. When computing a normalized confusion matrix with DDP reduction, the sum reduction needs to happen before the normalization.
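
The first two issues can be seen with plain PyTorch tensors, independent of Lightning. A minimal sketch with made-up counts (not the library code):

```python
import torch

# Made-up 4-class count matrix; the last class never occurs, so its row is all zeros.
cm = torch.tensor([[4., 1., 0., 0.],
                   [0., 6., 2., 0.],
                   [0., 0., 7., 0.],
                   [0., 0., 0., 0.]])

row_sums = cm.sum(-1)                      # correct row-wise divisors, shape (4,)

wrong = cm / row_sums                      # issue 1: broadcasting applies the divisors column-wise
row_wise = cm / row_sums.unsqueeze(-1)     # row-wise, but issue 2: 0/0 -> nan in the empty row

# One way to guard the empty row: clamp the divisor so an all-zero row stays all zeros.
safe = cm / row_sums.clamp(min=1.0).unsqueeze(-1)

print(wrong)     # nan in the last column, and the rows no longer sum to 1
print(row_wise)  # correct for rows 0-2, but the empty last row is all nan
print(safe)      # each non-empty row sums to 1; the empty row stays all zeros
```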

To Reproduce

As a means of reproducing these issues, I have attached two Python scripts. I had to change their extension to .txt so that GitHub would accept the upload. The script confusion_matrix.py exposes the first two issues (normalization) within a single process. The script confusion_matrix_ddp.py exposes the last two issues by computing the metric in a model trained with DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within the script and printed to standard out, alongside the confusion matrix computed by PyTorch Lightning.

confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
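
For reference, a minimal single-process sketch in the spirit of confusion_matrix.py (not the attached script itself). It assumes the class-based ConfusionMatrix metric that shipped with PyTorch Lightning around the 0.8/0.9 releases; the exact import path and constructor signature may differ:

```python
import torch
from pytorch_lightning.metrics import ConfusionMatrix  # assumed import path

# 4 classes, 20 samples, unevenly distributed; the values are illustrative only.
target = torch.tensor([0] * 8 + [1] * 6 + [2] * 4 + [3] * 2)
pred = target.clone()
pred[1], pred[9], pred[15] = 1, 2, 0       # introduce a few misclassifications

# "True" normalized confusion matrix, computed by hand: count each (true, predicted)
# pair, then divide each row by the number of samples of that true class.
true_cm = torch.zeros(4, 4)
for t, p in zip(target.tolist(), pred.tolist()):
    true_cm[t, p] += 1
true_cm = true_cm / true_cm.sum(-1, keepdim=True)

test_cm = ConfusionMatrix(normalize=True)(pred, target)

print(true_cm)
print(test_cm)   # with the normalization bug, the two matrices do not match
```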

Steps to reproduce the behavior:

  1. Download both scripts
  2. Rename them to remove the .txt extension
  3. Run ./confusion_matrix.py on a machine with at least 1 GPU. Compare the true and test confusion matrices.
  4. Run ./confusion_matrix_ddp.py on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
    • WARNING: This may hang and require you to manually kill each DDP process.
  5. Run ./confusion_matrix_ddp.py --normalize on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.
    • WARNING: This may hang and require you to manually kill each DDP process.

Expected behavior

The computed confusion matrix should be identical to the true confusion matrix printed by the scripts provided above.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0
  • OS (e.g., Linux): CentOS
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source): NA
  • Python version: 3.8.3
  • CUDA/cuDNN version: Conda cudatoolkit=10.2.89=hfd86e86_1
  • GPU models and configuration: GeForce GTX 1070 and GeForce GTX 970

Solution

I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.

The solutions to the first two issues did not require any major changes. The third issue required adding a new argument to ConfusionMatrix for the number of classes; it is optional and placed last, so the API stays backwards compatible. The fourth issue was the most involved: it required changing ConfusionMatrix to derive from Metric rather than TensorMetric, and to use an internal class that derives from TensorMetric. In this design, forward delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, then performs the normalization itself after the DDP reduction is complete.
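
A rough structural sketch of that design (not the code in the linked branch). It assumes the mid-2020 Metric / TensorMetric base classes from pytorch_lightning.metrics.metric; the helper class name is hypothetical, and num_classes is treated as required here to keep the sketch short, whereas the actual fix keeps it optional:

```python
import torch
from pytorch_lightning.metrics.metric import Metric, TensorMetric  # assumed import path


class _UnnormalizedConfusionMatrix(TensorMetric):
    """Computes the raw count matrix; TensorMetric handles the DDP sum reduction."""

    def __init__(self, num_classes: int):
        super().__init__(name='confusion_matrix')
        self.num_classes = num_classes

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred and target are assumed to hold integer class indices.
        # Fixing num_classes keeps the matrix the same shape on every DDP process,
        # so the reduction cannot mismatch or hang (issue 3).
        flat = target.view(-1) * self.num_classes + pred.view(-1)
        counts = torch.bincount(flat, minlength=self.num_classes ** 2)
        return counts.reshape(self.num_classes, self.num_classes).float()


class ConfusionMatrix(Metric):
    """Normalizes only after the reduced counts come back (issue 4)."""

    def __init__(self, num_classes: int, normalize: bool = False):
        super().__init__(name='confusion_matrix')
        self.normalize = normalize
        self._counts = _UnnormalizedConfusionMatrix(num_classes)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        cm = self._counts(pred, target)   # unnormalized counts, already summed across processes
        if self.normalize:
            # Row-wise division (issue 1) with a divide-by-zero guard (issue 2).
            cm = cm / cm.sum(-1, keepdim=True).clamp(min=1.0)
        return cm
```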

I fixed the issues incrementally, for easier understanding.

To validate the solution:

  1. Run ./confusion_matrix.py on a machine with at least 1 GPU. Compare the true and test confusion matrices.
  2. Run ./confusion_matrix_ddp.py --set-num-classes on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
  3. Run ./confusion_matrix_ddp.py --set-num-classes --normalize on a machine with at least 2 GPUs. Compare the true and test confusion matrices.

Note the addition of --set-num-classes to the DDP script.

jpblackburn added the bug and help wanted labels on Jul 27, 2020
@github-actions

Hi! Thanks for your contribution, great first issue!

c00k1ez commented Sep 12, 2020

Hi!
Issue 3 was already fixed in #3450 and Issue 2 is in progress in #3465. I'm also interested in a ConfusionMatrix fix for DDP mode, so maybe we can collaborate to apply your changes to the current version of the source code?

UPD: As I understand it, TensorMetric was already fixed for DDP in #2528, so there is no need to change it to Metric.

@SkafteNicki

@c00k1ez could you also take care of issue 1 in your #3465 PR?
Should be a simple change from cm/cm.sum(-1) to cm/cm.sum(-1, keepdim=True) in the normalization.
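
For illustration, the broadcasting difference on a made-up 2x2 count matrix:

```python
import torch

cm = torch.tensor([[8., 2.],
                   [1., 4.]])

print(cm / cm.sum(-1))                 # divisors broadcast across columns: [[0.8, 0.4], [0.1, 0.8]]
print(cm / cm.sum(-1, keepdim=True))   # divisors kept per row:             [[0.8, 0.2], [0.2, 0.8]]
```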

c00k1ez commented Sep 12, 2020

Yeah, just a moment 👍

c00k1ez commented Sep 12, 2020

Fixed Issue 1 in #3465 (commit fd19c78).

c00k1ez commented Sep 14, 2020

@SkafteNicki As I remember, #3465 does not solve issue 4.

@SkafteNicki

@c00k1ez you are correct, it was auto-closed when #3465 was merged. However, I do have a fix for this and hope to get to it soon (write to me on Slack if you want to take over).
