Issues with Confusion Matrix normalization and DDP computation #2724
Comments

Hi! Thanks for your contribution — great first issue!

Hi! UPD: As I understand,

Yeah, just a moment 👍

@SkafteNicki As I remember, #3465 does not solve issue 4
🐛 Bug
I started using the `ConfusionMatrix` metric to compute normalized confusion matrices within a multi-GPU DDP environment. However, I found four issues, one of which occurs when `limit_val_batches` is small (such as when debugging).

To Reproduce
As a means to reproduce these issues, I have attached two Python scripts. I had to change the extension to `.txt` so that they would upload to GitHub. The script `confusion_matrix.py` exposes the first two issues with normalization within a single process. The script `confusion_matrix_ddp.py` exposes the final two issues by computing the metric within a model trained using DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within the script and printed to standard out, along with the confusion matrix computed by PyTorch Lightning.

confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
Steps to reproduce the behavior:

1. Download the two attached scripts and remove the `.txt` extension.
2. Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
3. Run `./confusion_matrix_ddp.py` on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
4. Run `./confusion_matrix_ddp.py --normalize` on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.

Expected behavior
The computed confusion matrix will be identical to the true confusion matrix printed within the scripts provided above.
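For reference, the "true" confusion matrix is just a count matrix over (target, prediction) pairs. The sketch below (plain NumPy, not the attached scripts; the labels and misclassifications are hypothetical) shows such a matrix for a 4-class, 20-sample problem and its row normalization:

```python
import numpy as np

def confusion_matrix(target, pred, num_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target, pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: 20 samples unevenly divided among 4 classes.
target = np.array([0] * 8 + [1] * 6 + [2] * 4 + [3] * 2)
pred = target.copy()
pred[[0, 8, 14]] = [1, 2, 3]  # introduce a few misclassifications

cm = confusion_matrix(target, pred, num_classes=4)
row_norm = cm / cm.sum(axis=1, keepdims=True)  # each row sums to 1
print(cm)
print(row_norm)
```

Whatever the implementation, the computed counts should sum to the total number of samples, and every row of the normalized matrix should sum to 1.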
Environment

- PyTorch version: `1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0`
- How you installed PyTorch (`conda`, `pip`, source): `conda`
- CUDA: `cudatoolkit=10.2.89=hfd86e86_1`
Solution

I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.

The solution to the first two issues did not require any major changes. The third issue required adding a new argument to `ConfusionMatrix` for the number of classes; it is optional and the final argument, so the API remains backwards compatible. The fourth issue was the most involved, as it required modifying `ConfusionMatrix` to derive from `Metric` rather than `TensorMetric`. It then uses an internal class that derives from `TensorMetric`. In this manner, `forward` delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, performing the normalization itself after the DDP reduction is complete. I incrementally fixed the issues for easier understanding.
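The order of operations matters here: summing (or averaging) per-process normalized matrices is not the same as normalizing the summed count matrix. A minimal NumPy sketch, with plain addition standing in for the DDP all-reduce and hypothetical per-rank counts:

```python
import numpy as np

def row_normalize(cm):
    """Divide each row by its sum; rows with no samples stay zero."""
    sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, sums, out=np.zeros_like(cm, dtype=float), where=sums != 0)

# Hypothetical count matrices from two DDP processes (uneven class splits).
cm_rank0 = np.array([[3.0, 1.0],
                     [0.0, 2.0]])
cm_rank1 = np.array([[1.0, 0.0],
                     [4.0, 2.0]])

# Wrong: normalize on each rank, then reduce (average across ranks).
wrong = (row_normalize(cm_rank0) + row_normalize(cm_rank1)) / 2

# Right: reduce the raw counts first (what all_reduce does), then normalize.
right = row_normalize(cm_rank0 + cm_rank1)

print(wrong)  # rows weighted by rank, not by sample count
print(right)
```

Because the ranks hold different numbers of samples per class, the "wrong" result weights each rank equally instead of each sample, which is why the fix normalizes only after the DDP reduction completes.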
To validate the solution:

1. Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
2. Run `./confusion_matrix_ddp.py --set-num-classes` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
3. Run `./confusion_matrix_ddp.py --set-num-classes --normalize` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.

Note the addition of `--set-num-classes` to the DDP script.