Problem with IoU #543

Closed
TheCodez opened this issue Jun 6, 2019 · 4 comments · Fixed by #572
Comments

@TheCodez
Contributor

TheCodez commented Jun 6, 2019

I'm training an FCN on the Cityscapes dataset. All ignored classes are mapped to 255. This works perfectly fine for the loss function using ignore_index.

Using the Ignite IoU metric however results in this error:

/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [36,0,0], thread: [0,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.

triggered here:

y_pred_ohe = to_onehot(indices.reshape(-1), self.num_classes)
  File "/usr/local/lib/python3.6/dist-packages/ignite/utils.py", line 48, in to_onehot
    onehot = torch.zeros(indices.shape[0], num_classes, *indices.shape[1:], device=indices.device)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:26

It's probably because there are only 19 classes and some values are 255.
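The failure mode suggested above can be reproduced without a GPU. The following is a minimal sketch in plain Python, not Ignite's actual `to_onehot`: the helper name `to_onehot_sketch` is hypothetical, and nested lists stand in for tensors. It shows that a label of 255 cannot be one-hot encoded into 19 columns, which is the same bounds condition the CUDA assertion checks.

```python
def to_onehot_sketch(indices, num_classes):
    """One-hot encode a flat list of class indices, one row per element.

    Hypothetical stand-in for a scatter-based one-hot encoding.
    """
    onehot = [[0] * num_classes for _ in indices]
    for row, idx in zip(onehot, indices):
        if not 0 <= idx < num_classes:
            # Same condition as the device-side assertion:
            # indexValue >= 0 && indexValue < tensor.sizes[dim]
            raise IndexError(
                f"class index {idx} is out of range for {num_classes} classes"
            )
        row[idx] = 1
    return onehot


NUM_CLASSES = 19    # number of trainable classes mentioned in the report
IGNORE_LABEL = 255  # value the ignored classes are mapped to

# Labels inside [0, 19) encode without problems.
valid = to_onehot_sketch([0, 18, 3], NUM_CLASSES)

# A single ignored pixel is enough to trigger the failure.
try:
    to_onehot_sketch([0, IGNORE_LABEL, 3], NUM_CLASSES)
    failed = False
    message = ""
except IndexError as exc:
    failed = True
    message = str(exc)
```

On the GPU the same out-of-range write surfaces as an asynchronous device-side assert, which is why the traceback may point at a later line than the one that actually performed the scatter.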

@vfdev-5
Collaborator

vfdev-5 commented Jun 6, 2019

@TheCodez thanks for the report! Yes, that can be a problem if we would like to ignore an index that is not contiguous... I think we mentioned that in the docs: https://pytorch.org/ignite/master/metrics.html#ignite.metrics.ConfusionMatrix

But I agree that such flexibility could be helpful

@TheCodez
Contributor Author

TheCodez commented Jun 7, 2019

@vfdev-5 it seems that the torchvision implementation doesn't have this limitation.

@vfdev-5
Collaborator

vfdev-5 commented Jun 7, 2019

@TheCodez yes, I saw that they compute it too, but I didn't inspect it in detail. It seems that if a target is outside num_classes it is ignored, as we may wish, but there is no explicit definition of ignored indices... Maybe we could produce something in between... If you would like to send a PR, you're welcome!
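The behaviour described here, where targets outside `[0, num_classes)` are simply skipped, can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the torchvision or Ignite code: the function names are hypothetical, lists stand in for tensors, and predictions are assumed to already lie in `[0, num_classes)`. Each valid pair is counted at the flat index `num_classes * target + prediction`, and per-class IoU is then read off the resulting matrix.

```python
def confusion_matrix_sketch(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    counts = [0] * (num_classes * num_classes)
    for t, p in zip(y_true, y_pred):
        # Targets such as 255 fall outside the valid range and are skipped,
        # so no explicit ignore_index argument is needed.
        if 0 <= t < num_classes:
            counts[num_classes * t + p] += 1
    # Reshape the flat counts into a num_classes x num_classes matrix.
    return [
        counts[r * num_classes:(r + 1) * num_classes]
        for r in range(num_classes)
    ]


def iou_per_class(cm):
    """IoU_c = TP_c / (row_sum_c + col_sum_c - TP_c); None if the union is empty."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        union = sum(cm[c]) + sum(cm[r][c] for r in range(n)) - tp
        ious.append(tp / union if union else None)
    return ious


# Toy data: three classes, with two pixels labelled 255 (ignored).
y_true = [0, 1, 2, 255, 1, 255]
y_pred = [0, 1, 1, 2, 1, 0]

cm = confusion_matrix_sketch(y_true, y_pred, num_classes=3)
ious = iou_per_class(cm)
```

Because the range check doubles as the ignore rule, any out-of-range label is dropped silently. An explicit ignore index, as raised in the comment above, would make that intent visible and would let mislabelled targets be reported instead of hidden.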

@TheCodez
Contributor Author

TheCodez commented Jun 7, 2019

@vfdev-5 I will create a pull request for it 👍

vfdev-5 added a commit to vfdev-5/ignite that referenced this issue Aug 2, 2019
Previous CM implementation suffered from the problem if target contains non-contiguous indices.
New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class.
Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0)  = no classes (background class),
then confusion matrix does not count any true/false predictions.
@vfdev-5 vfdev-5 mentioned this issue Aug 2, 2019
3 tasks
anmolsjoshi pushed a commit that referenced this issue Aug 4, 2019
* Fixes issue #543

Previous CM implementation suffered from the problem if target contains non-contiguous indices.
New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class.
Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0)  = no classes (background class),
then confusion matrix does not count any true/false predictions.

* Update confusion_matrix.py
vfdev-5 added a commit that referenced this issue Aug 30, 2019
* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Test on 2 GPUS single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Fixes issue #543 (#572)

* Fixes issue #543

Previous CM implementation suffered from the problem if target contains non-contiguous indices.
New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class.
Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0)  = no classes (background class),
then confusion matrix does not count any true/false predictions.

* Update confusion_matrix.py

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Added mlflow logger (#558)

* Added mlflow logger without tests

* Added mlflow tests, updated mlflow logger code and other tests

* Updated docs and added mlflow in travis

* Added tests for mlflow OptimizerParamsHandler
- additionally added OptimizerParamsHandler for plx with tests

* Update to PyTorch v1.2.0 (#580)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fix SSL problem of failing travis (#581)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fixes SSL problem to download model weights

* Fixed travis for deploy and nightly

* Fixes #583 (#584)

* Fixes docs build warnings (#585)

* Return removable handle from Engine.add_event_handler(). (#588)

* Add tests for event removable handle.

Add feature tests for engine.add_event_handler returning removable event
handles.

* Return RemovableEventHandle from Engine.add_event_handler.

* Fixup removable event handle test in python 2.7.

Explicitly trigger gc, allowing cycle detection between engine and
state, in removable handle weakref test. Python 2.7 cycle detection
appears to be less aggressive than python 3+.

* Add removable event handler docs.

Add autodoc configuration for RemovableEventHandler, expand "concepts"
documentation with event remove example following event add example.

* Update concepts.rst

* Updated travis and renamed tbptt test gpu -> cuda
vfdev-5 added a commit that referenced this issue Oct 24, 2019
* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Test on 2 GPUS single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Updated travis and renamed tbptt test gpu -> cuda

* Distrib (#573)

* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Test on 2 GPUS single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Fixes issue #543 (#572)

* Fixes issue #543

Previous CM implementation suffered from the problem if target contains non-contiguous indices.
New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class.
Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0)  = no classes (background class),
then confusion matrix does not count any true/false predictions.

* Update confusion_matrix.py

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Added mlflow logger (#558)

* Added mlflow logger without tests

* Added mlflow tests, updated mlflow logger code and other tests

* Updated docs and added mlflow in travis

* Added tests for mlflow OptimizerParamsHandler
- additionally added OptimizerParamsHandler for plx with tests

* Update to PyTorch v1.2.0 (#580)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fix SSL problem of failing travis (#581)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fixes SSL problem to download model weights

* Fixed travis for deploy and nightly

* Fixes #583 (#584)

* Fixes docs build warnings (#585)

* Return removable handle from Engine.add_event_handler(). (#588)

* Add tests for event removable handle.

Add feature tests for engine.add_event_handler returning removable event
handles.

* Return RemovableEventHandle from Engine.add_event_handler.

* Fixup removable event handle test in python 2.7.

Explicitly trigger gc, allowing cycle detection between engine and
state, in removable handle weakref test. Python 2.7 cycle detection
appears to be less aggressive than python 3+.

* Add removable event handler docs.

Add autodoc configuration for RemovableEventHandler, expand "concepts"
documentation with event remove example following event add example.

* Update concepts.rst

* Updated travis and renamed tbptt test gpu -> cuda

* Compute IoU, Precision, Recall based on CM on CPU

* Fixes incomplete merge with 1856c8e

* Update distrib branch and CIFAR10 example (#647)

* Added tests with gloo, minor updates and fixes

* Added single/multi node tests with gloo and [WIP] with nccl

* Added tests for multi-node nccl, improved examples/contrib/cifar10 example

* Experiments: 1n1gpu, 1n2gpus, 2n2gpus

* Fix flake8

* Fixes #645 (#646)

- fix CI and improve create_lr_scheduler_with_warmup

* Fix tests for python 2.7

* Finalized Cifar10 example (#649)

* Added gcp tb logger image and updated README

* Added gcp ai platform scripts to run trainings

* Improved docs and readmes