Distrib (#635)
* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Test on 2 GPUS single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Updated travis and renamed tbptt test gpu -> cuda

* Distrib (#573)

* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Test on 2 GPUS single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Fixes issue #543 (#572)

* Fixes issue #543

Previous CM implementation suffered from the problem if target contains non-contiguous indices.
New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class.
Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0)  = no classes (background class),
then confusion matrix does not count any true/false predictions.

* Update confusion_matrix.py

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Added mlflow logger (#558)

* Added mlflow logger without tests

* Added mlflow tests, updated mlflow logger code and other tests

* Updated docs and added mlflow in travis

* Added tests for mlflow OptimizerParamsHandler
- additionally added OptimizerParamsHandler for plx with tests

* Update to PyTorch v1.2.0 (#580)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fix SSL problem of failing travis (#581)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fixes SSL problem to download model weights

* Fixed travis for deploy and nightly

* Fixes #583 (#584)

* Fixes docs build warnings (#585)

* Return removable handle from Engine.add_event_handler(). (#588)

* Add tests for event removable handle.

Add feature tests for engine.add_event_handler returning removable event
handles.

* Return RemovableEventHandle from Engine.add_event_handler.

* Fixup removable event handle test in python 2.7.

Explicitly trigger gc, allowing cycle detection between engine and
state, in removable handle weakref test. Python 2.7 cycle detection
appears to be less aggressive than python 3+.

* Add removable event handler docs.

Add autodoc configuration for RemovableEventHandler, expand "concepts"
documentation with event remove example following event add example.

* Update concepts.rst

* Updated travis and renamed tbptt test gpu -> cuda

* Compute IoU, Precision, Recall based on CM on CPU

* Fixes incomplete merge with 1856c8e

* Update distrib branch and CIFAR10 example (#647)

* Added tests with gloo, minor updates and fixes

* Added single/multi node tests with gloo and [WIP] with nccl

* Added tests for multi-node nccl, improved examples/contrib/cifar10 example

* Experiments: 1n1gpu, 1n2gpus, 2n2gpus

* Fix flake8

* Fixes #645 (#646)

- fix CI and improve create_lr_scheduler_with_warmup

* Fix tests for python 2.7

* Finalized Cifar10 example (#649)

* Added gcp tb logger image and updated README

* Added gcp ai platform scripts to run trainings

* Improved docs and readmes
vfdev-5 authored Oct 24, 2019
1 parent e223e9e commit 53190db
Showing 52 changed files with 3,392 additions and 347 deletions.
13 changes: 11 additions & 2 deletions .travis.yml
@@ -37,13 +37,16 @@ before_install: &before_install

install:
- python setup.py install
- pip install numpy mock pytest codecov pytest-cov
- pip install numpy mock pytest codecov pytest-cov pytest-xdist
# Examples dependencies
- pip install matplotlib pandas
- pip install gym==0.10.11

script:
- py.test --cov ignite --cov-report term-missing
- CUDA_VISIBLE_DEVICES="" py.test --tx 4*popen//python=python$TRAVIS_PYTHON_VERSION --cov ignite --cov-report term-missing -vvv tests/
# Run test on cuda device
# As no GPUs on travis -> all tests will be skipped
- CUDA_VISIBLE_DEVICES=0 py.test --cov ignite --cov-append --cov-report term-missing -vvv tests/ -k "on_cuda"

# Smoke tests for the examples
# Mnist
@@ -72,6 +75,12 @@ script:
- mkdir -p /home/travis/.cache/torch/checkpoints/ && wget "https://download.pytorch.org/models/vgg16-397923af.pth" -O/home/travis/.cache/torch/checkpoints/vgg16-397923af.pth
- python examples/fast_neural_style/neural_style.py train --epochs 1 --cuda 0 --dataset test --dataroot . --image_size 32 --style_image examples/fast_neural_style/images/style_images/mosaic.jpg --style_size 32

# tests for distributed ops
# As no GPUs on travis -> all tests will be skipped
# 2 is the number of processes <-> number of available GPUs
- export WORLD_SIZE=2
- py.test --cov ignite --cov-append --cov-report term-missing --dist=each --tx $WORLD_SIZE*popen//python=python$TRAVIS_PYTHON_VERSION tests -m distributed -vvv

after_success:
- codecov

38 changes: 36 additions & 2 deletions README.rst
@@ -95,8 +95,42 @@ The code in **ignite.contrib** is not as fully maintained as the core part of th

Examples
========
Please check out the `examples
<https://github.com/pytorch/ignite/tree/master/examples>`_ to see how to use `ignite` to train various types of networks, as well as how to use `visdom <https://github.com/facebookresearch/visdom>`_ or `tensorboardX <https://github.com/lanpa/tensorboard-pytorch>`_ for training visualizations.

We provide several examples ported from `pytorch/examples <https://github.com/pytorch/examples>`_ using `ignite`
to display how it helps to write compact and full-featured training loops in a few lines of code:

MNIST example
--------------

Basic neural network training on MNIST dataset with/without `ignite.contrib` module:

- `MNIST with ignite.contrib TQDM/Tensorboard/Visdom loggers <https://github.com/pytorch/ignite/tree/master/examples/contrib/mnist>`_
- `MNIST with native TQDM/Tensorboard/Visdom logging <https://github.com/pytorch/ignite/tree/master/examples/mnist>`_

Distributed CIFAR10 example
---------------------------

Training a small variant of ResNet on CIFAR10 in various configurations: 1) single GPU, 2) single node with multiple GPUs, 3) multiple nodes with multiple GPUs.

- `CIFAR10 <https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10>`_


Other examples
--------------

- `DCGAN <https://github.com/pytorch/ignite/tree/master/examples/gan>`_
- `Reinforcement Learning <https://github.com/pytorch/ignite/tree/master/examples/reinforcement_learning>`_
- `Fast Neural Style <https://github.com/pytorch/ignite/tree/master/examples/fast_neural_style>`_


Notebooks
---------

- `Text Classification using Convolutional Neural Networks <https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb>`_
- `Variational Auto Encoders <https://github.com/pytorch/ignite/blob/master/examples/notebooks/VAE.ipynb>`_
- `Training Cycle-GAN on Horses to Zebras <https://github.com/pytorch/ignite/blob/master/examples/notebooks/CycleGAN.ipynb>`_
- `Finetuning EfficientNet-B0 on CIFAR100 <https://github.com/pytorch/ignite/blob/master/examples/notebooks/EfficientNet_Cifar100_finetuning.ipynb>`_
- `Convolutional Neural Networks for Classifying Fashion-MNIST Dataset <https://github.com/pytorch/ignite/blob/master/examples/notebooks/FashionMNIST.ipynb>`_


Contributing
26 changes: 21 additions & 5 deletions docs/source/examples.rst
@@ -1,17 +1,33 @@
Examples
========

Scripts
-------

There are several examples ported from `pytorch/examples <https://github.com/pytorch/examples>`_ using `ignite`
We provide several examples ported from `pytorch/examples <https://github.com/pytorch/examples>`_ using `ignite`
to display how it helps to write compact and full-featured training loops in a few lines of code:

- `Mnist <https://github.com/pytorch/ignite/tree/master/examples/mnist>`_
MNIST example
-------------

Basic neural network training on MNIST dataset with/without `ignite.contrib` module:

- `MNIST with ignite.contrib TQDM/Tensorboard/Visdom loggers <https://github.com/pytorch/ignite/tree/master/examples/contrib/mnist>`_
- `MNIST with native TQDM/Tensorboard/Visdom logging <https://github.com/pytorch/ignite/tree/master/examples/mnist>`_

Distributed CIFAR10 example
---------------------------

Training a small variant of ResNet on CIFAR10 in various configurations: 1) single GPU, 2) single node with multiple GPUs, 3) multiple nodes with multiple GPUs.

- `CIFAR10 <https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10>`_


Other examples
--------------

- `DCGAN <https://github.com/pytorch/ignite/tree/master/examples/gan>`_
- `Reinforcement Learning <https://github.com/pytorch/ignite/tree/master/examples/reinforcement_learning>`_
- `Fast Neural Style <https://github.com/pytorch/ignite/tree/master/examples/fast_neural_style>`_


Notebooks
---------

183 changes: 150 additions & 33 deletions docs/source/metrics.rst
@@ -7,65 +7,182 @@ fashion without having to store the entire output history of a model.
In practice a user needs to attach the metric instance to an engine. The metric
value is then computed using the output of the engine's `process_function`:

.. code-block:: python

    def process_function(engine, batch):
        # ...
        return y_pred, y

    engine = Engine(process_function)
    metric = Accuracy()
    metric.attach(engine, "accuracy")

If the engine's output is not in the format `y_pred, y`, the user can
use the `output_transform` argument to transform it:

.. code-block:: python

    def process_function(engine, batch):
        # ...
        return {'y_pred': y_pred, 'y_true': y, ...}

    engine = Engine(process_function)

    def output_transform(output):
        # `output` variable is returned by above `process_function`
        y_pred = output['y_pred']
        y = output['y_true']
        return y_pred, y  # output format is according to `Accuracy` docs

    metric = Accuracy(output_transform=output_transform)
    metric.attach(engine, "accuracy")

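
To make the role of `output_transform` concrete, here is a minimal plain-Python sketch that does not depend on ignite. The `TinyAccuracy` class and its `on_iteration` method are hypothetical stand-ins and not ignite's implementation. The sketch only assumes that the transform is applied to the engine's output before the metric's update step:

```python
# Hypothetical stand-in for a metric; this is NOT ignite's implementation.
class TinyAccuracy:
    def __init__(self, output_transform=lambda x: x):
        self._output_transform = output_transform
        self._num_correct = 0
        self._num_examples = 0

    def on_iteration(self, engine_output):
        # The transform maps whatever the process function returned
        # into the (y_pred, y) pair expected by the update step.
        y_pred, y = self._output_transform(engine_output)
        self.update(y_pred, y)

    def update(self, y_pred, y):
        # Predictions are plain class indices in this sketch.
        self._num_correct += sum(int(p == t) for p, t in zip(y_pred, y))
        self._num_examples += len(y)

    def compute(self):
        return self._num_correct / self._num_examples


def output_transform(output):
    # Pick the two entries the metric needs and ignore the rest.
    return output["y_pred"], output["y_true"]


metric = TinyAccuracy(output_transform=output_transform)
metric.on_iteration({"y_pred": [1, 0, 2, 2], "y_true": [1, 1, 2, 0], "loss": 0.7})
accuracy = metric.compute()
print(accuracy)
```

The extra `"loss"` entry in the dictionary is dropped by the transform, so the metric only ever sees the `(y_pred, y)` pair.
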
.. Note ::

    Most of the implemented metrics are adapted to distributed computations and reduce their internal states across the GPUs
    before computing the metric value. This can be helpful to run the evaluation on multiple nodes/GPU instances with a
    distributed data sampler. The following code snippet shows how to adapt metrics:

    .. code-block:: python

        device = "cuda:{}".format(local_rank)
        model = torch.nn.parallel.DistributedDataParallel(model,
                                                          device_ids=[local_rank, ],
                                                          output_device=local_rank)
        test_sampler = DistributedSampler(test_dataset)

        test_loader = DataLoader(test_dataset, batch_size=batch_size, sampler=test_sampler,
                                 num_workers=num_workers, pin_memory=True)

        evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy(device=device)}, device=device)

Metric arithmetics
------------------

Metrics can be combined to form new metrics. This can be done with arithmetic, such
as ``metric1 + metric2``, with PyTorch operators, such as ``(metric1 + metric2).pow(2).mean()``,
or with a lambda function, such as ``MetricsLambda(lambda a, b: torch.mean(a + b), metric1, metric2)``.

For example:

.. code-block:: python

    precision = Precision(average=False)
    recall = Recall(average=False)
    F1 = (precision * recall * 2 / (precision + recall)).mean()

.. note:: This example computes the mean of F1 across classes. To combine
    precision and recall to get F1 or other F metrics, we have to be careful
    that `average=False`, i.e. to use the unaveraged precision and recall,
    otherwise we will not be computing F-beta metrics.
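
The difference between the two orders of operations can be checked with a small plain-Python calculation. The per-class precision and recall values below are made up for illustration, and the helper `f1` is a hypothetical name:

```python
# Made-up per-class precision and recall for a 3-class problem.
precision = [1.0, 0.5, 0.2]
recall = [0.2, 0.5, 1.0]


def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


# With unaveraged inputs: F1 per class first, then the mean across classes.
f1_per_class = [f1(p, r) for p, r in zip(precision, recall)]
mean_f1 = sum(f1_per_class) / len(f1_per_class)

# With averaged inputs: the harmonic mean of two scalars,
# which is a different quantity.
mean_p = sum(precision) / len(precision)
mean_r = sum(recall) / len(recall)
f1_of_means = f1(mean_p, mean_r)

print(mean_f1, f1_of_means)
```

Here the mean of the per-class F1 scores is 7/18, while the F1 of the averaged precision and recall is 17/30, so averaging too early changes the result.
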

Metrics also support the indexing operation (if the metric's result is a vector/matrix/tensor). For example, this can be useful to compute a mean metric (e.g. precision, recall or IoU) ignoring the background:

.. code-block:: python

    cm = ConfusionMatrix(num_classes=10)
    iou_metric = IoU(cm)
    iou_no_bg_metric = iou_metric[:9]  # We assume that the background index is 9
    mean_iou_no_bg_metric = iou_no_bg_metric.mean()
    # mean_iou_no_bg_metric.compute() -> tensor(0.12345)

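
As a rough illustration of what the indexed metric amounts to, the following plain-Python sketch computes per-class IoU from a small confusion matrix and drops the last class before averaging. The matrix values and the helper `iou_per_class` are hypothetical, and the sketch uses the standard definition IoU = TP / (TP + FP + FN) rather than ignite's code:

```python
# Made-up 3x3 confusion matrix; class index 2 plays the role of the background.
cm = [
    [5, 1, 0],
    [2, 6, 2],
    [1, 1, 8],
]
num_classes = len(cm)


def iou_per_class(matrix):
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        row_sum = sum(matrix[c])                 # everything in row c
        col_sum = sum(row[c] for row in matrix)  # everything in column c
        # Union = row + column - intersection (the diagonal entry).
        ious.append(tp / (row_sum + col_sum - tp))
    return ious


iou = iou_per_class(cm)
iou_no_bg = iou[:num_classes - 1]  # same idea as iou_metric[:9] above
mean_iou_no_bg = sum(iou_no_bg) / len(iou_no_bg)
print(iou, mean_iou_no_bg)
```

Because the union is symmetric in rows and columns, the result does not depend on whether rows hold the true or the predicted classes.
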
How to create a custom metric
-----------------------------

To create a custom metric one needs to create a new class inheriting from :class:`~ignite.metrics.Metric` and override
three methods:

- `reset()` : resets internal variables and accumulators
- `update(output)` : updates internal variables and accumulators with provided batch output `(y_pred, y)`
- `compute()` : computes the custom metric and returns the result

As an illustration, let us implement a multi-class accuracy metric with a
specific condition (e.g. ignoring user-defined classes):

.. code-block:: python

    from ignite.metrics import Metric
    from ignite.exceptions import NotComputableError

    # These decorators help with distributed settings
    from ignite.metrics.metric import sync_all_reduce, reinit__is_reduced


    class CustomAccuracy(Metric):

        def __init__(self, ignored_class, output_transform=lambda x: x, device=None):
            self.ignored_class = ignored_class
            self._num_correct = None
            self._num_examples = None
            super(CustomAccuracy, self).__init__(output_transform=output_transform, device=device)

        @reinit__is_reduced
        def reset(self):
            self._num_correct = 0
            self._num_examples = 0
            super(CustomAccuracy, self).reset()

        @reinit__is_reduced
        def update(self, output):
            y_pred, y = output

            indices = torch.argmax(y_pred, dim=1)

            mask = (y != self.ignored_class)
            mask &= (indices != self.ignored_class)
            y = y[mask]
            indices = indices[mask]
            correct = torch.eq(indices, y).view(-1)

            self._num_correct += torch.sum(correct).item()
            self._num_examples += correct.shape[0]

        @sync_all_reduce("_num_examples", "_num_correct")
        def compute(self):
            if self._num_examples == 0:
                raise NotComputableError('CustomAccuracy must have at least one example before it can be computed.')
            return self._num_correct / self._num_examples

We imported the necessary classes :class:`~ignite.metrics.Metric` and :class:`~ignite.exceptions.NotComputableError`, and the
decorators that adapt the metric to the distributed setting. In the `reset` method, we reset the internal variables `_num_correct`
and `_num_examples`, which are used to compute the custom metric. In the `update` method, we define how to update
the internal variables. Finally, in the `compute` method, we compute the metric value.

We can check this implementation in a simple case:

.. code-block:: python

    import torch
    torch.manual_seed(8)

    m = CustomAccuracy(ignored_class=3)

    batch_size = 4
    num_classes = 5

    y_pred = torch.rand(batch_size, num_classes)
    y = torch.randint(0, num_classes, size=(batch_size, ))

    m.update((y_pred, y))
    res = m.compute()

    print(y, torch.argmax(y_pred, dim=1))
    # Out: tensor([2, 2, 2, 3]) tensor([2, 1, 0, 0])

    print(m._num_correct, m._num_examples, res)
    # Out: 1 3 0.3333333333333333

Metrics and distributed computations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the above example, the `CustomAccuracy` constructor has a `device` argument, and the `reset`, `update` and `compute` methods are decorated with `reinit__is_reduced` and `sync_all_reduce`. The purpose of these features is to adapt metrics to distributed computations on CUDA devices, assuming that the backend supports the `"all_reduce" operation <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`_. The user can specify the device (by default, `cuda`) at the metric's initialization. This device *can* be used to store internal variables and to collect the results from all participating devices. More precisely, in the above example we added `@sync_all_reduce("_num_examples", "_num_correct")` over the `compute` method. This means that when the `compute` method is called, the metric's internal variables `self._num_examples` and `self._num_correct` are summed up over all participating devices. Therefore, once collected, these internal variables can be used to compute the final metric value.
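
The effect of summing the internal variables over all participants can be simulated in plain Python, without `torch.distributed`. The worker states below are made up, and `simulated_all_reduce_sum` is a hypothetical helper that only mimics the outcome of a sum all-reduce (every participant ends up with the same totals); it is not how the decorator is implemented:

```python
# Made-up partial states, as three hypothetical workers might hold them
# after each has processed its own shard of the evaluation data.
local_states = [
    {"_num_correct": 40, "_num_examples": 50},
    {"_num_correct": 30, "_num_examples": 50},
    {"_num_correct": 20, "_num_examples": 25},
]


def simulated_all_reduce_sum(states, names):
    # Sum each named variable over all participants and hand every
    # participant its own copy of the totals.
    totals = {name: sum(state[name] for state in states) for name in names}
    return [dict(totals) for _ in states]


reduced = simulated_all_reduce_sum(local_states, ("_num_examples", "_num_correct"))

# After the reduction, every worker computes the same global accuracy.
global_accuracy = [s["_num_correct"] / s["_num_examples"] for s in reduced]

# For comparison: averaging the per-worker accuracies gives a different
# number when the shards have unequal sizes.
local_accuracy = [s["_num_correct"] / s["_num_examples"] for s in local_states]
mean_of_local = sum(local_accuracy) / len(local_accuracy)

print(global_accuracy, mean_of_local)
```

This is why the accumulators are reduced rather than the final metric values: the totals are 90 correct out of 125 examples on every worker, whereas the mean of the three local accuracies would weight the smaller shard too heavily.
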


Complete list of metrics
------------------------

- :class:`~ignite.metrics.Accuracy`
- :class:`~ignite.metrics.Average`
6 changes: 6 additions & 0 deletions examples/contrib/cifar10/.gitignore
@@ -0,0 +1,6 @@
output
cifar10
.polyaxonignore
.polyaxon
plx_configs
gcp_configs