U-Net for medical image segmentation, optimised for Graphcore's IPU.
| Framework | Domain | Model | Datasets | Tasks | Training | Inference | Reference |
|-----------|--------|-------|----------|-------|----------|-----------|-----------|
| TensorFlow 2 | Vision | U-Net | ISBI Challenge 2012 | Image segmentation | ✅ | ✅ | [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) |
- Install and enable the Poplar SDK (see Poplar SDK setup)
- Install the system and Python requirements (see Environment setup)
- Download the ISBI Challenge 2012 dataset (see Dataset setup)
To check if your Poplar SDK has already been enabled, run:

```bash
echo $POPLAR_SDK_ENABLED
```
If no path is printed, the SDK has not been enabled; follow these steps:
- Navigate to your Poplar SDK root directory
- Enable the Poplar SDK with:

  ```bash
  cd poplar-<OS version>-<SDK version>-<hash>
  . enable.sh
  ```

- Additionally, enable PopART with:

  ```bash
  cd popart-<OS version>-<SDK version>-<hash>
  . enable.sh
  ```
More detailed instructions on setting up your Poplar environment are available in the Poplar quick start guide.
To prepare your environment, follow these steps:
- Create and activate a Python 3 virtual environment:

  ```bash
  python3 -m venv <venv name>
  source <venv path>/bin/activate
  ```
- Navigate to the Poplar SDK root directory
- Install the TensorFlow 2 and IPU TensorFlow add-ons wheels for the CPU architecture you are running on:

  ```bash
  cd <poplar sdk root dir>
  pip3 install tensorflow-2.X.X...<OS_arch>...x86_64.whl
  pip3 install ipu_tensorflow_addons-2.X.X...any.whl
  ```

- Install the Keras wheel:

  ```bash
  pip3 install --force-reinstall --no-deps keras-2.X.X...any.whl
  ```
For further information on Keras on the IPU, see the documentation and the tutorial.
- Navigate to this example's root directory
- Install the Python requirements:

  ```bash
  pip3 install -r requirements.txt
  ```

- Build the custom ops:

  ```bash
  make
  ```
More detailed instructions on setting up your TensorFlow 2 environment are available in the TensorFlow 2 quick start guide.
Download the dataset from the source. The training data is a set of 30 sections from a serial section Transmission Electron Microscopy (ssTEM) data set of the Drosophila first instar larva ventral nerve cord (VNC).
The data preprocessing includes normalization and data augmentation.
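As a rough illustration of what such preprocessing can look like (the function below is a hypothetical sketch; the exact transforms used by this example live in its dataset code), a paired image/label augmentation might be:

```python
import tensorflow as tf

def augment(image, label):
    # Stack the image and its segmentation mask so that every random
    # transform is applied to both consistently.
    both = tf.concat([image, label], axis=-1)
    both = tf.image.random_flip_left_right(both)
    both = tf.image.random_flip_up_down(both)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    both = tf.image.rot90(both, k)
    image, label = both[..., :1], both[..., 1:]
    # Normalise pixel intensities to [0, 1].
    image = tf.cast(image, tf.float32) / 255.0
    return image, label
```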
Disk space required: 5MB
After downloading, the directory structure should look like this:

```
.
├── data
└── README.md

1 directory, 1 file
```
To run a tested and optimised configuration and to reproduce the performance shown on our performance results page, use the `examples_utils` module (installed automatically as part of the environment setup) to run one or more benchmarks. The benchmarks are provided in the `benchmarks.yml` file in this example's root directory.
For example:

```bash
python3 -m examples_utils benchmark --spec <path to benchmarks.yml file>
```
Or to run a specific benchmark in the `benchmarks.yml` file provided:

```bash
python3 -m examples_utils benchmark --spec <path to benchmarks.yml file> --benchmark <name of benchmark>
```
For more information on using the examples-utils benchmarking module, please refer to the README.
Sample command to train the model on 4 IPUs:

```bash
python3 main.py --nb-ipus-per-replica 4 --micro-batch-size 1 --gradient-accumulation-count 24 --num-epochs 2100 --train --augment --learning-rate 0.0024
```
The training curve is shown below. The micro batch size is 1 and gradient accumulation is set to 24, giving an effective batch size of 24 images, running for 2100 epochs. The SGD optimiser is used with a momentum of 0.99 and an exponential-decay learning rate schedule. The input dataset is augmented and repeated infinitely. We use a large steps-per-execution value to train multiple epochs in one execution, which improves performance on the IPU.
The 5-fold cross-validation accuracy reached 0.8917 on average with this command:

```bash
python main.py --nb-ipus-per-replica 4 --micro-batch-size 1 --gradient-accumulation-count 24 --num-epochs 2100 --train --augment --learning-rate 0.0024 --eval --kfold 5 --eval-freq 10
```
The inference model can fit on 1 IPU. The following command runs an inference benchmark with host-generated random data on 1 IPU:

```bash
python main.py --nb-ipus-per-replica 1 --micro-batch-size 2 --steps-per-execution 400 --infer --host-generated-data --benchmark
```
By using data parallelism, we can improve throughput by replicating the graph over the 4 IPUs in an M2000. The following command shows this:

```bash
python main.py --nb-ipus-per-replica 1 --micro-batch-size 2 --steps-per-execution 400 --infer --host-generated-data --replicas 4 --benchmark
```
We use the BCE-Dice loss, which combines the Dice loss and binary cross-entropy loss; a sketch is shown below. The accuracy is measured by the Dice score.
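A minimal sketch of such a combined loss, written from the standard definitions rather than copied from this example's code:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # 1 - Dice coefficient, computed in FP32 over all pixels.
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (union + eps)

def bce_dice_loss(y_true, y_pred):
    # Work in FP32 so the per-pixel average stays representable.
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    return bce + dice_loss(y_true, y_pred)
```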
First, for any TensorFlow 2 Keras model, we need to add the following elements to execute the graph on the IPU (a minimal sketch follows the list). More details can be found in this tutorial.
- Import the IPU TensorFlow 2 libraries that come with the Poplar SDK.
- Prepare the dataset for inference, training and validation.
- Configure your IPU system. This sets up a single or multi-IPU device for execution.
- Create an IPU strategy. This works as a context manager: creating variables and Keras models within the scope of the `IPUStrategy` will ensure that they are placed on the IPU.
- Run the Keras model within the IPU strategy scope.
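As a minimal sketch (the model and dataset here are placeholders, not this example's code):

```python
from tensorflow import keras
from tensorflow.python import ipu

# 1. Configure the IPU system: auto-select a single IPU.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

# 2. Create an IPU strategy and build/run the model inside its scope so
#    that variables and the compiled graph are placed on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    model = keras.Sequential([keras.layers.Dense(1)])
    model.compile(loss="mse", optimizer="sgd", steps_per_execution=16)
    # model.fit(dataset, epochs=...) now executes on the IPU.
```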
We take advantage of a number of memory optimisation techniques in order to train on 512x512 sized images. These are reviewed below.
Model pipelining is a technique for splitting a model across multiple devices (known as model parallelism). For simple examples of how to pipeline models in Keras, take a look at the Code examples.
You should consider using model parallel execution if your model goes out of memory (OOM) on a single IPU, assuming that you cannot reduce your micro batch size. Some of the techniques to optimise the pipeline can be found in Optimising the pipeline.
To pipeline your model, you first need to decide how to split the layers of the model across the IPUs you have available. In this example, we split the model over 4 IPUs and place a stage on each IPU.
The splits are shown in `set_pipeline_stages(model)` in `model_utils.py`. All the layers are named using the `name` parameter in `model.py`. You can call `assignment.layer.name.startswith("NAME")` inside the pipeline stage assignment to move to the next pipeline stage.
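A simplified sketch of this pattern (the prefixes below are hypothetical; the real split points are in `model_utils.py`):

```python
def set_pipeline_stages(model):
    # Hypothetical layer-name prefixes marking the start of stages 1-3.
    split_prefixes = ["down3", "bottleneck", "up3"]
    assignments = model.get_pipeline_stage_assignment()
    stage = 0
    for assignment in assignments:
        # Move to the next pipeline stage when a split point is reached.
        if stage < len(split_prefixes) and \
                assignment.layer.name.startswith(split_prefixes[stage]):
            stage += 1
        assignment.pipeline_stage = stage
    model.set_pipeline_stage_assignment(assignments)
```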
Finding the optimal pipeline split can be an empirical process, as some splits give a better distribution of memory over the IPUs than others. When memory allows, a good pipeline split also needs to balance compute to reduce latency. PopVision is a useful tool to visualise the memory distribution and execution trace. The pipeline split of this U-Net model is shown in the figure below.
We train in 16-bit floating point precision (FP16) to reduce the memory needed for parameters and activations. We set the datatype using `keras.backend.set_floatx(args.dtype)`.
If using FP16, you may find that the loss and accuracy for large images go out of the representable range of FP16. The loss evaluates the class predictions for each pixel vector individually and then averages over all pixels. The accuracy is defined as the percentage of pixels in an image that are classified correctly. For both calculations, we need to divide by the total number of pixels in the image, which can often be out of the representable range for FP16. Therefore the datatype for the loss and accuracy is explicitly set to FP32.
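A sketch of this split precision, reusing the `bce_dice_loss` sketch above (the names are illustrative):

```python
import tensorflow as tf
from tensorflow import keras

# Model weights and activations in FP16.
keras.backend.set_floatx("float16")

# Loss and accuracy explicitly in FP32: the casts inside bce_dice_loss
# keep the per-pixel average representable, and the metric's dtype keeps
# its running division in FP32.
accuracy = keras.metrics.BinaryAccuracy(dtype=tf.float32)
# model.compile(loss=bce_dice_loss, metrics=[accuracy], ...)
```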
Partials are the results of intermediate calculations in convolution and matrix multiplication operations. By default these are kept in FP32. However, for this model we set them to FP16 using the `partialsType` option in the convolution and matmul options (`utils.py`).
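A sketch of how these options can be set through `IPUConfig` (this example sets them in `utils.py`):

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# Keep the intermediate accumulations of convolutions and matmuls in FP16.
config.convolutions.poplar_options["partialsType"] = "half"
config.matmuls.poplar_options["partialsType"] = "half"
config.configure_ipu_system()
```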
To mitigate against very small gradient updates that cannot be represented in FP16 (and instead underflow to become zero), we use loss scaling with a fixed value. This means that the loss is multiplied by a constant (in this case, 128), which scales the gradient updates by the same factor, pushing them into the representable range. After backpropagation, before the weight update, the gradients are correspondingly scaled by 1/128. We use the loss scaling that is native to tf.keras: `tf.keras.mixed_precision.LossScaleOptimizer()`.
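A sketch of fixed loss scaling, using the optimiser settings quoted in the training section above (momentum 0.99, learning rate 0.0024):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.0024, momentum=0.99)
# Fixed (non-dynamic) loss scaling: multiply the loss by 128 and scale
# the gradients back by 1/128 before the weight update.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    optimizer, dynamic=False, initial_scale=128)
```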
Tensors are stored in memory (referred to as "live") as long as they are required. By default, we need to store the activations in the forward pass until they are consumed during backpropagation. In the case of pipelining, tensors can be kept live for quite a long time. We can reduce this liveness by recomputing the activations.
Rather than storing all the activations within a pipeline stage, we retain only the activations that feed the input of the stage (called a "stash"). The other internal activations within the stage are calculated from the stashes just before they are needed in the backward pass for a given micro batch. The stash size is equivalent to the number of pipeline stages, as that reflects the number of micro batches being processed in parallel. Hence as you increase the number of stages in a pipeline, the stash overhead also increases accordingly.
Recomputation can be enabled by setting the `allow_recompute` option in `IPUConfig`. Enabling this option can reduce memory usage at the expense of extra computation. For smaller models, it can allow us to increase the micro batch size and therefore efficiency.
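A minimal sketch:

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# Recompute activations in the backward pass instead of storing them all.
config.allow_recompute = True
config.configure_ipu_system()
```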
A demonstration of the pipeline recomputation can be found in Recomputation.
To modify the execution behaviour of convolutions, options can be found in the "Convolution options" documentation. We adjust some of these in `set_convolution_options` to reduce memory usage (a sketch follows the list):
- Change the `availableMemoryProportion`. This is the proportion of IPU memory that can be used as temporary memory by a convolution or matrix multiplication. The default proportion is set to 0.6, which aims to balance execution speed against memory. To fit larger models on the IPU, a good first step is to lower the available memory proportion to force the compiler to optimise for memory use over execution speed. Less temporary memory means more cycles are needed to execute. It also increases always-live memory, as more control code is needed to plan the split calculations. Reducing this value too far can result in OOM, so we recommend setting it to a value greater than 0.05.
- Change the `partialsType` data type used for intermediate calculations (see Reduced Precision above).
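A sketch of both settings (the value 0.3 is illustrative, not the value used in this example):

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
for opts in (config.convolutions.poplar_options,
             config.matmuls.poplar_options):
    # Lower the temporary-memory budget from the default 0.6.
    opts["availableMemoryProportion"] = "0.3"
    # Keep intermediate accumulations in FP16 (see Reduced Precision).
    opts["partialsType"] = "half"
config.configure_ipu_system()
```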
There are various pipeline schedules available. They each have different benefits in terms of memory use, cycle balance across IPUs, and other available optimisations. More details can be found in Pipeline scheduling. For U-Net, we use the Interleaved schedule when the model does not fit in memory with the Grouped schedule. The Interleaved schedule usually uses less memory and requires less buffering between stages than the default Grouped pipeline schedule. When the model fits, the Grouped schedule gives much better throughput than the Interleaved schedule.
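A hedged sketch of selecting the schedule on a pipelined Keras model, assuming `set_pipelining_options` forwards the `pipeline_schedule` keyword to the underlying pipeline op:

```python
from tensorflow.python.ipu.ops import pipelining_ops

# Fall back to the Interleaved schedule when Grouped goes out of memory.
model.set_pipelining_options(
    gradient_accumulation_steps_per_replica=24,
    pipeline_schedule=pipelining_ops.PipelineSchedule.Interleaved)
```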
To further reduce the memory usage, we can also change the `internalExchangeOptimisationTarget` from the default `cycles` to `memory`. "Exchange" refers to the communication phase between IPUs, which is pre-planned by the compiler during graph compilation. We can influence the planning of exchanges to optimise for memory and/or throughput. More details can be found in the list of Engine creation options: Optimisations. When a model fits on the IPUs, the `cycles` setting can achieve better speed than the `memory` and `balanced` options.
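A sketch using the engine-level compilation options in `IPUConfig`:

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# Plan internal exchanges for memory instead of the default "cycles".
config.compilation_poplar_options[
    "opt.internalExchangeOptimisationTarget"] = "memory"
config.configure_ipu_system()
```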
We use the `ipu_tensorflow_addons.keras.layers.Dropout` layer, rather than the one native to `keras`. Our custom dropout is designed to use less memory by not storing the dropout mask between forward and backward passes.
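It is a drop-in replacement (the surrounding layers here are placeholders):

```python
from tensorflow import keras
from ipu_tensorflow_addons.keras.layers import Dropout

inputs = keras.Input((512, 512, 1))
x = keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
# IPU-optimised dropout: the mask is not kept live between the forward
# and backward passes.
x = Dropout(rate=0.5)(x)
```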
We can control each pipeline stage using `forward_propagation_stages_poplar_options` and `backward_propagation_stages_poplar_options`. Looking at the memory report from profiling, for the stages that do not fit on the IPU we can try to change the available memory proportion on that stage, as in `get_pipeline_stage_options` in `utils.py`. More details about this option can be found in Profiling.
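A hypothetical sketch for a 4-stage pipeline (the proportions are illustrative, not the values used in `utils.py`):

```python
from tensorflow.python.ipu.ops import pipelining_ops

def stage_options(proportions):
    # One PipelineStageOptions per stage, each with its own memory budget.
    return [pipelining_ops.PipelineStageOptions(
                convolution_options={"availableMemoryProportion": p},
                matmul_options={"availableMemoryProportion": p})
            for p in proportions]

model.set_pipelining_options(
    forward_propagation_stages_poplar_options=stage_options(
        ["0.2", "0.6", "0.6", "0.2"]),
    backward_propagation_stages_poplar_options=stage_options(
        ["0.2", "0.6", "0.6", "0.2"]))
```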
Some optimisers have an optimiser state which is only accessed and modified during the weight update. The optimiser state variables do not need to be stored in device memory during the forward and backward propagation of the model, as they are only required during the weight update. Therefore, they are streamed onto the device during the weight update and then streamed back to remote memory after they have been updated. This feature is enabled by default for pipelined models. If memory allows, disabling this option by setting `offload_weight_update_variables=False` in the pipeline options can increase throughput, because no communication between remote buffers and the IPU device is needed.
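A hedged sketch on a pipelined Keras model:

```python
# If memory allows, keep the optimiser state in device memory instead of
# streaming it to and from remote buffers at every weight update.
model.set_pipelining_options(
    gradient_accumulation_steps_per_replica=24,
    offload_weight_update_variables=False)
```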
The throughput bottleneck in this U-Net model is the skip connections between IPUs. The large activations calculated on the first and second IPUs are passed through the pipeline to the third and fourth IPUs. To reduce the cycles needed to transfer data between IPUs, we can crop the activations before sending them to be concatenated on later stages. The default behaviour of `tf.image.central_crop` is an in-place slice of the original image, which keeps the whole tensor live and takes the same memory as before the crop. We created a custom op in `CentralCrop_ipu.py` for the `central_crop` (see the Poplar code in the `/custom_crop_op` folder). In this custom op, the slice is no longer in place: the sliced part is copied to a new, smaller tensor, eliminating the large input tensor. This reduces memory usage and the amount of data to transfer, and as a result improves throughput.