-
Notifications
You must be signed in to change notification settings - Fork 491
Multinode guide
This is an introduction to multi-node training with Intel® Distribution of Caffe* framework. All other pages related to multi-node in this wiki are supplementary and they are referred to in this guide. By the end of it, you should understand how multi-node training was implemented in Intel® Distribution of Caffe* and be able to train any topology yourself on a simple cluster. Basic knowledge of BVLC Caffe usage might be necessary to understand it fully. Also be sure to check out the performance optimization guidelines.
To make the practical part of this guide more comprehensible, the instructions assume you have configured from scratch a cluster comprising 4 nodes. You will learn how to configure such a cluster, how to compile Intel® Distribution of Caffe*, how to run a training of a particular model, and how to verify the network actually have trained.
In case you are not interested in how multi-node in Intel® Distribution of Caffe* works and just want to run the training, please skip to the practical part chapter of this Wiki.
Intel® Distribution of Caffe* is designed for both single-node and multi-node operation. Here, the multi-node part is explained.
There are two general approaches to parallelization. Data parallelism and model parallelism. The approach used in Intel® Distribution of Caffe* is the data parallelism.
The data parallelization technique runs training on different batches of data on each of the nodes. The data is split among all nodes but the same model is used. It means that the total batch size in a single iteration is equal to the sum of individual batch sizes of all nodes. For example a network is trained on 8 nodes. All of them have batch size of 128
. The (total) batch size in a single iteration of the Stochastic Gradient Descent algorithm is 8*128=1024
.
Intel® Distribution of Caffe* with MLSL offers two approaches for multi-node training:
- Default - Caffe does Allreduce operation for gradients and then each node is doing SGD locally, followed by Allgather for weights increments.
- Distributed weights update - Caffe does Reduce-Scatter operation for gradients, then each node is doing SGD locally, followed by Allgather for weights increments.
One approach is to divide your training data set into disjoint subsets of roughly equal size. Distribute each subset into each node used for training. Run the multinode training with data layer prepared accordingly, which means either preparing separate proto configurations or placing each subset in exactly the same path for each node.
An easier approach is to simply distribute the full data set on all nodes and configure data layer to draw different subset on each node. Remember to set shuffle:true for the training phase in prototxt. Since each node has its own unique randomizing seed, it will effectively draw unique image subset.
Intel® Distribution of Caffe* is utilizing Intel® Machine Learning Scaling Library (MLSL) which provides communication primitives for data parallelism and model parallelism, communication patterns for SGD and its variants (AdaGrad, Momentum, etc), distributed weight update. It is optimized for Intel® Xeon® and Intel® Xeon Phi (TM) processors and supports Intel® Omni-Path Architecture, Infiniband and Ethernet. Refer to MLSL Wiki or "MLSL Developer Guide and Reference" for more details on the library.
Snapshots are saved only by the node hosting the root process (rank number 0). In order to resume training from a snapshot the file has to be populated across all nodes participating in a training.
If test phase is enabled in the solver’s protobuf file all the nodes are caring out the tests and results are aggregated by Allreduce operation. The validation set needs to be present on every machine which have test phase specified in solver protobuf file. This is important because when you want to use the same solver file on all machines instead of working with multiple protobuf files you need to remember about that.
This chapter explains how to configure a cluster, and what components to install in order to build Intel® Distribution of Caffe* to start distributed training using Intel® Machine Learning Scaling Library.
Hardware assumptions for this guide: 4 machines with IP addresses in the range from 192.161.32.1
to 192.161.32.4
up the cluster
Start from fresh installation of CentOS 7.2 64-bit. The OS image can be downloaded free of charge from the [official website](https://www.c entos.org/download). Minimal ISO is enough. You should install the OS on each node (all 4 in our example). Next upgrade to the latest version of packages (do it on each node):
$ yum upgrade
TIP: You can also execute yum -y upgrade
to suppress the prompt asking for confirmation of the operation (unattended upgrade).
Before installing Intel® Distribution of Caffe* you need to install prerequisites. Start by choosing the master machine (e.g. 192.161.32.1
in our example).
On each machine install “Extra Packages for Enterprise Linux”:
$ yum install epel-release
$ yum clean all
On master machine install "Development Tools" and ansible:
$ yum groupinstall "Development Tools"
$ yum install ansible
Configure ansible's inventory on master machine by adding sections ourmaster and ourcluster in /etc/ansible/
hosts and fill in slave IPs:
[ourmaster]
192.161.31.1
[ourcluster]
192.161.32.[2:4]
On each slave machine configure SSH authentication using master machine’s public key, so that you can log in with ssh without a password. Generate RSA key on master machine:
$ ssh-keygen -t rsa
And copy the public part of the key to slave machines:
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.3
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.4
Verify ansible works by running ping command from master machine. The slave machines should respond.
$ ansible ourcluster -m ping
Example output:
192.168.31.2 | SUCCESS => {
“changed“: false,
“ping“: “pong“
}
192.168.31.3 | SUCCESS => {
“changed“: false,
“ping“: “pong“
}
192.168.31.4 | SUCCESS => {
“changed“: false,
“ping“: “pong“
}
Master machine can also ping itself by ansible ourmaster -m ping
and entire inventory by ansible all -m ping
.
On master machine use ansible to install packages listed by running the command below for the entire cluster.
$ ansible all -m shell -a 'yum -y install python-devel boost boost-devel cmake numpy \
numpy-devel gflags gflags-devel glog glog-devel protobuf protobuf-devel hdf5 hdf5-devel \
lmdb lmdb-devel leveldb leveldb-devel snappy-devel opencv opencv-devel'
Optionally you can install additional system tools you may find usefull.
$ ansible all -m shell -a 'yum install -y mc cpuinfo htop tmux screen iftop iperf \
vim wget'
You might be required to turn off the firewall on each node (refer to Firewalls and MPI for more information), too.
$ ansible all -m shell -a 'systemctl stop firewalld.service'
The cluster is ready to deploy binaries of Intel® Distribution of Caffe*. Let’s build it now.
This chapter explains how to build Intel® Distribution of Caffe* for multi-node (distributed) training of neural networks.
Download the MLSL Beta 2017 package to master machine. Use ansible to populate installation package to the remaining cluster nodes.
$ ansible ourcluster -m synchronize -a 'src=~/intel-mlsl-devel-64-2017.0-006.x86_64.rpm dest=~/'
Install MLSL on each node in the cluster.
$ ansible all -m shell -a 'rpm -i ~/intel-mlsl-devel-64-2017.0-006.x86_64.rpm'
On master machine execute the following git command in order to obtain the latest snapshot of Intel® Distribution of Caffe* including multi-node support for distributed training.
$ git clone https://github.com/intel/caffe.git intelcaffe
Configure Intel® Machine Learning Scaling Library for the build.
$ source /opt/intel/mlsl_2017.0.003/intel64/bin/mlslvars.sh
Intel® Distribution of Caffe* with MLSL 2017 Beta requires MPI. Choose one option from two available below – MPI development files are required for the build only!
a) Use Intel® MPI for the build
$ source /opt/intel/impi/<version>/bin64/mpivars.sh release
b) Use MPICH for the build (the development package needs to be installed on master machine only – MPICH is not required to run Intel® Distribution of Caffe*)
$ yum install mpich mpich-devel
Note: Build of Intel® Distribution of Caffe* will trigger Intel® Math Kernel Library for Machine Learning (MKLML) to be downloaded to the intelcaffe/external/mkl/
directory and automatically configured.
This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using Makefile. Please refer to Caffe documentation for general information on how to build Caffe using Makefile.
Start by changing work directory to the location where Intel® Distribution of Caffe* repository have been downloaded (e.g. ~/intelcaffe
).
$ cd ~/intelcaffe
Make a copy of Makefile.config.example
, and name it Makefile.config
$ cp Makefile.config.example Makefile.config
Open Makefile.config
in your favorite editor and uncomment USE_MLSL
variable.
# Intel(r) Machine Learning Scaling Library (uncomment to build with MLSL)
USE_MLSL := 1
Execute make command to build Intel® Distribution of Caffe* with multi-node support.
$ make -j <number_of_physical_cores> -k
This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using CMake. Please refer to Caffe documentation for general information on how to build Caffe using CMake.
Start by changing work directory to the location where Intel® Distribution of Caffe* repository have been downloaded (e.g. ~/intelcaffe
).
$ cd ~/intelcaffe
Create build directory and change work directory to build directory.
$ mkdir build
$ cd build
Execute the following CMake command in order to prepare the build
$ cmake .. -DBLAS=mkl -DUSE_MLSL=1 -DCPU_ONLY=1
Execute make command to build Intel® Distribution of Caffe* with multi-node support.
$ make -j <number_of_physical_cores> -k
$ cd ..
After successful build synchronize intelcaffe directories on the slave machines.
$ ansible ourcluster -m synchronize -a ‘src=~/intelcaffe dest=~/’
Instructions on how to train CIFAR10 and GoogLeNet are explained in more details in Multi-node CIFAR10 tutorial and Multi-node GoogLeNet tutorial. It is recommended to do CIFAR10 tutorial before you proceed. Here, the GoogLeNet will be trained on 4 node cluster. If you want to learn more about GoogLeNet training see the tutorial mentioned above as well.
Before you can train anything you need to prepare the dataset. It is assumed that you have already downloaded the ImageNet training and validation datasets, and they are stored on each node in /home/data/imagenet/train
directory for training set and /home/data/imagenet/val
for validation set. For details you can look at the Data Preparation section of BVLC Caffe examples at http://caffe.berkeleyvision.org/gathered/examples/imagenet.html. You can use your own data sets as well.
Next step is to create machine file ~/mpd.hosts
on master node for controlling the placement of MPI process across the machines:
192.161.32.1
192.161.32.2
192.161.32.3
192.161.32.4
Update your model file models/bvlc_googlenet/train_val_client.prototxt
:
name: "GoogleNet"
layer {
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 224
mean_value: 104
mean_value: 117
mean_value: 123
}
image_data_param {
source: "/home/data/train.txt"
batch_size: 256
shuffle: true
}
}
layer {
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
crop_size: 224
mean_value: 104
mean_value: 117
mean_value: 123
}
image_data_param {
source: "/home/data/val.txt"
batch_size: 50
new_width: 256
new_height: 256
}
}
Synchronize the intelcaffe
directories, change your working directory to intelcaffe
and start the training process with the following command:
$ mpirun -n 4 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \
--solver=models/bvlc_googlenet/solver_client.prototxt -param_server=mlsl \
2>&1 | tee ~/intelcaffe/multinode_train.out
Log from the training process will be written to multinode_train.out file.
When the training is finished, you can test how your network has trained with the following command:
$ ./build/tools/caffe test --model=models/bvlc_googlenet/train_val_client.prototxt
--weights=multinode_googlenet_iter_100000.caffemodel --iterations=1000
Look at the bottom lines of output from the above command which contains loss3/top-1
and loss3/top-5
. The values should be around loss3/top-1 = 0.69
and loss3/top-5 = 0.886
.
For more information about caffe test visit Caffe interfaces website at http://caffe.berkeleyvision.org/tutorial/interfaces.html.