Final Distributed Benchmarking Results for Sparse
- We launched a cluster of 16 machines using the CloudFormation template. You can find the CloudFormation template here.
- Below we show experiments for different numbers of machines. Each machine runs one worker process and one server process. For example, a single machine runs a single worker and a single server process; with two machines, each of the two machines runs one worker and one server process.
- The AMI id used is ami-153b1b70, built on top of the Deep Learning AMI.
- The mxnet repo sits on EFS, which is mounted on each of the machines (EC2 instances).
- We run all the following experiments in dist_async mode and measure the results. For every experiment with a different dataset we used a fresh CFN cluster: after running the experiments for 1, 2, 4, 8, 12 and 16 machines on one dataset, we terminate the cluster and launch a new one.
- We use sparse_end2end.py for the benchmark.
- The script contains a linear regression model (without the bias term) with softmax output and uses the SGD optimizer to update the weights. The input is a CSR ndarray and the weight matrix is row-sparse (RSP); the operator used is dot(csr, rsp). We run the training for 10 epochs and measure the results. An example usage of the script:
```
python <MXNET_DIR>/benchmark/python/sparse/sparse_end2end.py --output-dim 1024 --batch-size=32784 --dataset=avazu --enable-logging-for=0,1,2,3,4,5,6,7,8,9,10,11 --kvstore=dist_async --num-epoch=10
```
The above command should be launched through the launch.py script as follows:
```
python <MXNET_DIR>/tools/launch.py -n 12 -H /opt/deeplearning/workers <ABOVE_COMMAND>
```
- The first command launches the sparse benchmark for the avazu dataset with output_dim (the number of columns of the weight matrix) set to 1024 and batch_size (the number of rows of the input matrix) set to 32784.
- It also enables logging for workers 0 through 11 and runs in a distributed setting with kvstore set to dist_async.
- The second command launches the script with 12 worker and 12 server processes on the hosts listed in /opt/deeplearning/workers. Here are the contents of the hosts file for 12 workers:
```
deeplearning-worker0
deeplearning-worker1
deeplearning-worker2
deeplearning-worker3
deeplearning-worker4
deeplearning-worker5
deeplearning-worker6
deeplearning-worker7
deeplearning-worker8
deeplearning-worker9
deeplearning-worker10
deeplearning-worker11
```
- launch.py therefore starts one server and one worker process on each of the hosts. We use the same setup for all experiments, so each host participating in distributed training runs 1 worker and 1 server process.
- The commit id used for the experiment is `065adb3702c110af7b537799be3ec9c16c27a72b`.
- The experiment was run 6 times for each num_machines row, and the worker runtimes were averaged.
- The experiments were performed on a commit before the following PR was merged: https://github.com/apache/incubator-mxnet/pull/8111. One factor contributing to the variance in the number of batches between runs on a single worker is the reset() call made at the start before one full pass of the data iterator has finished, which leads to a different num_batches for different runs. The large variance in runtimes across workers in distributed training is caused by the data partitioning not being guaranteed to be even (https://github.com/apache/incubator-mxnet/pull/8111/files#diff-92c0e227fde4ecdd4d0aa9ac2a5529f9R225). See the full-dataset kdda experiment below: we still see a lot of variance among workers because the partitioning is not even.
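To make the partitioning effect concrete, here is a minimal sketch (not the MXNet implementation; the splitting rule and all numbers are hypothetical) of how uneven row shards translate into different per-epoch batch counts across workers:

```python
import math

def num_batches_per_worker(total_rows, num_workers, batch_size):
    """Split rows across workers (last worker absorbs the remainder)
    and return the per-epoch batch count for each worker."""
    base = total_rows // num_workers
    shards = [base] * num_workers
    shards[-1] += total_rows - base * num_workers  # uneven remainder
    # Each worker iterates only over its own shard, so a larger shard
    # can mean an extra batch per epoch.
    return [math.ceil(rows / batch_size) for rows in shards]

# Hypothetical sizes chosen only to show the imbalance.
print(num_batches_per_worker(total_rows=100, num_workers=3, batch_size=11))
# -> [3, 3, 4]: the worker holding the remainder runs one extra batch
```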
- NUM_MACHINES: the number of machines the experiment runs on
- SLOWEST RUNTIME AMONG WORKER: MAX(WORKERi end-to-end runtime) over all workers i
- WORKERi: the end-to-end runtime of WORKERi (for 10 epochs), averaged over multiple runs
- SPEEDUP for [num_machines=m]: (SLOWEST RUNTIME AMONG WORKER for num_machines=1) / (SLOWEST RUNTIME AMONG WORKER for num_machines=m)
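The SPEEDUP column of the first table below can be reproduced directly from the slowest-worker runtimes, e.g.:

```python
# Slowest-worker runtimes (seconds) taken from the first table.
slowest = {1: 287.2880573, 2: 174.1945976, 4: 89.76220971,
           8: 55.76332435, 16: 32.00695248}

def speedup(m):
    # speedup(m) = slowest runtime on 1 machine / slowest runtime on m machines
    return slowest[1] / slowest[m]

for m in sorted(slowest):
    print(f"{m:2d} machines: {speedup(m):.3f}x")
```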
NUM_MACHINES | SLOWEST RUNTIME AMONG WORKER(seconds) | SPEEDUP | WORKER0(seconds) | WORKER1(seconds) | WORKER2(seconds) | WORKER3(seconds) | WORKER4(seconds) | WORKER5(seconds) | WORKER6(seconds) | WORKER7(seconds) | WORKER8(seconds) | WORKER9(seconds) | WORKER10(seconds) | WORKER11(seconds) | WORKER12(seconds) | WORKER13(seconds) | WORKER14(seconds) | WORKER15(seconds) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 287.2880573 | 1 | 287.2880573 | |||||||||||||||
2 | 174.1945976 | 1.649236321 | 171.6125228 | 174.1945976 | ||||||||||||||
4 | 89.76220971 | 3.200545733 | 89.7212379 | 89.76220971 | 89.15910661 | 88.05660012 | ||||||||||||
8 | 55.76332435 | 5.151917692 | 42.84291658 | 55.76332435 | 42.58375612 | 43.24533948 | 44.04461628 | 41.84724376 | 43.5917654 | 41.57216197 | ||||||||
16 | 32.00695248 | 8.975801663 | 22.85826961 | 21.71531868 | 23.15168015 | 23.01040534 | 22.99628973 | 22.96320085 | 22.58721086 | 23.12547255 | 23.48303775 | 32.00695248 | 22.98324096 | 22.46027581 | 22.49633149 | 22.81446465 | 27.68898551 | 23.36792783 |
On a single worker, training takes 287 seconds, while on 16 workers the slowest worker takes 32 seconds. There is noticeable variance in worker runtimes, with the fastest worker finishing about 10 seconds before the slowest. The speedup is 9x on 16 workers. Possible causes of the imperfect scaling:
- the output dimension is large, leading to much more network traffic
- the dataset is much denser than the others
- we compute speedup from the slowest worker's runtime; using the median worker runtime instead, the speedup is about 12.5
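The median-based 12.5x figure can be checked against the 16-machine row of the table above:

```python
import statistics

# Worker runtimes (seconds) from the 16-machine row of the first table.
runtimes_16 = [22.85826961, 21.71531868, 23.15168015, 23.01040534,
               22.99628973, 22.96320085, 22.58721086, 23.12547255,
               23.48303775, 32.00695248, 22.98324096, 22.46027581,
               22.49633149, 22.81446465, 27.68898551, 23.36792783]
single_worker = 287.2880573  # slowest (only) worker runtime for 1 machine

# With an even number of workers, the median is the mean of the two middle values.
median_speedup = single_worker / statistics.median(runtimes_16)
print(f"median-based speedup: {median_speedup:.1f}x")  # -> 12.5x
```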
NUM_MACHINES | SLOWEST RUNTIME AMONG WORKER(seconds) | SPEEDUP | WORKER0(seconds) | WORKER1(seconds) | WORKER2(seconds) | WORKER3(seconds) | WORKER4(seconds) | WORKER5(seconds) | WORKER6(seconds) | WORKER7(seconds) | WORKER8(seconds) | WORKER9(seconds) | WORKER10(seconds) | WORKER11(seconds) | WORKER12(seconds) | WORKER13(seconds) | WORKER14(seconds) | WORKER15(seconds) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 320.4591443 | 1 | 320.4591443 | |||||||||||||||
2 | 134.2140288 | 2.38767249 | 134.2140288 | 133.7781797 | ||||||||||||||
4 | 68.54574005 | 4.675113932 | 66.73764873 | 68.54574005 | 67.78707397 | 67.79657364 | ||||||||||||
8 | 37.26959534 | 8.598406862 | 37.09379182 | 35.75921435 | 37.26959534 | 35.10171499 | 36.77896743 | 34.71712542 | 35.5227128 | 35.78200464 | ||||||||
16 | 28.23276761 | 11.35061035 | 22.65901732 | 22.82429329 | 22.53910393 | 23.29732177 | 23.63758925 | 22.87697506 | 22.94470572 | 28.23276761 | 21.31139401 | 20.89695154 | 21.73933918 | 20.36472532 | 20.60589845 | 21.196763 | 20.7089105 | 21.11429569 |
On a single worker, training takes 320 seconds, while on 16 workers the slowest worker takes 28 seconds. The difference between the fastest and slowest worker is close to 8 seconds. The speedup is 11.5x on 16 workers.
NUM_MACHINES | SLOWEST RUNTIME AMONG WORKER(seconds) | SPEEDUP | WORKER0(seconds) | WORKER1(seconds) | WORKER2(seconds) | WORKER3(seconds) | WORKER4(seconds) | WORKER5(seconds) | WORKER6(seconds) | WORKER7(seconds) | WORKER8(seconds) | WORKER9(seconds) | WORKER10(seconds) | WORKER11(seconds) | WORKER12(seconds) | WORKER13(seconds) | WORKER14(seconds) | WORKER15(seconds) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 322.3493165 | 1 | 322.3493165 | |||||||||||||||
2 | 152.2661681 | 2.117012075 | 152.2661681 | 56.0047285 | ||||||||||||||
4 | 78.78860804 | 4.091318841 | 78.78860804 | 31.11844766 | 29.74631359 | 30.14538327 | ||||||||||||
8 | 38.6655634 | 8.336858128 | 38.6655634 | 17.77295431 | 16.99732262 | 17.93868786 | 17.35414038 | 16.68588871 | 17.0220588 | 16.74347107 | ||||||||
16 | 20.69872398 | 15.57339075 | 20.69872398 | 16.74933181 | 11.99427773 | 12.32047569 | 12.54608685 | 12.17584972 | 12.66074888 | 12.0352035 | 11.99997986 | 11.96278872 | 11.72612387 | 11.86328058 | 12.02945746 | 11.76552521 | 11.72321774 | 12.07822296 |
For kdda we get the best speedup: 15x on 16 workers. The difference between the fastest and slowest worker for kdda is close to 9 seconds.
Full kdda dataset experiment (16 machines):
- WORKERi: the end-to-end runtime of WORKERi (for 10 epochs), averaged over multiple runs
WORKER0(seconds) | WORKER1(seconds) | WORKER2(seconds) | WORKER3(seconds) | WORKER4(seconds) | WORKER5(seconds) | WORKER6(seconds) | WORKER7(seconds) | WORKER8(seconds) | WORKER9(seconds) | WORKER10(seconds) | WORKER11(seconds) | WORKER12(seconds) | WORKER13(seconds) | WORKER14(seconds) | WORKER15(seconds) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
305.645803928 | 332.092835903 | 334.780326128 | 334.998679161 | 343.547188997 | 344.171431065 | 343.63071084 | 337.439732075 | 349.839564085 | 351.485404968 | 354.654353857 | 346.672199011 | 357.655499935 | 366.253351927 | 361.332320929 | 374.920572042 |
There is a huge variance among workers when training on the full kdda dataset: the slowest worker's runtime exceeds the fastest worker's by about 70 seconds.
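The ~70-second spread follows directly from the table above:

```python
# Worker runtimes (seconds) from the full-kdda table.
runtimes = [305.645803928, 332.092835903, 334.780326128, 334.998679161,
            343.547188997, 344.171431065, 343.63071084, 337.439732075,
            349.839564085, 351.485404968, 354.654353857, 346.672199011,
            357.655499935, 366.253351927, 361.332320929, 374.920572042]

spread = max(runtimes) - min(runtimes)
print(f"max - min = {spread:.1f} s")  # -> 69.3 s, i.e. close to 70 seconds
```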