
Multi-GPU Data Parallelism (with Parallel Data Layers) #2903

Merged
merged 11 commits on Aug 13, 2015

Conversation

ronghanghu
Member

This is my packaging of #2870 (and, originally, #2114).

Modification: allow data layers (and also PythonLayer when used as a data layer) to be shared among the worker solvers' training nets, and also among test nets as future-proofing in case one wants to do multi-GPU testing. Data layers are locked during forward to ensure sequential forwarding. Now all worker solvers fetch data from one single data layer.
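A minimal standalone sketch of that idea, using only standard C++ threading primitives; the SharedDataLayer class and its members are invented for illustration and are not the actual Caffe classes:

```cpp
// One data layer object shared by several worker threads; forward is
// serialized by a mutex so the workers consume consecutive, distinct batches.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class SharedDataLayer {
 public:
  // Stand-in for a data layer's Forward(): the lock enforces sequential
  // access, so each caller receives the next batch rather than a duplicate.
  int Forward() {
    std::lock_guard<std::mutex> guard(forward_mutex_);
    return next_batch_++;
  }

 private:
  std::mutex forward_mutex_;  // plays the role of the lock added to forward
  int next_batch_ = 0;
};

int main() {
  SharedDataLayer layer;  // a single layer instance shared by all workers
  std::vector<int> got(4);
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&, i] { got[i] = layer.Forward(); });
  }
  for (auto& t : workers) t.join();
  for (int i = 0; i < 4; ++i) {
    std::cout << "worker " << i << " got batch " << got[i] << "\n";
  }
}
```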

This ensures that single-GPU training is consistent with multi-GPU training, and allows the tests in #2870 to pass. Otherwise, as in #2870 (#2114), multiple data layers are created, one per worker solver, and these data layers are unaware of each other. This can be a serious issue if one uses a deterministic data layer or turns off shuffling: since the data layer in each worker solver reads the same data, every solver eventually computes the same gradient, which is almost equivalent to multiplying the learning rate by the number of GPUs. That is definitely not the desired behavior of multi-GPU data parallelism, since each solver should train on a different subset of the dataset. Although #2114 provides a DataReader, it only applies to LevelDB and LMDB, and is hardly extensible to other data layers.
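A one-line sketch of the arithmetic behind the "same gradient on each solver" point, assuming the per-solver gradients are summed before the weight update (learning rate \(\eta\), \(K\) GPUs, per-solver gradients \(g_k\)):

```latex
w \;\leftarrow\; w - \eta \sum_{k=1}^{K} g_k
  \;=\; w - (K\eta)\, g
  \qquad \text{when } g_1 = g_2 = \dots = g_K = g .
```

That is, single-GPU SGD with the learning rate scaled by \(K\), rather than an update drawn from \(K\) different data subsets.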

DataReader is preserved in this PR, and the LMDB/LevelDB DataLayer is not shared.

TODOs

  • Add a ShareInParallel function to layer.hpp, data_layer.hpp and pythonlayer.hpp (see the sketch after this list).
  • Implement layer sharing during net construction, and construct the top blobs of shared layers.
  • Add a lock to forward in layer.hpp to lock layers.
  • Share layers during worker solver construction.
  • Remove DataReader and restore the old behavior of DataLayer. (Update: DataReader is kept.)
  • Test make runtest on a multi-GPU machine.
  • Test multi-GPU training on MNIST. (log: https://gist.github.com/ronghanghu/d66d63882c25b31b6148)
  • Test multi-GPU training on ILSVRC.
  • Fix the NVCC warning on boost/thread.hpp to get Travis CI to pass.
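A hedged sketch of how a ShareInParallel hook and construction-time sharing could fit together; the class names, the GetOrShare helper, and the map-based registry are invented for illustration, and the real layer.hpp / net construction interfaces may differ:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Base class standing in for caffe::Layer.
class Layer {
 public:
  virtual ~Layer() {}
  // Data-like layers (DataLayer, PythonLayer used as a data layer) opt in to
  // being shared across worker solvers' nets; other layers do not.
  virtual bool ShareInParallel() const { return false; }
};

class DataLayerLike : public Layer {
 public:
  bool ShareInParallel() const override { return true; }
};

class ConvLayerLike : public Layer {
 public:
  bool ShareInParallel() const override { return false; }
};

// During worker-net construction: reuse the root net's instance when the
// layer says it can be shared, otherwise keep the freshly created one.
std::shared_ptr<Layer> GetOrShare(
    const std::string& name,
    const std::map<std::string, std::shared_ptr<Layer>>& root_layers,
    std::shared_ptr<Layer> fresh) {
  auto it = root_layers.find(name);
  if (it != root_layers.end() && it->second->ShareInParallel()) {
    return it->second;  // the worker net points at the root solver's layer
  }
  return fresh;         // unshared layers get their own instance per worker
}

int main() {
  std::map<std::string, std::shared_ptr<Layer>> root = {
      {"data", std::make_shared<DataLayerLike>()},
      {"conv1", std::make_shared<ConvLayerLike>()}};
  auto data = GetOrShare("data", root, std::make_shared<DataLayerLike>());
  auto conv = GetOrShare("conv1", root, std::make_shared<ConvLayerLike>());
  std::cout << "data shared with root: " << (data == root.at("data")) << "\n";   // 1
  std::cout << "conv shared with root: " << (conv == root.at("conv1")) << "\n";  // 0
}
```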

Drawback

Multi-GPU training is numerically non-deterministic on data layers, except for the LMDB/LevelDB DataLayer; see #2903 (comment).

cypof and others added 9 commits August 9, 2015 15:13
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices
- Batch size is multiplied by the number of devices
- Split batches between GPUs, and tree-reduce the gradients (see the sketch after this list)
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'.
  Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
- Start with distant nodes in broadcast
- Fix outside loop to loop for full tree depth
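Below is an illustrative sketch of the "tree-reduce the gradients" step from the commit list above: per-device gradients are summed pairwise up a binary tree and the result is broadcast back down. It models each device's gradient as a single double and is not the actual P2PSync code:

```cpp
#include <iostream>
#include <vector>

int main() {
  // One scalar "gradient" per device, standing in for a full diff buffer.
  std::vector<double> grad = {0.1, 0.2, 0.3, 0.4};  // 4 devices
  const int n = static_cast<int>(grad.size());

  // Reduce: at stride s, device i accumulates the buffer of device i + s,
  // so each level halves the number of active senders.
  for (int s = 1; s < n; s *= 2) {
    for (int i = 0; i + s < n; i += 2 * s) {
      grad[i] += grad[i + s];
    }
  }

  // Device 0 now holds the full sum; broadcast it back down the same tree.
  for (int s = n / 2; s >= 1; s /= 2) {
    for (int i = 0; i + s < n; i += 2 * s) {
      grad[i + s] = grad[i];
    }
  }

  for (int i = 0; i < n; ++i) {
    std::cout << "device " << i << " gradient sum = " << grad[i] << "\n";
  }
}
```

In the real setup the pairs would be chosen according to the detected machine topology (twin-GPU boards, P2P connectivity), which this sketch ignores.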
@thatguymike
Contributor

Well, tests pass, but training runs seem to hang in the data prefetch queue. Not sure the new datareader code is behaving.

@ronghanghu
Member Author

@thatguymike I'll look into this issue shortly and see why training hangs. I expect to do a rebase tonight and test on my data with multiple GPUs.

@cypof
Member

cypof commented Aug 12, 2015

It's a great thing to get rid of the data reader and unify all data layer types. One thing I'm concerned about in this design, though, is the ordering of threads on the lock. It might not be absolutely required, but if we want runs to be reproducible at the numerical-precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is. Each run might see items distributed to the solvers differently. The gradient sum should be the same, but with slight differences, since items would have been added in a different order.

@ronghanghu
Member Author

Regarding Michael Houston's concern:

I wouldn’t be surprised if in the new code we are violating some internal assumption about LMDB thread access causing deadlocks, I have hit those before.

In this PR, a single DataLayer is shared among all worker solvers. Since data in LMDB/LevelDB is read in this DataLayer's prefetch thread rather than in the worker solver threads, the data prefetch behavior doesn't deviate from single-GPU training.
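A self-contained sketch of that arrangement, using std:: primitives rather than Caffe's BlockingQueue (class and variable names here are invented for illustration): one prefetch thread produces batches while several worker-solver threads consume them, so all database reading stays on the single prefetch thread:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class BlockingQueue {
 public:
  void Push(int batch) {
    { std::lock_guard<std::mutex> g(m_); q_.push(batch); }
    cv_.notify_one();
  }
  int Pop() {  // blocks while the queue is empty
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [&] { return !q_.empty(); });
    int batch = q_.front();
    q_.pop();
    return batch;
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<int> q_;
};

int main() {
  BlockingQueue prefetch;
  const int kBatches = 8, kSolvers = 4;

  // Single prefetch thread: the only place the "database" is read.
  std::thread producer([&] {
    for (int b = 0; b < kBatches; ++b) prefetch.Push(b);
  });

  // Worker solvers each consume kBatches / kSolvers batches.
  std::vector<std::thread> solvers;
  std::mutex print_mutex;
  for (int s = 0; s < kSolvers; ++s) {
    solvers.emplace_back([&, s] {
      for (int i = 0; i < kBatches / kSolvers; ++i) {
        int batch = prefetch.Pop();
        std::lock_guard<std::mutex> g(print_mutex);
        std::cout << "solver " << s << " consumed batch " << batch << "\n";
      }
    });
  }
  producer.join();
  for (auto& t : solvers) t.join();
}
```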

@ronghanghu
Member Author

It might not be absolutely required, but if we want runs to be reproducible at the numerical precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is.

@cypof I thought about this issue. However, I am not too concerned about it, since in general this PR produces more consistent, numerically identical results for all other data layers (except LevelDB/LMDB) than #2870 does.

In #2870 you'll get random behavior if a data layer supports and turns on shuffling, or otherwise an effectively multiplied learning rate (e.g. 4x with 4 GPUs). In both situations the behavior is clearly worse than this PR's, and deviates from single-GPU training with an increased batch size. The latter behavior also defeats the purpose of multi-GPU data parallelism.

@ronghanghu force-pushed the multi_gpu branch 3 times, most recently from 542e087 to 406448a on August 12, 2015 at 04:58
@ronghanghu
Member Author

Travis CI fails because NVCC generates a warning over boost/thread.hpp, which is included in layer.hpp (see the Travis CI build details):

/home/travis/miniconda/include/boost/thread/pthread/thread_data.hpp(42): warning: controlling expression is constant

@shelhamer any suggestions to fix/suppress this warning?

@ronghanghu
Member Author

@thatguymike I made some updates, removed the data reader, and successfully trained on MNIST. I am also training on ILSVRC-2012-CLS with this PR.

Can you test again? Since data in LMDB/LevelDB is read in the DataLayer prefetch thread rather than in the worker solver threads, the data prefetch behavior shouldn't deviate from single-GPU training.

@thatguymike
Contributor

Seems to work functionally, but scaling performance took a significant hit at 4 GPUs for AlexNet for some reason. Quite a significant slowdown.

@ronghanghu
Member Author

@thatguymike I'll look into this today.

@ronghanghu
Member Author

@thatguymike To be specific, are you seeing a lot of the following log messages?

I0812 08:57:39.468806 24173 blocking_queue.cpp:49] Data layer prefetch queue empty

@cypof
Member

cypof commented Aug 12, 2015

How many transform threads are created by the shared data layer?

@ronghanghu
Member Author

@cypof There should be only a single prefetch thread, in which the transform is performed. Only forward is executed from multiple solver threads, serialized via a lock.

@thatguymike looking into the drift issue you mentioned.

@thatguymike
Contributor

I am seeing a few notices of the data layer prefetch queue being empty that, in theory, I shouldn't be seeing. I don't see them with #2870 because I'm on fast SSDs and my LMDB should be in the kernel file cache.

@zxt881108

@thatguymike Thanks! Just now I used device IDs -gpu 0,2,4,6; the problem is partly solved, but the speedup ratio is still terrible (GoogleNet, quick_solver, mini-batch=64: device_id=0 takes 9 s per 20 iterations, device_id=0,2 takes 12 s, and device_id=0,2,4,6 takes 23 s). What speedup ratio do you get on the DIGITS DevBox (4 Titan X)? Our server is a Tyan B7079, the GPUs are Titan X, the CPUs are Intel E5-2650 v3 (x2), the memory is 32 GB DDR4 (x24), and the hard disks are all SSDs. It now seems there are still some problems with our server's system BIOS; we have called the manufacturer. Thanks again!

@thatguymike
Contributor

Remember that your effective batch size scales up as well, so your 2-device speedup doesn't look too bad, though it's clearly not great. Note from your P2P bandwidth test results that your server has about half the bandwidth between boards that the DIGITS DevBox has, so you are going to be much more communication-bound when scaling than some other systems. I will note that issues with scaling performance and performance stability are exactly why my team designed the DevBox the way we did. You can replicate most of our build from online documents if you wish.

You can try larger batches to see how your performance changes, but something is up with your server. You might want to check the server logs for PCIe errors and definitely check on the system BIOS. You can also systematically try different combinations of devices to see if you can find the fast and slow pairs, and then the fast and slow sets of four boards. Eight boards on that machine are not going to perform well with the current code, if ever, because you have to cross the PCIe bridge (especially as one of your links is only 1 GB/s according to your bandwidth test results).

You might also want to validate the scaling performance you achieve with AlexNet as there is more published work on that.

Also, running Titan Xs at that density in a server chassis is likely not going to behave the way you want in the long run without careful cooling design. (Note the modifications we had to make in the DIGITS DevBox to keep 4 Titan Xs thermally happy without crazy fan setups.)

@thatguymike
Contributor

Okay, here are my numbers for GoogleNet with cuDNN v3 on the DIGITS DevBox (X99-E WS chipset and 4x Titan X):

Weak scaling (default behavior of master)
1 GPU: 7.9 sec/20 iterations
2 GPU: 8.3 sec/20 iterations
4 GPU: 11.3 sec/20 iterations

My P2P bidirectional perf:

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 128.03 26.16 20.32 20.32
1 26.17 127.99 20.31 20.32
2 20.32 20.32 127.91 26.14
3 20.31 20.32 26.14 127.49

@zxt881108

@thatguymike Thanks for your suggestion. We have solved the P2P bandwidth problem between GPU IDs 0 and 1: the system BIOS version was too old, and after updating it the P2P bandwidth values look normal.

@eldar

eldar commented Aug 28, 2015

I tried it and got quite poor scaling:
1 GPU: 39 sec/20 iterations
2 GPUs: 67 sec/20 iterations
I use a custom data layer derived from WindowDataLayer, and my Caffe version is the latest from master. How can I profile what's going on?

@erogol
Contributor

erogol commented Sep 10, 2015

Is the test iteration also distributed across GPUs?

@zxxmac

zxxmac commented Sep 10, 2015

I want to run image prediction with caffe-window, but the result is the same for every different image. I don't know how to do prediction correctly.


@ronghanghu
Member Author

Test iterations run on a single GPU.

@zxxmac

zxxmac commented Sep 10, 2015

Yes, it is a single GPU.


@weiliu89

@ronghanghu This PR is great!! Any hints on how to modify the code to do testing on multiple GPUs as well?

@alfredox10

Does this enable multi-GPU detection when executing?
prediction = net.forward()
