
Multi-GPU Data Parallelism (with Parallel Data Layers) #2903

Merged
merged 11 commits on Aug 13, 2015

Conversation

ronghanghu
Member

This is my packaging of #2870 (and, originally, #2114).

Modification: allow data layers (and also PythonLayer when used as a data layer) to be shared among the worker solvers' training nets, and also among test nets as future-proofing in case one wants to do multi-GPU testing. Data layers are locked during forward to ensure sequential forwarding. Now all worker solvers fetch data from one single data layer.
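A minimal standalone sketch of that idea, using only standard C++ threading primitives; the SharedDataLayer class and its members are invented for illustration and are not the actual Caffe classes:

```cpp
// One data layer object shared by several worker threads; forward is
// serialized by a mutex so the workers consume consecutive, distinct batches.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class SharedDataLayer {
 public:
  // Stand-in for a data layer's Forward(): the lock enforces sequential
  // access, so each caller receives the next batch rather than a duplicate.
  int Forward() {
    std::lock_guard<std::mutex> guard(forward_mutex_);
    return next_batch_++;
  }

 private:
  std::mutex forward_mutex_;  // plays the role of the lock added to forward
  int next_batch_ = 0;
};

int main() {
  SharedDataLayer layer;  // a single layer instance shared by all workers
  std::vector<int> got(4);
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&, i] { got[i] = layer.Forward(); });
  }
  for (auto& t : workers) t.join();
  for (int i = 0; i < 4; ++i) {
    std::cout << "worker " << i << " got batch " << got[i] << "\n";
  }
}
```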

This ensures that single-GPU training is consistent with multi-GPU training, and allows the tests in #2870 to pass. Otherwise, as in #2870 (#2114), multiple data layers are created, one per worker solver, and these data layers are unaware of each other. This can be a serious issue if one uses a deterministic data layer or turns off shuffling: since the data layer in each worker solver reads the same data, every solver eventually computes the same gradient, which is almost equivalent to multiplying the learning rate by the number of GPUs. That is definitely not the desired behavior of multi-GPU data parallelism, since each solver should train on a different subset of the dataset. Although #2114 provides a DataReader, it only applies to LevelDB and LMDB, and is hardly extensible to other data layers.
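A one-line sketch of the arithmetic behind the "same gradient on each solver" point, assuming the per-solver gradients are summed before the weight update (learning rate \(\eta\), \(K\) GPUs, per-solver gradients \(g_k\)):

```latex
w \;\leftarrow\; w - \eta \sum_{k=1}^{K} g_k
  \;=\; w - (K\eta)\, g
  \qquad \text{when } g_1 = g_2 = \dots = g_K = g .
```

That is, single-GPU SGD with the learning rate scaled by \(K\), rather than an update drawn from \(K\) different data subsets.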

DataReader is preserved in this PR, and the LMDB/LevelDB DataLayer is not shared.

TODOs

  • Add a ShareInParallel function to layer.hpp, data_layer.hpp and pythonlayer.hpp (see the sketch after this list).
  • Implement layer sharing during net construction, and construct the top blobs of shared layers.
  • Add a lock to forward in layer.hpp to lock layers.
  • Share layers during worker solver construction.
  • Remove DataReader and restore the old behavior of DataLayer. (Update: DataReader is kept.)
  • Test make runtest on a multi-GPU machine.
  • Test multi-GPU training on MNIST. (log: https://gist.github.com/ronghanghu/d66d63882c25b31b6148)
  • Test multi-GPU training on ILSVRC.
  • Fix the NVCC warning on boost/thread.hpp to get Travis CI to pass.
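A hedged sketch of how a ShareInParallel hook and construction-time sharing could fit together; the class names, the GetOrShare helper, and the map-based registry are invented for illustration, and the real layer.hpp / net construction interfaces may differ:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Base class standing in for caffe::Layer.
class Layer {
 public:
  virtual ~Layer() {}
  // Data-like layers (DataLayer, PythonLayer used as a data layer) opt in to
  // being shared across worker solvers' nets; other layers do not.
  virtual bool ShareInParallel() const { return false; }
};

class DataLayerLike : public Layer {
 public:
  bool ShareInParallel() const override { return true; }
};

class ConvLayerLike : public Layer {
 public:
  bool ShareInParallel() const override { return false; }
};

// During worker-net construction: reuse the root net's instance when the
// layer says it can be shared, otherwise keep the freshly created one.
std::shared_ptr<Layer> GetOrShare(
    const std::string& name,
    const std::map<std::string, std::shared_ptr<Layer>>& root_layers,
    std::shared_ptr<Layer> fresh) {
  auto it = root_layers.find(name);
  if (it != root_layers.end() && it->second->ShareInParallel()) {
    return it->second;  // the worker net points at the root solver's layer
  }
  return fresh;         // unshared layers get their own instance per worker
}

int main() {
  std::map<std::string, std::shared_ptr<Layer>> root = {
      {"data", std::make_shared<DataLayerLike>()},
      {"conv1", std::make_shared<ConvLayerLike>()}};
  auto data = GetOrShare("data", root, std::make_shared<DataLayerLike>());
  auto conv = GetOrShare("conv1", root, std::make_shared<ConvLayerLike>());
  std::cout << "data shared with root: " << (data == root.at("data")) << "\n";   // 1
  std::cout << "conv shared with root: " << (conv == root.at("conv1")) << "\n";  // 0
}
```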

Drawback

Multi-GPU training is numerically non-deterministic on data layers, except for the LMDB/LevelDB DataLayer; see #2903 (comment).

cypof and others added 9 commits August 9, 2015 15:13
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices
- Batch size is multiplied by the number of devices
- Split batches between GPUs, and tree-reduce the gradients (see the sketch after this list)
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'.
  Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
- Start with distant nodes in broadcast
- Fix outside loop to loop for full tree depth
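Below is an illustrative sketch of the "tree-reduce the gradients" step from the commit list above: per-device gradients are summed pairwise up a binary tree and the result is broadcast back down. It models each device's gradient as a single double and is not the actual P2PSync code:

```cpp
#include <iostream>
#include <vector>

int main() {
  // One scalar "gradient" per device, standing in for a full diff buffer.
  std::vector<double> grad = {0.1, 0.2, 0.3, 0.4};  // 4 devices
  const int n = static_cast<int>(grad.size());

  // Reduce: at stride s, device i accumulates the buffer of device i + s,
  // so each level halves the number of active senders.
  for (int s = 1; s < n; s *= 2) {
    for (int i = 0; i + s < n; i += 2 * s) {
      grad[i] += grad[i + s];
    }
  }

  // Device 0 now holds the full sum; broadcast it back down the same tree.
  for (int s = n / 2; s >= 1; s /= 2) {
    for (int i = 0; i + s < n; i += 2 * s) {
      grad[i + s] = grad[i];
    }
  }

  for (int i = 0; i < n; ++i) {
    std::cout << "device " << i << " gradient sum = " << grad[i] << "\n";
  }
}
```

In the real setup the pairs would be chosen according to the detected machine topology (twin-GPU boards, P2P connectivity), which this sketch ignores.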
@thatguymike
Contributor

Well, tests pass, but training runs seem to hang in the data prefetch queue. Not sure the new datareader code is behaving.

@ronghanghu
Member Author

@thatguymike I'll look into this issue shortly and see why training hangs. I expect to do a rebase tonight and test on my data with multiple GPUs.

@cypof
Member

cypof commented Aug 12, 2015

It's a great thing to get rid of the data reader and unify all data layer types. One thing I'm concerned about in this design, though, is the ordering of threads on the lock. It might not be absolutely required, but if we want runs to be reproducible at the numerical-precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is. Each run might see items distributed to the solvers differently. The gradient sum should be the same, but with slight differences, since items would have been added in a different order.

@ronghanghu
Member Author

Regarding Michael Houston's concern:

I wouldn’t be surprised if in the new code we are violating some internal assumption about LMDB thread access causing deadlocks, I have hit those before.

In this PR, a single DataLayer is shared among all worker solvers. Since data in LMDB/LevelDB is read in this DataLayer's prefetch thread rather than in the worker solver threads, the data prefetch behavior doesn't deviate from single-GPU training.
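A self-contained sketch of that arrangement, using std:: primitives rather than Caffe's BlockingQueue (class and variable names here are invented for illustration): one prefetch thread produces batches while several worker-solver threads consume them, so all database reading stays on the single prefetch thread:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class BlockingQueue {
 public:
  void Push(int batch) {
    { std::lock_guard<std::mutex> g(m_); q_.push(batch); }
    cv_.notify_one();
  }
  int Pop() {  // blocks while the queue is empty
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [&] { return !q_.empty(); });
    int batch = q_.front();
    q_.pop();
    return batch;
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<int> q_;
};

int main() {
  BlockingQueue prefetch;
  const int kBatches = 8, kSolvers = 4;

  // Single prefetch thread: the only place the "database" is read.
  std::thread producer([&] {
    for (int b = 0; b < kBatches; ++b) prefetch.Push(b);
  });

  // Worker solvers each consume kBatches / kSolvers batches.
  std::vector<std::thread> solvers;
  std::mutex print_mutex;
  for (int s = 0; s < kSolvers; ++s) {
    solvers.emplace_back([&, s] {
      for (int i = 0; i < kBatches / kSolvers; ++i) {
        int batch = prefetch.Pop();
        std::lock_guard<std::mutex> g(print_mutex);
        std::cout << "solver " << s << " consumed batch " << batch << "\n";
      }
    });
  }
  producer.join();
  for (auto& t : solvers) t.join();
}
```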

@ronghanghu
Member Author

It might not be absolutely required, but if we want runs to be reproducible at the numerical precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is.

@cypof I thought about this issue. However, I am not too concerned about it, since in general this PR produces more consistent, numerically identical results for all other data layers (except LevelDB/LMDB) than #2870 does.

In #2870 you'll get random behavior if a data layer supports and turns on shuffling, or otherwise an effectively multiplied learning rate (e.g. 4x with 4 GPUs). In both situations the behavior is clearly worse than this PR's, and deviates from single-GPU training with an increased batch size. The latter behavior also defeats the purpose of multi-GPU data parallelism.

@ronghanghu force-pushed the multi_gpu branch 3 times, most recently from 542e087 to 406448a on August 12, 2015 at 04:58
@ronghanghu
Member Author

Travis CI fails because NVCC generates a warning over boost/thread.hpp, which is included in layer.hpp (see the Travis CI build details):

/home/travis/miniconda/include/boost/thread/pthread/thread_data.hpp(42): warning: controlling expression is constant

@shelhamer any suggestions to fix/suppress this warning?

@ronghanghu
Member Author

@thatguymike I made some updates, removed the data reader, and successfully trained on MNIST. I am also training on ILSVRC-2012-CLS with this PR.

Can you test again? Since data in LMDB/LevelDB is read in the DataLayer prefetch thread rather than in the worker solver threads, the data prefetch behavior shouldn't deviate from single-GPU training.

@thatguymike
Contributor

Seems to work functionally, but scaling performance took a significant hit at 4 GPUs for AlexNet for some reason. Quite a significant slowdown.

@ronghanghu
Member Author

@thatguymike I'll look into this today.

@ronghanghu
Member Author

@thatguymike To be specific, are you seeing a lot of the following log messages?

I0812 08:57:39.468806 24173 blocking_queue.cpp:49] Data layer prefetch queue empty

@cypof
Member

cypof commented Aug 12, 2015

How many transform threads are created by the shared data layer?

@ronghanghu
Member Author

@cypof There should be only a single prefetch thread, in which the transform is performed. Only forward is executed from multiple solver threads, serialized via a lock.

@thatguymike looking into the drift issue you mentioned.

@thatguymike
Contributor

I am seeing a few notices of the data layer prefetch queue being empty that, in theory, I shouldn't be seeing. I don't see them with #2870 because I'm on fast SSDs and my LMDB should be in the kernel file cache.

@zxt881108

@thatguymike Thanks! Just now I used device IDs -gpu 0,2,4,6; the problem is partly solved, but the speedup ratio is still terrible (GoogleNet, quick_solver, mini-batch=64: device_id=0 takes 9 s per 20 iterations, device_id=0,2 takes 12 s, and device_id=0,2,4,6 takes 23 s). What speedup ratio do you get on the DIGITS DevBox (4 Titan X)? Our server is a Tyan B7079, the GPUs are Titan X, the CPUs are Intel E5-2650 v3 (x2), the memory is 32 GB DDR4 (x24), and the hard disks are all SSDs. It now seems there are still some problems with our server's system BIOS; we have called the manufacturer. Thanks again!

@thatguymike
Contributor

Remember that your effective batch size scales up as well, so your 2-device speedup doesn't look too bad, though it's clearly not great. Note from your P2P bandwidth test results that your server has about half the bandwidth between boards that the DIGITS DevBox has, so you are going to be much more communication-bound when scaling than some other systems. I will note that issues with scaling performance and performance stability are exactly why my team designed the DevBox the way we did. You can replicate most of our build from online documents if you wish.

You can try larger batches to see how your performance changes, but something is up with your server. You might want to check the server logs for PCIe errors and definitely check on the system BIOS. You can also systematically try different combinations of devices to see if you can find the fast and slow pairs, and then the fast and slow sets of four boards. Eight boards on that machine are not going to perform well with the current code, if ever, because you have to cross the PCIe bridge (especially as one of your links is only 1 GB/s according to your bandwidth test results).

You might also want to validate the scaling performance you achieve with AlexNet as there is more published work on that.

Also, running Titan Xs at that density in a server chassis is likely not going to behave the way you want in the long run without careful cooling design. (Note the modifications we had to make in the DIGITS DevBox to keep 4 Titan Xs thermally happy without crazy fan setups.)

@thatguymike
Contributor

Okay, here are my numbers for GoogleNet with cuDNN v3 on the DIGITS DevBox (X99-E WS chipset and 4x Titan X):

Weak scaling (default behavior of master)
1 GPU: 7.9 sec/20 iterations
2 GPU: 8.3 sec/20 iterations
4 GPU: 11.3 sec/20 iterations

My P2P bidirectional perf:

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 128.03 26.16 20.32 20.32
1 26.17 127.99 20.31 20.32
2 20.32 20.32 127.91 26.14
3 20.31 20.32 26.14 127.49

@zxt881108

@thatguymike Thanks for your suggestion. We have solved the P2P bandwidth problem between GPU IDs 0 and 1: the system BIOS version was too old, and after updating it the P2P bandwidth values look normal.

@eldar

eldar commented Aug 28, 2015

I tried it and got quite poor scaling:
1 GPU: 39 sec/20 iterations
2 GPUs: 67 sec/20 iterations
I use a custom data layer derived from WindowDataLayer, and my Caffe version is the latest from master. How can I profile what's going on?

@erogol
Contributor

erogol commented Sep 10, 2015

Is the test iteration also distributed across GPUs?

@zxxmac

zxxmac commented Sep 10, 2015

I want to run image prediction with caffe-window, but the result is the same for every different image. I don't know how to do prediction correctly.


@ronghanghu
Member Author

Test iterations run on a single GPU.

@zxxmac

zxxmac commented Sep 10, 2015

Yes, it is a single GPU.


@weiliu89

@ronghanghu This PR is great!! Any hints on how to modify the code to do testing on multiple GPUs as well?

@alfredox10

Does this enable multi-GPU detection when executing?
prediction = net.forward()
