
Non-deterministic data reading of image_data_layer in parallel training #4590

Closed · AIROBOTAI opened this issue Aug 14, 2016 · 3 comments
@AIROBOTAI
Hi all, I have a question about deterministic batch input in image_data_layer when doing parallel training. Suppose we have a dataset that contains only four batches, named A, B, C, D, and we train with 4 solvers (S1, S2, S3, S4) on 4 GPUs. Suppose also that the dataset is not randomly shuffled during training. I have checked the implementation of BasePrefetchingDataLayer and found that it only guarantees that the solvers get their input batches sequentially, not in a fixed order. So I wonder whether we may encounter the following problem: at the T-th iteration, the input batches for S1–S4 may be A, B, C, D, respectively, but at the next iteration the input batches for S1–S4 might well become B, C, A, D or something else. Such non-deterministic behavior may be dangerous in some cases. Could anyone kindly tell me whether my doubt above is correct?
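To make the scenario concrete, here is a minimal sketch (plain C++ threads, not Caffe's actual code) of the race described above: several prefetch threads pull from one shared, ordered source, so the batches come out sequentially, but which solver receives which batch depends only on which thread happens to lock the source first.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  const char* batches = "ABCD";   // the four batches A, B, C, D
  int next = 0;                   // index of the next unread batch
  std::mutex source_mutex;        // guards the shared data source

  auto prefetch = [&](int solver_id) {
    char batch;
    {
      std::lock_guard<std::mutex> lock(source_mutex);
      batch = batches[next++ % 4];  // sequential, but not tied to solver_id
    }
    std::printf("solver S%d got batch %c\n", solver_id + 1, batch);
  };

  std::vector<std::thread> solvers;
  for (int i = 0; i < 4; ++i) solvers.emplace_back(prefetch, i);
  for (auto& t : solvers) t.join();  // the printed mapping varies run to run
  return 0;
}
```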

Besides, could anyone please explain why the declarations "using Params::size_; using Params::data_; using Params::diff_;" appear in the definitions of the classes GPUParams and P2PSync (defined in parallel.hpp)? As I understand it, using-declarations are generally used to resolve base-class members being shadowed in a derived class, which does not seem to be the case for GPUParams and P2PSync. Therefore, I wonder whether these declarations are necessary.

Thanks in advance!

@cypof (Member) commented Aug 15, 2016

Currently only the LMDB and LevelDB layers are deterministic, by going through data_reader. We are trying to fix the other layers by switching to a skip approach like in #4563. The using statements are only there to avoid having to type this-> every time: C++ allows direct access to protected fields of a base class, but in a class template, unqualified name lookup does not search a base class that depends on a template parameter.
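As a stripped-down illustration of that lookup rule (hypothetical class bodies, not the actual parallel.hpp definitions): without the using-declarations, the unqualified name size_ below would not compile, because the base Params&lt;Dtype&gt; is a dependent base class.

```cpp
#include <cstddef>

template <typename Dtype>
class Params {
 protected:
  size_t size_;
  Dtype* data_;
};

template <typename Dtype>
class GPUParams : public Params<Dtype> {
 protected:
  using Params<Dtype>::size_;  // without these, the members below would
  using Params<Dtype>::data_;  // have to be spelled this->size_, etc.

 public:
  size_t bytes() const { return size_ * sizeof(Dtype); }
};

int main() {
  GPUParams<float> p;  // instantiation forces the name lookup to resolve
  (void)p;
  return 0;
}
```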

@AIROBOTAI (Author)

@cypof Thanks a lot for your kind help! Got it now.

@shelhamer (Member)

The determinism of data loading is fixed by the new parallelism in #4563.
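For reference, a minimal sketch of the skip idea (not the actual #4563 implementation): each solver advances through the same ordered source but keeps only every N-th item starting at its own rank, so the batch-to-solver mapping is fixed regardless of thread scheduling.

```cpp
#include <cstdio>

int main() {
  const char* data = "ABCDABCD";  // the dataset, cycled in a fixed order
  const int num_solvers = 4;

  for (int rank = 0; rank < num_solvers; ++rank) {
    // Solver `rank` reads positions rank, rank + N, rank + 2N, ...
    std::printf("solver S%d reads:", rank + 1);
    for (int pos = rank; pos < 8; pos += num_solvers)
      std::printf(" %c", data[pos]);
    std::printf("\n");
  }
  return 0;  // every run prints the same batch-to-solver assignment
}
```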
