Batch size issue in multi-GPU system #413

Closed · dccho opened this issue Nov 11, 2015 · 4 comments

dccho commented Nov 11, 2015

When the batch size is set via the edit box in the DIGITS UI, the train network and the test network get the same batch size.
If a given batch size makes the training network use 6 GB of GPU memory, the test network also uses 6 GB on the first GPU. This becomes a problem on multi-GPU systems: only the training networks are distributed across the other GPUs, so every GPU except the first can effectively use only about half of its memory. If the user increases the batch size, an out-of-memory error quickly occurs on the first GPU even though the other GPUs still have enough free memory.
Why not separate the batch size fields for training and test?
Another solution would be to add separate batch_size fields for training and test in a customized network. However, this behaves differently in NVIDIA/caffe v0.13 and v0.14: in v0.13 each network on each GPU gets (batch_size / #GPUs), while in v0.14 each network on each GPU gets the full batch_size specified in the customized network.
Could you provide a guideline document on batch_size for multi-GPU systems?
P.S. The prefetch field also behaves a little strangely, but that's not a big deal.
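
To make the imbalance concrete, here is a minimal sketch of the memory math described above (the function name and the per-sample memory cost are illustrative assumptions, not DIGITS or Caffe API):

```python
# Minimal sketch of the memory imbalance: the test network shares GPU 0 with
# a training network, while the other GPUs host a training network only.
# Numbers are illustrative (0.05 GB per sample), not measured values.

def gpu_memory_usage_gb(batch_size, num_gpus, gb_per_sample=0.05):
    """Approximate per-GPU memory when train and test nets share one batch size."""
    train_mem = batch_size * gb_per_sample
    test_mem = batch_size * gb_per_sample      # test net uses the same batch size
    usage = [train_mem] * num_gpus             # one training net per GPU
    usage[0] += test_mem                       # the test net runs on GPU 0 only
    return usage

print(gpu_memory_usage_gb(batch_size=120, num_gpus=4))
# [12.0, 6.0, 6.0, 6.0] -> GPU 0 hits out-of-memory first, even though
# GPUs 1-3 still have plenty of headroom.
```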

lukeyeager added the bug label Nov 11, 2015
lukeyeager (Member) commented

However, this behaves differently in NVIDIA/caffe v0.13 and v0.14: in v0.13 each network on each GPU gets (batch_size / #GPUs), while in v0.14 each network on each GPU gets the full batch_size specified in the customized network.

That's a good point. We do need to address this.

lukeyeager (Member) commented

We've known this was going to be a problem.

Despite knowing that by default this is weak scaling, e.g. the specified batch size in the train_val.prototxt is multiplied by the number of GPUs you choose to run on, I forgot that when validating accuracy graphs. I still fear that is going to bite users.
BVLC/caffe#2870 (comment)

One way to solve this in DIGITS would be to force the v0.13 behavior when using v0.14 - i.e. divide the batch size by the number of GPUs automatically.
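
A rough sketch of what that could look like on the DIGITS side, under the assumption that the division happens before the batch size is written into the prototxt (the function name and checks are illustrative, not existing DIGITS code):

```python
# Rough sketch of forcing v0.13-style strong scaling from DIGITS: divide the
# user-supplied batch size by the number of selected GPUs before writing it
# into the train_val.prototxt. Illustrative only, not existing DIGITS code.

def per_solver_batch_size(requested_batch_size, num_gpus):
    """Per-GPU (per-solver) batch size so the aggregate stays at the requested value."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if requested_batch_size < num_gpus:
        raise ValueError("batch size must be at least the number of GPUs")
    if requested_batch_size % num_gpus != 0:
        # Assumption: warn (or round down) when the batch size does not divide evenly.
        print("warning: batch size %d not divisible by %d GPUs, rounding down"
              % (requested_batch_size, num_gpus))
    return requested_batch_size // num_gpus

print(per_solver_batch_size(128, 4))  # 32 per GPU, aggregate stays 128
```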

Another thing we could do is put a warning next to the batch size field explaining the situation. Unfortunately, the standard networks would still run out of memory when moving to multi-GPU.

/cc @gheinrich @thatguymike

gheinrich (Contributor) commented

There is one possible downside to using the same batch size on all GPUs: you have to fit the least-capable GPU. Would it be conceivable to support several modes of operation, like the following (a sketch of the first two modes follows the list)?

  • even: batchSize(i) = totalBatchSize / N
  • heuristic: batchSize(i) = Mem(i) * totalBatchSize / Sum_i(Mem(i))
  • max: each GPU is assigned the maximum batch size it can support (sounds slightly scary, though, and it might make the learning rate impossible to control)
  • manual: the user manually enters the batch size to use on each GPU
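
Here is a minimal sketch of what the first two modes might compute, purely to illustrate the arithmetic (nothing here is existing DIGITS code):

```python
# Sketch of the "even" and "heuristic" allocation modes above; illustrative only.

def allocate_even(total_batch_size, num_gpus):
    """even: batchSize(i) = totalBatchSize / N"""
    base = total_batch_size // num_gpus
    sizes = [base] * num_gpus
    sizes[0] += total_batch_size - base * num_gpus   # give any remainder to GPU 0
    return sizes

def allocate_heuristic(total_batch_size, gpu_mem_gb):
    """heuristic: batchSize(i) = Mem(i) * totalBatchSize / Sum_i(Mem(i))"""
    total_mem = sum(gpu_mem_gb)
    sizes = [int(total_batch_size * mem / total_mem) for mem in gpu_mem_gb]
    sizes[0] += total_batch_size - sum(sizes)        # keep the aggregate exact
    return sizes

print(allocate_even(128, 4))                     # [32, 32, 32, 32]
print(allocate_heuristic(128, [12, 12, 6, 6]))   # [44, 42, 21, 21]
```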

Question on the learning rate that gets displayed in DIGITS: is it for the total aggregated batch size or for each GPU?

lukeyeager (Member) commented

Addressed for v0.14 with NVIDIA/caffe#78.

Change default behavior to strong scaling, i.e. divide the train_val batch size by the number of GPUs, which is the solver count. NOTE: this is not completely clean because of the way Caffe conflates the data-layer batch with the solver batch. We are making this change because of the astonishment caused by training behaving differently when the GPU count changes.

This change puts back the Caffe 0.13 behavior.
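
As a quick illustration of the behavioral difference (illustrative code, not Caffe internals): under strong scaling the aggregate batch size stays at the prototxt value regardless of GPU count, whereas under the previous weak scaling it grew with the number of GPUs.

```python
# Illustration of strong vs. weak scaling of the train_val batch size.
# Not Caffe code; it only shows the resulting aggregate batch size.

def aggregate_batch(prototxt_batch_size, num_gpus, scaling):
    if scaling == "strong":   # restored v0.13 behavior: prototxt value is split across solvers
        return prototxt_batch_size
    if scaling == "weak":     # previous v0.14 behavior: each solver uses the full prototxt value
        return prototxt_batch_size * num_gpus
    raise ValueError("unknown scaling mode: %s" % scaling)

for gpus in (1, 2, 4):
    print(gpus, aggregate_batch(128, gpus, "strong"), aggregate_batch(128, gpus, "weak"))
# strong: 128 for any GPU count; weak: 128, 256, 512 as GPUs are added
```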
