Batch size issue in multi-GPU system #413

Closed · dccho opened this issue Nov 11, 2015 · 4 comments

dccho commented Nov 11, 2015

When the batch size is set via the edit box in the DIGITS UI, the train network and the test network get the same batch size.
If a given batch size makes the training network use 6 GB of GPU memory, the test network also uses 6 GB on the first GPU. This becomes a problem on multi-GPU systems: only the training networks are distributed across the other GPUs, so every GPU except the first can effectively use only about half of its memory. If the user increases the batch size, an out-of-memory error quickly occurs on the first GPU even though the other GPUs still have enough free memory.
Why not separate the batch size fields for training and test?
Another solution would be to add separate batch_size fields for training and test in a customized network. However, this behaves differently in NVIDIA/caffe v0.13 and v0.14: in v0.13 each network on each GPU gets (batch_size / #GPUs), while in v0.14 each network on each GPU gets the full batch_size specified in the customized network.
Could you provide a guideline document on batch_size for multi-GPU systems?
P.S. The prefetch field also behaves a little strangely, but that's not a big deal.
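
To make the imbalance concrete, here is a minimal sketch of the memory math described above (the function name and the per-sample memory cost are illustrative assumptions, not DIGITS or Caffe API):

```python
# Minimal sketch of the memory imbalance: the test network shares GPU 0 with
# a training network, while the other GPUs host a training network only.
# Numbers are illustrative (0.05 GB per sample), not measured values.

def gpu_memory_usage_gb(batch_size, num_gpus, gb_per_sample=0.05):
    """Approximate per-GPU memory when train and test nets share one batch size."""
    train_mem = batch_size * gb_per_sample
    test_mem = batch_size * gb_per_sample      # test net uses the same batch size
    usage = [train_mem] * num_gpus             # one training net per GPU
    usage[0] += test_mem                       # the test net runs on GPU 0 only
    return usage

print(gpu_memory_usage_gb(batch_size=120, num_gpus=4))
# [12.0, 6.0, 6.0, 6.0] -> GPU 0 hits out-of-memory first, even though
# GPUs 1-3 still have plenty of headroom.
```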

lukeyeager added the bug label Nov 11, 2015
lukeyeager (Member) commented

However, this behaves differently in NVIDIA/caffe v0.13 and v0.14: in v0.13 each network on each GPU gets (batch_size / #GPUs), while in v0.14 each network on each GPU gets the full batch_size specified in the customized network.

That's a good point. We do need to address this.

lukeyeager (Member) commented

We've known this was going to be a problem.

Despite knowing that by default this is weak scaling, e.g. the specified batch size in the train_val.prototxt is multiplied by the number of GPUs you choose to run on, I forgot that when validating accuracy graphs. I still fear that is going to bite users.
BVLC/caffe#2870 (comment)

One way to solve this in DIGITS would be to force the v0.13 behavior when using v0.14 - i.e. divide the batch size by the number of GPUs automatically.
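
A rough sketch of what that could look like on the DIGITS side, under the assumption that the division happens before the batch size is written into the prototxt (the function name and checks are illustrative, not existing DIGITS code):

```python
# Rough sketch of forcing v0.13-style strong scaling from DIGITS: divide the
# user-supplied batch size by the number of selected GPUs before writing it
# into the train_val.prototxt. Illustrative only, not existing DIGITS code.

def per_solver_batch_size(requested_batch_size, num_gpus):
    """Per-GPU (per-solver) batch size so the aggregate stays at the requested value."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if requested_batch_size < num_gpus:
        raise ValueError("batch size must be at least the number of GPUs")
    if requested_batch_size % num_gpus != 0:
        # Assumption: warn (or round down) when the batch size does not divide evenly.
        print("warning: batch size %d not divisible by %d GPUs, rounding down"
              % (requested_batch_size, num_gpus))
    return requested_batch_size // num_gpus

print(per_solver_batch_size(128, 4))  # 32 per GPU, aggregate stays 128
```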

Another thing we could do is put a warning next to the batch size field explaining the situation. Unfortunately, the standard networks would still run out of memory when moving to multi-GPU.

/cc @gheinrich @thatguymike

gheinrich (Contributor) commented

There is one possible downside to using the same batch size on all GPUs: you have to fit the least-capable GPU. Would it be conceivable to support several modes of operation, like the following (a sketch of the first two modes follows the list)?

  • even: batchSize(i) = totalBatchSize / N
  • heuristic: batchSize(i) = Mem(i) * totalBatchSize / Sum_i(Mem(i))
  • max: each GPU is assigned the maximum batch size it can support (sounds slightly scary, though, and it might make the learning rate impossible to control)
  • manual: the user manually enters the batch size to use on each GPU
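
Here is a minimal sketch of what the first two modes might compute, purely to illustrate the arithmetic (nothing here is existing DIGITS code):

```python
# Sketch of the "even" and "heuristic" allocation modes above; illustrative only.

def allocate_even(total_batch_size, num_gpus):
    """even: batchSize(i) = totalBatchSize / N"""
    base = total_batch_size // num_gpus
    sizes = [base] * num_gpus
    sizes[0] += total_batch_size - base * num_gpus   # give any remainder to GPU 0
    return sizes

def allocate_heuristic(total_batch_size, gpu_mem_gb):
    """heuristic: batchSize(i) = Mem(i) * totalBatchSize / Sum_i(Mem(i))"""
    total_mem = sum(gpu_mem_gb)
    sizes = [int(total_batch_size * mem / total_mem) for mem in gpu_mem_gb]
    sizes[0] += total_batch_size - sum(sizes)        # keep the aggregate exact
    return sizes

print(allocate_even(128, 4))                     # [32, 32, 32, 32]
print(allocate_heuristic(128, [12, 12, 6, 6]))   # [44, 42, 21, 21]
```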

Question on the learning rate that gets displayed in DIGITS: is it for the total aggregated batch size or for each GPU?

lukeyeager (Member) commented

Addressed for v0.14 with NVIDIA/caffe#78.

Change default behavior to strong scaling, i.e. divide the train_val batch size by the number of GPUs, which is the solver count. NOTE: this is not completely clean because of the way Caffe conflates the data-layer batch with the solver batch. We are making this change because of the astonishment caused by training behaving differently when the GPU count changes.

This change puts back the Caffe 0.13 behavior.
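
As a quick illustration of the behavioral difference (illustrative code, not Caffe internals): under strong scaling the aggregate batch size stays at the prototxt value regardless of GPU count, whereas under the previous weak scaling it grew with the number of GPUs.

```python
# Illustration of strong vs. weak scaling of the train_val batch size.
# Not Caffe code; it only shows the resulting aggregate batch size.

def aggregate_batch(prototxt_batch_size, num_gpus, scaling):
    if scaling == "strong":   # restored v0.13 behavior: prototxt value is split across solvers
        return prototxt_batch_size
    if scaling == "weak":     # previous v0.14 behavior: each solver uses the full prototxt value
        return prototxt_batch_size * num_gpus
    raise ValueError("unknown scaling mode: %s" % scaling)

for gpus in (1, 2, 4):
    print(gpus, aggregate_batch(128, gpus, "strong"), aggregate_batch(128, gpus, "weak"))
# strong: 128 for any GPU count; weak: 128, 256, 512 as GPUs are added
```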
