Batch size issue in multi-GPU system #413
That's a good point. We do need to address this.
We've known this was going to be a problem.
One way to solve this in DIGITS would be to force the v0.13 behavior when using v0.14 - i.e. divide the batch size by the number of GPUs automatically. Another thing we could do is put a warning next to the batch size field explaining the situation. Unfortunately, the standard networks would still run out of memory when moving to multi-GPU.
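For illustration, here is a minimal Python sketch of that v0.13-style behavior, where the batch size entered in the UI is treated as a total and divided across the available GPUs. This is not the actual DIGITS code; the function name `split_batch_size` is hypothetical.

```python
def split_batch_size(total_batch_size, num_gpus):
    """Split a total batch size evenly across GPUs.

    Any remainder is handed out one sample at a time to the first GPUs,
    so the per-GPU sizes always sum to the total.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    base, remainder = divmod(total_batch_size, num_gpus)
    return [base + (1 if i < remainder else 0) for i in range(num_gpus)]


if __name__ == "__main__":
    # A UI batch size of 128 on 4 GPUs becomes 32 per GPU (v0.13 behavior)
    # instead of 128 per GPU (v0.14 behavior), which is what exhausts memory.
    print(split_batch_size(128, 4))  # [32, 32, 32, 32]
    print(split_batch_size(100, 3))  # [34, 33, 33]
```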
There is possibly one downside to using the same batch size on all GPUs: you need to fit to the least-capable GPU. Is it conceivable to support several modes of operation like:
Question on the learning rate that gets displayed in DIGITS: is it for the total aggregated batch size or for each GPU?
Addressed for v0.14 with NVIDIA/caffe#78.
When the batch size is set in the edit box in the DIGITS UI, the train network and the test network get the same batch size.
If a given batch size makes the training network use 6GB of GPU memory, then the test network also uses 6GB of GPU memory on the first GPU. This becomes a problem when using multiple GPUs: only the training networks are assigned to the other GPUs, so those GPUs can effectively use only half of their available memory. If the user increases the batch size, an out-of-memory error easily occurs on the first GPU even though the other GPUs still have enough free memory.
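As a rough illustration of that imbalance (all numbers below are hypothetical and the linear memory model is a simplification), the first GPU holds both the train and the test network, while the other GPUs hold only a train network:

```python
# Toy memory model: assume a training batch of 64 costs ~6 GB and the test
# network costs about the same per sample. All figures are hypothetical.
GPU_MEMORY_GB = 12.0
TRAIN_GB_PER_SAMPLE = 6.0 / 64
TEST_GB_PER_SAMPLE = 6.0 / 64


def first_gpu_usage_gb(batch_size):
    # GPU 0 hosts both the train and the test network.
    return batch_size * (TRAIN_GB_PER_SAMPLE + TEST_GB_PER_SAMPLE)


def other_gpu_usage_gb(batch_size):
    # GPUs 1..N-1 host only a train network.
    return batch_size * TRAIN_GB_PER_SAMPLE


if __name__ == "__main__":
    for bs in (64, 96, 128):
        print(bs, first_gpu_usage_gb(bs), other_gpu_usage_gb(bs))
    # At batch_size=128 the first GPU would need ~24 GB while the others need
    # ~12 GB, so GPU 0 hits out-of-memory long before the rest are full.
```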
Why don't you separate the batch size fields for training and test?
Another solution would be to add separate batch_size fields for training and test in the customized network. However, this behaves differently in NVIDIA/caffe v0.13 and v0.14: in v0.13 the network on each GPU gets a batch size of (batch_size / #GPUs), while in v0.14 the network on each GPU uses the full batch_size specified in the customized network.
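To make the v0.13 vs. v0.14 difference concrete, here is a small hypothetical Python helper (`effective_batch_sizes` is not part of DIGITS or NVIDIA/caffe) that computes the per-GPU and total batch size under each behavior:

```python
def effective_batch_sizes(batch_size, num_gpus, caffe_version):
    """Return (per-GPU batch size, total samples per iteration)."""
    if caffe_version == "0.13":
        # v0.13: the batch_size in the network description is split across GPUs.
        per_gpu = batch_size // num_gpus
    elif caffe_version == "0.14":
        # v0.14: every GPU runs the full batch_size from the network description.
        per_gpu = batch_size
    else:
        raise ValueError("unknown version: %s" % caffe_version)
    return per_gpu, per_gpu * num_gpus


if __name__ == "__main__":
    # batch_size = 128 in the customized network, 4 GPUs:
    print(effective_batch_sizes(128, 4, "0.13"))  # (32, 128)  -- same total as single GPU
    print(effective_batch_sizes(128, 4, "0.14"))  # (128, 512) -- 4x the memory per GPU
```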
Can you provide a guideline document for choosing batch_size on multi-GPU systems?
P.S. The prefetch field is also a little strange, but it's not a big deal.