Decouple the computational batch size and minibatch size by accumulating gradients #1663
Conversation
Branch updated from a4d2e6d to d76653a.
With layers whose backward passes accumulate gradients, this effectively decouples the computational batch size from the SGD minibatch size. Each iteration accumulates gradients over `iter_size` batches, then the parameters are updated once.
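A minimal sketch of the idea above (plain Python, not Caffe's actual C++ solver code; the function name and the plain-SGD update rule are illustrative assumptions):

```python
# Sketch of one SGD iteration that accumulates gradients over
# `iter_size` small computational batches before applying a single
# parameter update, decoupling compute batch size from minibatch size.

def sgd_iteration(params, diffs, batches, backward, iter_size, lr):
    """params/diffs are parallel lists of floats; `backward` ADDS each
    batch's gradient contribution into `diffs` (accumulating semantics)."""
    # Zero the accumulators once per SGD iteration, not once per batch.
    for i in range(len(diffs)):
        diffs[i] = 0.0
    # Run backward on iter_size computational batches; gradients sum up.
    for b in range(iter_size):
        backward(params, diffs, batches[b])
    # Average over iter_size so the step matches one large minibatch,
    # then apply the update.
    for i in range(len(params)):
        params[i] -= lr * diffs[i] / iter_size
    return params
```

The effective minibatch is then `iter_size * batch_size` examples per update, while memory usage stays bounded by the smaller computational batch.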
Have we thought about how to handle the case where we're sharing parameters but using different learning rates? I would be okay with simply disallowing that case, since it would probably be a pretty weird thing to do. Otherwise, the only other way I can think of to handle it is pretty messy: we could special-case it so that, e.g., if `blobs_lr` is 2 in one layer but 1 in all others, the Net could prescale the `top_diff` for the layer with `blobs_lr` 2 by a factor of 2... Actually, even that wouldn't work if the layer has other shared param blobs that don't also have the same relative LR.
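A hypothetical numeric illustration of the problem raised above (not Caffe code; all values are made up): with accumulation, the solver only ever sees the summed diff of a shared blob, so distinct per-layer rate multipliers are lost unless the higher-rate layer's contribution is prescaled during backward.

```python
# Two layers share one parameter blob but request different
# learning-rate multipliers (blobs_lr). Compare three updates.

base_lr = 0.1
w = 1.0               # the shared parameter
g1, g2 = 0.3, 0.5     # per-layer gradient contributions
lr1, lr2 = 1.0, 2.0   # desired blobs_lr multipliers

# What separate per-layer updates would compute:
desired = w - base_lr * (lr1 * g1 + lr2 * g2)

# What plain accumulation gives the solver: a single summed diff,
# to which only one multiplier can be applied; lr2 is unrecoverable.
summed = g1 + g2
plain = w - base_lr * lr1 * summed

# The prescaling workaround: scale layer 2's contribution by lr2/lr1
# while accumulating, then apply lr1 uniformly in the solver.
prescaled = g1 + (lr2 / lr1) * g2
fixed = w - base_lr * lr1 * prescaled
```

Here `fixed` matches `desired` for a single shared blob, but as the comment notes, the trick breaks down when a layer shares several param blobs whose relative rates differ, since one `top_diff` scale factor cannot satisfy them all.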
Always accumulating is simple and good, but let's review the weight-sharing and solver issues before merging.
Replaced by #1977.
After #1615, so that this code already supports the deconv layer. (The actual diff is just +37/−40 lines.)
This PRs the gradient accumulation branch living at https://github.com/shelhamer/caffe/tree/accum-grad. I took a lighter approach here than the one there: parameter gradients are always accumulated; there is no other option. The gradient checker is made correct by zero-initing the parameter diffs.
Issues:
- This changes the meaning of `Backward`. External code that used `Backward` is likely to break, if there is any.
- Accumulation may interact with `SGDSolver`, but I haven't thought carefully about that yet.
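The `Backward` concern can be made concrete with a small sketch (an assumed toy API, not actual Caffe bindings): once `Backward` adds into the parameter diffs instead of overwriting them, external callers that invoke it repeatedly without zeroing the diffs see stale gradients pile up.

```python
# Toy model of the behavioral change: backward() uses += on the diff
# rather than =, so callers now own the responsibility of zeroing.

class TinyNet:
    def __init__(self):
        self.param = 2.0
        self.diff = 0.0    # parameter gradient accumulator

    def backward(self, x):
        # Accumulating semantics: add this pass's gradient contribution.
        self.diff += x * self.param

    def zero_diffs(self):
        self.diff = 0.0

net = TinyNet()
net.backward(1.0)
net.backward(1.0)   # no zeroing in between: gradients pile up
stale = net.diff    # 4.0, not the 2.0 a single pass would give

net.zero_diffs()    # correct usage under accumulating semantics
net.backward(1.0)
fresh = net.diff    # 2.0
```

This is also why the gradient checker mentioned above needs zero-inited parameter diffs before each check.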