When sharing weights in a network, using the optim package loses half the gradient updates #156
Comments
Soumith, the solution to this is to also share the gradients: that's what I've been doing in my multiscale conv-nets. Since gradients are always accumulated (and zeroed by the user, typically before each sgd step), it works beautifully. Otherwise, getParameters() has no idea what you're doing, and it can't share things that are not explicitly shared!
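A minimal sketch of that idiom, assuming torch/nn (module names are just illustrative): pass the gradient fields to clone() as well, so both copies accumulate into the same storages.

require 'nn'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
-- share gradWeight/gradBias in addition to weight/bias,
-- so both branches accumulate gradients into the same storage
mlp2 = mlp1:clone('weight', 'bias', 'gradWeight', 'gradBias')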
OK, then the only solution is to not use the optim package for learning embeddings (which requires two samples to have different gradient storages), for example when using HingeEmbeddingCriterion.
I don't understand: if you share the parameters, then you must share their associated gradients. That's the only constraint getParameters() needs in order to return a pair of w and dE/dw that makes sense. In which case is this a problem?
Let's say I am learning embeddings. I take a pair of samples and define whether they are close to each other (+1) or far from each other (-1). Then you use the example mentioned in this section (a rough sketch follows below): it basically clones the network, but with different gradWeights (which makes sense, as you don't want to accumulate the two samples' gradients, but update them separately).
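A rough sketch of that kind of setup, loosely following the nn siamese example (layer sizes and names are arbitrary): the two branches share weights but keep separate gradWeight/gradBias storages.

require 'nn'

-- two branches with shared weights, but separate gradient storages
p1 = nn.Linear(5, 2)
p2 = p1:clone('weight', 'bias')
prl = nn.ParallelTable()
prl:add(p1)
prl:add(p2)

net = nn.Sequential()
net:add(prl)
net:add(nn.PairwiseDistance(1))      -- distance between the two embeddings

crit = nn.HingeEmbeddingCriterion(1) -- target is +1 (similar) or -1 (dissimilar)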
There is a convoluted way of still using the optim package. It was just that I didn't know half of the gradients were being discarded, which is why I reported it as a bug.
No, actually I think it works just fine if you share the gradients as well. Let's say you computed the two gradients dE/dw1 and dE/dw2 separately, and you impose w1 == w2 == w. Your update is going to be:

w = w - lr * (dE/dw1 + dE/dw2) = w - lr * dE/dw

So whether you add these two gradient vectors at update time or while they're computed, it doesn't change anything. When you share the gradients, it just means they'll get accumulated. In other words: you don't want to share the gradients w.r.t. the hidden units, but it's perfectly fine to share the gradients w.r.t. the parameters.
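A small sketch of that equivalence, assuming torch/nn (all names here are illustrative): accumulating the two gradients into a shared storage and taking one step gives the same w as summing them at update time.

require 'nn'

lr = 0.1
m1 = nn.Linear(4, 2)
m2 = m1:clone('weight', 'bias', 'gradWeight', 'gradBias') -- share parameters and gradients

x1, x2 = torch.randn(4), torch.randn(4)
g1, g2 = torch.randn(2), torch.randn(2)      -- stand-ins for dE/dy of each branch

m1:zeroGradParameters()                      -- also zeroes m2 (same storage)
m1:forward(x1); m1:backward(x1, g1)          -- adds dE/dw1 into the shared gradWeight
m2:forward(x2); m2:backward(x2, g2)          -- adds dE/dw2 into the same storage

-- one step on the shared storage is w = w - lr * (dE/dw1 + dE/dw2)
m1.weight:add(-lr, m1.gradWeight)
m1.bias:add(-lr, m1.gradBias)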
That makes sense. Thanks.
OK cool! But thanks for posting about this: at the time we thought really hard about all the possible corner cases of flattening these parameters, but the scenarios are really endless, and we might still be missing some weird use cases. The bug on add() being silent, on the other hand, is quite bad :-(.
When you share certain weights between two networks but keep the gradient storages separate (e.g. to use with the optim package's optimizers, like sgd), the flattened tensors returned by getParameters() are inconsistent with what you expect:
require 'nn'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias') -- cloned network: weights shared, gradients not
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)
w, gw = model:getParameters()
print(w:size())
-- 110
-- [torch.LongStorage of size 1]
print(gw:size())
-- 220
-- [torch.LongStorage of size 1]
We expect w:size() and gw:size() to be the same in this case. When optim.sgd updates the weights in the line
x:add(-clr, dxdw), half of the gradients are discarded in the non-CUDA version (add does not seem to check that the number of elements matches).
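For reference, a sketch of the workaround discussed above applied to this example (assuming torch/nn and optim): share the gradient storages too, so the flattened sizes match and optim.sgd's update touches every gradient.

require 'nn'
require 'optim'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias', 'gradWeight', 'gradBias') -- share gradients as well
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)

w, gw = model:getParameters()
print(w:size())   -- 110
print(gw:size())  -- 110, now consistent with w, so x:add(-clr, dxdw) updates every element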