
When sharing weights in a network, using the optim package loses half the gradient updates #156

Closed
soumith opened this issue Jul 14, 2013 · 8 comments

Comments

@soumith
Member

soumith commented Jul 14, 2013

When you share certain weights between two networks but keep their gradient storages separate, and then call getParameters() to use the result with the optim package's optimizers (such as sgd), the flattened weights and gradients you get back are inconsistent with what you expect.

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias')   -- cloned network, sharing weight/bias only
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)
w, gw = model:getParameters()

print(w:size())
-- 110
-- [torch.LongStorage of size 1]

print(gw:size())
-- 220
-- [torch.LongStorage of size 1]

We expect w:size() and gw:size() to be the same in this case. When optim.sgd updates the weights in the line x:add(-clr, dxdw), half of the gradients are silently discarded with the non-CUDA tensors (add does not seem to check the number of elements).
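For context, a minimal sketch of how such a flattened pair is typically fed to optim.sgd (my own illustration, not part of the original report; it assumes the model, w and gw defined above):

require 'nn'
require 'optim'

-- arbitrary input and gradOutput, just to produce some gradients
local input = {torch.randn(10), torch.randn(10)}
local gradOutput = {torch.randn(10), torch.randn(10)}

local feval = function(x)
   if x ~= w then w:copy(x) end
   gw:zero()
   model:forward(input)
   model:backward(input, gradOutput)
   return 0, gw   -- dummy loss value; only the gradient matters here
end

-- with w of size 110 but gw of size 220, the x:add(-clr, dxdw) update inside
-- sgd only consumes part of the gradient vector (the behaviour reported above)
optim.sgd(feval, w, {learningRate = 0.01})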

@clementfarabet
Member

Soumith,

the solution to this is to also share the gradients:
mlp2=mlp1:clone('weight','bias','gradWeight','gradBias')

That's what I've been doing in my multiscale conv-nets. Since gradients are always accumulated (and zeroed by the user, typically before each sgd step), it works beautifully.

Otherwise, getParameters() has no idea what you're doing, and it can't share things that are not explicitly shared!
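A minimal sketch of this fix, continuing the example from the report above; with the gradient fields shared as well, getParameters() returns consistent sizes:

require 'nn'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias', 'gradWeight', 'gradBias')  -- share grads too
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)
w, gw = model:getParameters()

print(w:size())   -- 110
print(gw:size())  -- 110, now consistent with w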

@soumith
Member Author

soumith commented Jul 14, 2013

OK, then the only solution is to not use the optim package for learning embeddings (which require the two samples to have different gradient storages), for example when using HingeEmbeddingCriterion.

@clementfarabet
Member

I don't understand: if you share the parameters, then you must share their associated gradients. That's the only constraint getParameters() needs in order to return a pair of w + dE/dw that makes sense.

In which case is this a problem?

@soumith
Member Author

soumith commented Jul 14, 2013

Let's say I am learning embeddings: I take a pair of samples and define whether they are close to each other (+1) or far away from each other (-1).

Then, you use the example mentioned in this section:
http://www.torch.ch/manual/nn/index#hingeembeddingcriterion

It is basically cloning the network, but with different gradWeights (which makes sense, as you don't want to accumulate the two samples' gradients, but update them separately); a rough sketch of that setup follows below.
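For reference, here is roughly the kind of siamese setup being described (my own reconstruction using the 2013-era nn API, not the exact code from the docs); note that the clone shares weight/bias but not gradWeight/gradBias, so the two samples' gradients live in separate storages:

require 'nn'

local encoder = nn.Sequential()
encoder:add(nn.Linear(10, 2))

local siamese = nn.ParallelTable()
siamese:add(encoder)
siamese:add(encoder:clone('weight', 'bias'))   -- shared weights, separate gradients

local net = nn.Sequential()
net:add(siamese)
net:add(nn.PairwiseDistance(2))                -- L2 distance between the embeddings

local criterion = nn.HingeEmbeddingCriterion(1)

local x = {torch.randn(10), torch.randn(10)}
local y = -1                                    -- -1: dissimilar pair, +1: similar

local dist = net:forward(x)
local loss = criterion:forward(dist, y)
net:zeroGradParameters()
net:backward(x, criterion:backward(dist, y))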

@soumith
Member Author

soumith commented Jul 14, 2013

There is a convoluted way of still using the optim package (sketched in code below):
1. Don't clone the network.
2. Calculate the outputs of the network w.r.t. the two inputs.
3. Manually calculate the pairwise distance.
4. Forward and backward through the criterion.
5. Zero the gradients of the network.
6. Backward w.r.t. the first error from the criterion.
7. Zero the gradients of the network.
8. Backward w.r.t. the second error from the criterion.

It was just the fact that I didn't know that half of the gradients were being silently discarded that made me report it as a bug.
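A loose sketch of that procedure (my own code, not from the thread). Two liberties taken: the net is re-run on each input right before the corresponding backward, since a module only keeps the state of its last forward call, and the two backward passes are simply allowed to accumulate into the single gradient storage, which, per Clément's point below, gives the same sgd update:

require 'nn'

local net = nn.Sequential()
net:add(nn.Linear(10, 2))

local dister = nn.PairwiseDistance(2)
local criterion = nn.HingeEmbeddingCriterion(1)

local x1, x2 = torch.randn(10), torch.randn(10)
local y = -1

-- outputs of the (single, uncloned) network w.r.t. the two inputs
local o1 = net:forward(x1):clone()
local o2 = net:forward(x2):clone()

-- pairwise distance, then forward/backward through the criterion
local dist = dister:forward({o1, o2})
local loss = criterion:forward(dist, y)
local gradDist = criterion:backward(dist, y)
local gradO = dister:backward({o1, o2}, gradDist)   -- {dE/do1, dE/do2}

-- backward w.r.t. each input, re-forwarding so the module state matches
net:zeroGradParameters()
net:forward(x1); net:backward(x1, gradO[1])
net:forward(x2); net:backward(x2, gradO[2])         -- accumulates into gradWeight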

@clementfarabet
Member

No actually I think it works just fine, by sharing the gradients as well. Let's say you computed these guys separately:
w1, and dE/dw1
w2, and dE/dw2

You impose w1==w2==w. Your update is going to be: w = w - lr * (dE/dw1 + dE/dw2) = w - lr * dE/dw

So whether you add these two gradient vectors at update time, or while they're computed, it doesn't change anything. When you share the gradients, it just means they'll get accumulated.

So in other words: you don't want to share the gradients w.r.t. the hidden units, but it's perfectly fine to share the gradients w.r.t. the parameters.
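A quick numerical check of that argument (my own sketch; sizes are arbitrary): sharing gradWeight/gradBias makes the two samples' gradients accumulate into one storage, and the result equals the sum of separately stored gradients:

require 'nn'

torch.manualSeed(1)
local base = nn.Linear(10, 10)
local x1, x2 = torch.randn(10), torch.randn(10)
local go1, go2 = torch.randn(10), torch.randn(10)   -- arbitrary gradOutputs

-- version A: shared weights, separate gradient storages
local a1 = base:clone()
local a2 = a1:clone('weight', 'bias')
a1:zeroGradParameters(); a2:zeroGradParameters()
a1:forward(x1); a1:backward(x1, go1)
a2:forward(x2); a2:backward(x2, go2)
local summed = a1.gradWeight + a2.gradWeight        -- dE/dw1 + dE/dw2

-- version B: shared weights and shared gradient storages
local b1 = base:clone()
local b2 = b1:clone('weight', 'bias', 'gradWeight', 'gradBias')
b1:zeroGradParameters()                             -- also zeroes b2's shared grads
b1:forward(x1); b1:backward(x1, go1)
b2:forward(x2); b2:backward(x2, go2)

print((summed - b1.gradWeight):abs():max())         -- ~0: same accumulated gradient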

@soumith
Member Author

soumith commented Jul 14, 2013

That makes sense. Thanks.

@soumith soumith closed this as completed Jul 14, 2013
@clementfarabet
Member

OK cool! But thanks for posting about this: at the time we thought really hard about all the possible corner cases of flattening these parameters, but the scenarios are really endless, and we might still be missing some weird use cases.

The bug of add() being silent about the size mismatch, on the other hand, is quite bad :-(.
