When sharing weights in a network, using the optim package loses half the gradient updates #156
Comments
Soumith, the solution to this is to also share the gradients: that's what I've been doing in my multiscale conv-nets. Since gradients are always accumulated (and zeroed by the user, typically before each sgd step), it works beautifully. Otherwise, getParameters() has no idea what you're doing, and it can't share things that are not explicitly shared!
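A minimal sketch of that idiom, assuming torch/nn (module names are just illustrative): pass the gradient fields to clone() as well, so both copies accumulate into the same storages.

require 'nn'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
-- share gradWeight/gradBias in addition to weight/bias,
-- so both branches accumulate gradients into the same storage
mlp2 = mlp1:clone('weight', 'bias', 'gradWeight', 'gradBias')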
OK, then the only solution is to not use the optim package for learning embeddings (which requires two samples to have different gradient storages), for example when using HingeEmbeddingCriterion.
I don't understand: if you share the parameters, then you must share their associated gradients. That's the only constraint getParameters() needs in order to return a pair of w and dE/dw that makes sense. In which case is this a problem?
Let's say I am learning embeddings. I take a pair of samples and define whether they are close to each other (+1) or far from each other (-1). Then you use the example mentioned in this section (a rough sketch follows below): it basically clones the network, but with different gradWeights (which makes sense, as you don't want to accumulate the two samples' gradients, but update them separately).
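A rough sketch of that kind of setup, loosely following the nn siamese example (layer sizes and names are arbitrary): the two branches share weights but keep separate gradWeight/gradBias storages.

require 'nn'

-- two branches with shared weights, but separate gradient storages
p1 = nn.Linear(5, 2)
p2 = p1:clone('weight', 'bias')
prl = nn.ParallelTable()
prl:add(p1)
prl:add(p2)

net = nn.Sequential()
net:add(prl)
net:add(nn.PairwiseDistance(1))      -- distance between the two embeddings

crit = nn.HingeEmbeddingCriterion(1) -- target is +1 (similar) or -1 (dissimilar)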
There is a convoluted way of still using the optim package. It was just that I didn't know half of the gradients were being discarded, which is why I reported it as a bug.
No, actually I think it works just fine if you share the gradients as well. Let's say you computed the two gradients dE/dw1 and dE/dw2 separately, and you impose w1 == w2 == w. Your update is going to be:

w = w - lr * (dE/dw1 + dE/dw2) = w - lr * dE/dw

So whether you add these two gradient vectors at update time or while they're computed, it doesn't change anything. When you share the gradients, it just means they'll get accumulated. In other words: you don't want to share the gradients w.r.t. the hidden units, but it's perfectly fine to share the gradients w.r.t. the parameters.
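A small sketch of that equivalence, assuming torch/nn (all names here are illustrative): accumulating the two gradients into a shared storage and taking one step gives the same w as summing them at update time.

require 'nn'

lr = 0.1
m1 = nn.Linear(4, 2)
m2 = m1:clone('weight', 'bias', 'gradWeight', 'gradBias') -- share parameters and gradients

x1, x2 = torch.randn(4), torch.randn(4)
g1, g2 = torch.randn(2), torch.randn(2)      -- stand-ins for dE/dy of each branch

m1:zeroGradParameters()                      -- also zeroes m2 (same storage)
m1:forward(x1); m1:backward(x1, g1)          -- adds dE/dw1 into the shared gradWeight
m2:forward(x2); m2:backward(x2, g2)          -- adds dE/dw2 into the same storage

-- one step on the shared storage is w = w - lr * (dE/dw1 + dE/dw2)
m1.weight:add(-lr, m1.gradWeight)
m1.bias:add(-lr, m1.gradBias)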
That makes sense. Thanks.
OK cool! But thanks for posting about this: at the time we thought really hard about all the possible corner cases of flattening these parameters, but the scenarios are really endless, and we might still be missing some weird use cases. The bug on add() being silent, on the other hand, is quite bad :-(.
When you share certain weights between two networks but keep the gradient storages separate (e.g. to use with the optim package's optimizers, like sgd), the flattened tensors returned by getParameters() are inconsistent with what you expect:
require 'nn'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias') -- cloned network: weights shared, gradients not
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)
w, gw = model:getParameters()
print(w:size())
-- 110
-- [torch.LongStorage of size 1]
print(gw:size())
-- 220
-- [torch.LongStorage of size 1]
We expect w:size() and gw:size() to be the same in this case. When optim.sgd updates the weights in the line
x:add(-clr, dxdw), half of the gradients are discarded in the non-CUDA version (add does not seem to check that the number of elements matches).
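For reference, a sketch of the workaround discussed above applied to this example (assuming torch/nn and optim): share the gradient storages too, so the flattened sizes match and optim.sgd's update touches every gradient.

require 'nn'
require 'optim'

mlp1 = nn.Sequential()
mlp1:add(nn.Linear(10, 10))
mlp2 = mlp1:clone('weight', 'bias', 'gradWeight', 'gradBias') -- share gradients as well
model = nn.ParallelTable()
model:add(mlp1)
model:add(mlp2)

w, gw = model:getParameters()
print(w:size())   -- 110
print(gw:size())  -- 110, now consistent with w, so x:add(-clr, dxdw) updates every element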