
Fixed contrastive loss layer to be the same as proposed in Hadsell et al 2006 #2321

Merged: 3 commits merged into BVLC:master on May 12, 2015

Conversation

nickcarlevaris

The current contrastive loss layer implements a slightly different loss than the one proposed in Hadsell et al. 2006. This PR updates it so that it matches the original paper. This is in reference to issue #2308.

If d is the distance between two vectors describing a non-matching pair, the current code implements max(margin - d^2, 0), while the loss proposed by Hadsell et al. is max(margin - d, 0)^2.
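To make the difference concrete, here is a minimal sketch of the two formulas for a single non-matching pair (standalone helper functions for illustration only, not the layer's actual code):

```cpp
#include <algorithm>

// Illustrative only: loss contribution of a single non-matching pair,
// where d is the Euclidean distance between the two feature vectors.
double legacy_loss(double d, double margin) {
  // current layer behavior: max(margin - d^2, 0)
  return std::max(margin - d * d, 0.0);
}

double hadsell_loss(double d, double margin) {
  // Hadsell et al. 2006: max(margin - d, 0)^2
  double m = std::max(margin - d, 0.0);
  return m * m;
}
```

Note that in the legacy form the margin compares against a squared distance, so its value has different units than in the Hadsell et al. form.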

@SlevinKelevra and @melgor, you guys can give this PR a try and see if it works better than the current version.

@seanbell

This is great -- I too was bothered that the implementation doesn't match, meaning that if you use caffe in a paper you need to clarify which version you're using. You also need to square the margin in the caffe version (to get the units to match), which can lead to subtle bugs.

A suggestion: rather than overwriting what's already there (which breaks already-saved models), why not add a prototxt parameter option to choose which version gets used?

@melgor

melgor commented Apr 20, 2015

It works similarly to the current version. But I think @nickcarlevaris's version should be merged and the old one deleted, mainly because there is no reference for the current loss function, which may raise a lot of questions: is this version better? who created it?
So I think only one version should exist, the one from Hadsell et al. 2006.

@seanbell

@melgor I understand that the two losses are similar. However, I've already trained networks with the current version, and I would rather not have to maintain a difference with the master branch just to avoid having old functionality deleted. I'm okay with the new default becoming [Hadsell 2006], but I think the old functionality should be obtainable with an option in the prototxt.

@norouzi
Contributor

norouzi commented Apr 21, 2015

Thanks @nickcarlevaris for the update!
I think there is a small problem in computing the gradient though, when the squared distance (dist) becomes very close to zero. This causes the gradient to explode. I suggest adding a small value, e.g. 1e-4, to the denominator in the gradient, i.e., dividing by (dist + 1e-4).
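To illustrate, here is a rough sketch (not the layer's actual code) of the dissimilar-pair gradient for max(margin - d, 0)^2, showing where the division by the distance blows up and where the epsilon would go:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch only: gradient of 0.5 * max(margin - d, 0)^2 with
// respect to feature a for a dissimilar pair, where d = ||a - b||_2.
// The 1/d factor explodes as d -> 0, so a small eps is added to the
// denominator as suggested above.
std::vector<double> dissimilar_pair_gradient(const std::vector<double>& a,
                                             const std::vector<double>& b,
                                             double margin,
                                             double eps = 1e-4) {
  double dist_sq = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double diff = a[i] - b[i];
    dist_sq += diff * diff;
  }
  const double dist = std::sqrt(dist_sq);

  std::vector<double> grad(a.size(), 0.0);
  const double m = margin - dist;
  if (m > 0.0) {
    const double scale = -m / (dist + eps);  // eps prevents division by ~0
    for (std::size_t i = 0; i < a.size(); ++i) {
      grad[i] = scale * (a[i] - b[i]);
    }
  }
  return grad;
}
```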

@nickcarlevaris
Author

@norouzi, I pushed a commit that adds the epsilon to prevent the possibility of dividing by zero.

@seanbell, my inclination would be to overwrite the current cost function with the correct one. When I originally implemented this layer (#959), my intention was to make it match the Hadsell et al paper. It is only because I didn't double check the original paper that I ended up implementing this slightly different cost function. I see this more as fixing an error.

As far as breaking existing models goes, changing the loss layer shouldn't break the deployed version of any model, because it would only affect the training net. Also, it should be pretty easy to fine-tune an existing model from the old cost to the new cost, since a network that is good for one should be pretty good for the other.

That being said, I am not strongly against keeping both. Let's see what the caffe maintainers think would be better.

@seanbell

@nickcarlevaris Ah, I didn't realize you were the one who contributed the original layer as well. I understand that this feels like a "bugfix", but the old code is a valid loss function -- just different. They might be equivalent in performance, maybe not. I agree that the new layer definition makes more sense, and intuitively feels like it should work better.

However, the layer has been around through an entire publication cycle by now. So I think we should treat this as deprecation and not "bugfix". That means keeping the old version and making the new version the default. For example, I already have a public preprint on visual similarity ( http://www.seanbell.ca/tmp/siggraph2015-preprint-bell.pdf -- section 3) that uses the old layer definition. It would be nice if others could reproduce our results. I bet there will be many more at CVPR this year that also use the old layer.

Anyway, sure, let's see what the caffe maintainers think.

@shelhamer
Member

@nickcarlevaris @seanbell while this is a little tricky since the layer has already been adopted by existing work, I think it is best to

  1. switch the loss function to that published in Hadsell et al. 2006 as that was the original intention
  2. add an option to revert to the old behavior for "legacy" papers that made use of the other loss
  3. encourage authors to reproduce their results with the established Hadsell et al. 2006 loss

if you agree. @nickcarlevaris could you add a field to contrastive_loss_param to pick the original or your concave variant? I defer the naming to you, since you created the loss.

Sorry for the wait and more so my apologies for not spotting the discrepancy in the equation the first time around.

@nickcarlevaris
Author

@seanbell @shelhamer, I updated the PR as suggested. You can now get at the old behavior through a "legacy_version" parameter. Let me know if everything looks OK. I can rebase and squash if needed.
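For reference, a sketch of how the option might be selected in a training prototxt (the layer and blob names here are placeholders, not taken from this PR):

```
layer {
  name: "contrastive_loss"   # placeholder names for illustration
  type: "ContrastiveLoss"
  bottom: "feat_a"
  bottom: "feat_b"
  bottom: "sim"
  top: "loss"
  contrastive_loss_param {
    margin: 1.0
    # true selects the old max(margin - d^2, 0) behavior;
    # the default (false) gives the Hadsell et al. 2006 loss
    legacy_version: true
  }
}
```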

shelhamer added a commit that referenced this pull request May 12, 2015
Fixed contrastive loss layer to be the same as proposed in Hadsell et al 2006
@shelhamer shelhamer merged commit 2382b09 into BVLC:master May 12, 2015
@shelhamer
Member

Thanks @nickcarlevaris!

@cancan101

I think the docs need to be updated:

```cpp
/**
 * @brief Computes the contrastive loss @f$
 *          E = \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d +
 *              \left(1-y\right) \max \left(margin-d, 0\right)
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2^2 @f$. This can be
 *          used to train siamese networks.
 *
 * @param bottom input Blob vector (length 3)
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ a \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ b \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times 1 \times 1 \times 1) @f$
 *      the binary similarity @f$ s \in [0, 1]@f$
 * @param top output Blob vector (length 1)
 *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
 *      the computed contrastive loss: @f$ E =
 *          \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d +
 *          \left(1-y\right) \max \left(margin-d, 0\right)
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2^2 @f$.
 *          This can be used to train siamese networks.
 */
```
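Presumably the corrected doc comment would read along these lines (a sketch based on the merged loss, not the exact wording of the fix):

```cpp
/**
 * @brief Computes the contrastive loss @f$
 *          E = \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d^2 +
 *              \left(1-y\right) \max \left(margin-d, 0\right)^2
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2 @f$.
 */
```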

shelhamer added a commit that referenced this pull request Jul 30, 2015
make documented equation match the correct implementation of the
`max(margin - d, 0)^2` term in the loss. see #2321
@shelhamer
Member

@cancan101 thanks for the report -- fixed in 7f70854.
