
Unable to install caffe-future. #1

Closed · ghost opened this issue Jul 29, 2015 · 58 comments

@ghost commented Jul 29, 2015

Hi,

I found out about Caffe-future from the paper on Fully Convolutional Networks (I found the link in the model zoo). I am trying to work on a regression problem where the input to the CNN is a 256 x 256 image and the output the CNN is supposed to produce is also a 256 x 256 image, so a version of caffe that supports fully convolutional networks would be extremely useful for me. In the original version of caffe I was getting an error when I tried setting the stride of a convolutional layer to a float value (for upsampling). I believe the caffe-future version supports float values for stride.

However, while trying to install caffe-future I am facing some issues. I am not sure if I am missing anything. Following is what I tried for installation:

First I cloned the git repository. After that I followed the instructions mentioned in future.sh.
Below are the commands I ran and the outputs I got. The main issue I faced was with the command hub merge BVLC#1977, which gave the error: fatal: Couldn't find remote ref refs/heads/accum-grad

>>> git clone https://github.com/longjon/caffe.git
>>> cd caffe

>>> git checkout master

Already on 'master'
Your branch is up-to-date with 'origin/master'.

>>> git branch -D future

error: branch 'future' not found.

>>> git checkout -b future

Switched to a new branch 'future'

>>> hub merge https://github.com/BVLC/caffe/pull/1976

include/caffe/util/benchmark.hpp | 27 +-
include/caffe/util/coords.hpp | 61 +
include/caffe/util/cudnn.hpp | 128 +
include/caffe/util/db.hpp | 190 ++
include/caffe/util/device_alternate.hpp | 102 +
include/caffe/util/im2col.hpp | 22 +-
...
...
...
matlab/caffe/matcaffe_init.m | 11 +-
.../bvlc_alexnet/deploy.prototxt | 248 +-
models/bvlc_alexnet/readme.md | 25 +
.../bvlc_alexnet/solver.prototxt | 6 +-
.../bvlc_alexnet/train_val.prototxt | 296 ++-
models/bvlc_googlenet/deploy.prototxt | 2156 +++++++++++++++++
...
...
...
tools/test_net.cpp | 54 +-
tools/train_net.cpp | 34 +-
tools/upgrade_net_proto_binary.cpp | 17 +-
tools/upgrade_net_proto_text.cpp | 29 +-
430 files changed, 46179 insertions(+), 11932 deletions(-)
create mode 100644 .Doxyfile
create mode 100644 .travis.yml
create mode 100644 CMakeLists.txt
create mode 100644 cmake/ConfigGen.cmake
create mode 100644 cmake/Cuda.cmake
create mode 100644 cmake/Dependencies.cmake
...
...
...
create mode 100644 src/caffe/util/db.cpp
create mode 100644 src/gtest/CMakeLists.txt
create mode 100644 tools/CMakeLists.txt
create mode 100644 tools/caffe.cpp
delete mode 100644 tools/dump_network.cpp
create mode 100755 tools/extra/parse_log.py

>>> hub merge https://github.com/BVLC/caffe/pull/1977

fatal: Couldn't find remote ref refs/heads/accum-grad

>>> hub merge https://github.com/BVLC/caffe/pull/2086

From git://github.com/longjon/caffe
[new branch] python-net-spec -> longjon/python-net-spec
Auto-merging src/caffe/net.cpp
Removing src/caffe/layers/flatten_layer.cu
Auto-merging matlab/hdf5creation/demo.m
Removing matlab/caffe/read_cell.m
Removing matlab/caffe/print_cell.m
Removing matlab/caffe/prepare_batch.m
Removing matlab/caffe/matcaffe_init.m
Removing matlab/caffe/matcaffe_demo_vgg_mean_pix.m
Removing matlab/caffe/matcaffe_demo_vgg.m
Removing matlab/caffe/matcaffe_demo.m
Removing matlab/caffe/matcaffe_batch.m
Removing matlab/caffe/matcaffe.cpp
Removing matlab/caffe/ilsvrc_2012_mean.mat
Auto-merging include/caffe/vision_layers.hpp
CONFLICT (content): Merge conflict in include/caffe/vision_layers.hpp
Auto-merging include/caffe/neuron_layers.hpp
Auto-merging include/caffe/layer.hpp
Auto-merging include/caffe/common_layers.hpp
Auto-merging examples/net_surgery/bvlc_caffenet_full_conv.prototxt
Automatic merge failed; fix conflicts and then commit the result.

I am unable to compile caffe. Can someone please help me with this issue?

@kashefy commented Aug 3, 2015

@aalok1969, the compilation error you're getting is from a conflict in the vision_layers header. Specifically, the class definition for the CropLayer class got tangled up with the definition of the SPPLayer class, which was merged before @shelhamer submitted his crop-layer PR#1976.

I recently ran into the same problem as you and tried to resolve the conflict with PR #2 to @shelhamer's crop-layer branch. He hasn't responded yet. The PR is basically a merge of a long list of changes from BVLC:master to bring shelhamer:crop-layer up to date with changes there, plus 2 commits resolving the SPPLayer/CropLayer class definition conflict (8ebd41b and fa0cbb2). No logical or functional changes.

If you plan on checking out my PR, can you comment on whether you were able to reproduce the FCN experiments? Thanks.

@aalok1993

Hi @kashefy, thanks a lot for your reply. I tried your version of caffe by running

git clone https://github.com/kashefy/caffe

Then I compiled caffe and everything went smoothly.
But when I tried to train the network, I got the following error. It occurs because in one of my convolutional layers the stride is a float value of 0.5; hence the parser complains "Expected integer." I want to be able to set the stride to a float value in order to upscale the output. How can I do that?

[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format     caffe.NetParameter: 69:13: Expected integer.
F0804 18:13:11.194710 31449 upgrade_proto.cpp:928] Check failed:
ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/aalok/caffe-kashefy/caffe/Over-exposure/MyNet_net.prototxt
*** Check failure stack trace: ***
    @     0x7fd198941daa  (unknown)
    @     0x7fd198941ce4  (unknown)
    @     0x7fd1989416e6  (unknown)
    @     0x7fd198944687  (unknown)
    @     0x7fd198d8d1ae  caffe::ReadNetParamsFromTextFileOrDie()
    @     0x7fd198d7c822  caffe::Solver<>::InitTrainNet()
    @     0x7fd198d7d713  caffe::Solver<>::Init()
    @     0x7fd198d7d8e6  caffe::Solver<>::Solver()
    @           0x40d790  caffe::GetSolver<>()
    @           0x407311  train()
    @           0x405891  main
    @     0x7fd197e53ec5  (unknown)
    @           0x405e3d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

I am a bit new to caffe and GitHub, hence I didn't understand the earlier part of your reply. Can you elaborate a bit on what steps I should take to install caffe-future?

@kashefy commented Aug 4, 2015

@aalok1993, thanks for taking the time to check out my changes.
Re: stride of 0.5: I don't think this is possible, given that the stride member in the ConvolutionLayer class is defined as an int. See vision_layers.hpp#68.

I'm still new to caffe myself and still trying to figure out how things are done. I was able to resolve some of the issues but still haven't figured out an end-to-end process for making things work. I have yet to train one of these FCNs successfully myself...

On how to upscale the output: I don't think you need to worry about floating-point stride values. The FCN models do something similar through the Deconvolution layer. It involves bilinear interpolation, but I'm a bit lost on the details; it might be worth looking up related posts in the caffe-users group. The implementation already exists in caffe; it's just a matter of figuring out the usage.
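For what it's worth, the bilinear kernel itself is simple to generate in numpy. Below is a sketch along the lines of what the FCN reference code does when initializing upsampling weights (the function name and the float cast are my own):

import numpy as np

def upsample_filt(size):
    # bilinear interpolation kernel of shape (size, size)
    factor = (size + 1) // 2
    if size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / float(factor)) * \
           (1 - abs(og[1] - center) / float(factor))

print upsample_filt(4)  # e.g. the 4x4 kernel a 2x upsampling deconv would use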

Re: building caffe-future: my understanding is that the instructions in future.sh are sufficient. The merge conflict that was causing your build error arose because you were merging the PRs into BVLC:master and not longjon:master, which are not in sync at the moment. Did I get that right?

I'll try to respond with something more useful when I've figured out more.

@neurohn commented Aug 7, 2015

I'm facing similar issues. Will update you guys if I find a solution myself. My next avenue is to check out other implementations of FCN using caffe. This is what I came up with:

@kashefy commented Aug 7, 2015

I was able to train the FCN-32s model successfully through fine-tuning. My problem was that the weights of some of the layers of my fully convolutional VGG-16 variant were not being copied correctly. Please find more details under this topic in the caffe-users group. All-zero weights in these layers will only propagate zeros during training.
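For anyone hitting the same copying problem: the step in question is the fc-to-conv parameter transplant from the net surgery example. A minimal sketch, assuming net is the original VGG-16 and net_full_conv its fully convolutional variant, both already loaded with caffe.Net, and assuming the usual layer names (the final scoring layer is typically reinitialized rather than copied):

fc_names = ['fc6', 'fc7']
conv_names = ['fc6-conv', 'fc7-conv']
for fc, conv in zip(fc_names, conv_names):
    # coerce the flat inner-product weights into the conv weight shape
    net_full_conv.params[conv][0].data.flat = net.params[fc][0].data.flat
    net_full_conv.params[conv][1].data[...] = net.params[fc][1].data
net_full_conv.save('vgg16-fullconv.caffemodel')

If this copy is skipped or the shapes don't line up, the conv weights stay at their zero initialization, which is exactly the all-zeros symptom described above.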

Don't you think the deconv layer will upscale your features? If a stride of 1/2 is critical for your algorithm, maybe you can use the deconv layer to upscale by a factor of 2 using nearest-neighbor interpolation (not sure about the details for this); a consecutive conv layer with a stride of 1 would then be equivalent to a 1/2 stride with only a single conv layer. Would this work for you?
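A rough sketch of that deconv-then-conv pattern, written with the python net spec (merged here via BVLC#2086); all layer parameters below are illustrative assumptions, not values from any linked model:

import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.feat = L.DummyData(shape=dict(dim=[1, 64, 128, 128]))  # stand-in feature map
n.up = L.Deconvolution(n.feat, convolution_param=dict(
    num_output=64, kernel_size=4, stride=2, pad=1,
    bias_term=False, weight_filler=dict(type='bilinear')))  # 2x upsampling
n.conv = L.Convolution(n.up, num_output=64, kernel_size=3, stride=1, pad=1)
print str(n.to_proto())  # emits the corresponding prototxt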

@neurohn, yes, please keep us updated. If it's more about concepts and less about the implementation, it may be better to continue that discussion in the caffe-users group.

@aalok1993

Hi @kashefy, thanks a lot for your reply. I was able to upscale using a deconvolution layer with a bilinear weight filler. But I am still facing lots of issues.

Initially I was getting lots of NaNs and Infs in my weight parameters. I modified the learning rates and this problem went away. (I wanted to ask: which parameters should I try modifying to solve this kind of issue?)

After that, the issue I am facing is that when I take an image and pass it forward, most of the blob values come out as zeros and the final image I get is filled with zeros. Also, the weight parameters learned by the network become very large. Below I have described the various outputs in detail.

I am working on a regression problem where my input is a 256X256X3 image and the output is also a 256X256X3 image. In order to isolate the issue, I took a very small architecture (a toy example) consisting of a single convolutional layer, a ReLU layer, and a pooling layer followed by a deconvolution layer. Also, to make it simple, initially I am taking (output label = input data), so currently my network works like an autoencoder. All it has to do is learn an approximation of the identity function, but it fails to do even that. Following are the prototxt files: deploy.prototxt, train_val.prototxt and solver.prototxt.
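For concreteness, the toy architecture described above comes out roughly as follows in the python net spec; the linked prototxt files are authoritative, so the fillers, pads and loss below are my own assumptions, with the layer shapes chosen to match the printouts further down:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 256, 256]))  # stand-in for the real data layer
n.conv1_1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1,
                          weight_filler=dict(type='xavier'))
n.relu1_1 = L.ReLU(n.conv1_1, in_place=True)
n.pool1 = L.Pooling(n.relu1_1, pool=P.Pooling.MAX, kernel_size=2, stride=2)
n.upsample2 = L.Deconvolution(n.pool1, convolution_param=dict(
    num_output=3, kernel_size=4, stride=2, pad=1,
    weight_filler=dict(type='bilinear')))
n.loss = L.EuclideanLoss(n.upsample2, n.data)  # label == data, autoencoder-style
print str(n.to_proto())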

I trained the network for 1000 iterations and used the snapshot as my model. Following is the code and output, which describes what I obtain after 1000 iterations. (NOTE: I have done the training in GPU mode as well as CPU mode, and I get the same result in each case.)

Initializing caffe and Loading the network

caffe.set_mode_cpu()
net = caffe.Net('MyNet_deploy.prototxt', 'snapshots/MyNet_iter_1000.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))  # HxWxC -> CxHxW
net.blobs['data'].reshape(1,3,256,256)
net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image('Train/data/0001.jpg'))
out = net.forward()

The blobs

[(k, v.data.shape) for k, v in net.blobs.items()]

[('data', (1, 3, 256, 256)),
('conv1_1', (1, 64, 256, 256)),
('pool1', (1, 64, 128, 128)),
('upsample2', (1, 3, 256, 256))]

The parameters

[(k, v[0].data.shape) for k, v in net.params.items()]

[('conv1_1', (64, 3, 3, 3)),
('upsample2', (64, 3, 4, 4))]

The conv layer weights

print net.params['conv1_1'][0].data

[[[[ -8.90221119 -19.70544052 -21.9944973 ]
[ -28.27580643 -44.15635681 -51.63126373]
[ -39.88535309 -59.30950165 -62.64734268]]

[[ -9.73268604 -21.16998863 -23.3067379 ]
[ -28.68981361 -45.53733826 -52.59268951]
[ -40.60289001 -60.55314255 -63.270298 ]]

[[ -7.46913862 -18.73158836 -20.8146286 ]
[ -26.17634583 -42.74364471 -49.60507965]
[ -37.86455536 -57.53972244 -60.60445023]]]

...,

[[[-1756.36547852 -1774.34521484 -1799.48950195]
[-1785.36962891 -1828.19641113 -1854.27050781]
[-1797.99133301 -1837.64611816 -1851.94775391]]

[[-1765.79675293 -1784.0411377 -1808.77331543]
[-1794.91149902 -1837.94580078 -1863.44091797]
[-1807.38049316 -1847.3157959 -1861.21154785]]

[[-1588.7590332 -1605.53112793 -1629.23632812]
[-1617.13195801 -1658.57910156 -1683.2623291 ]
[-1629.62780762 -1667.82958984 -1681.2208252 ]]]

The deconv layer weights

print net.params['upsample2'][0].data

[[[[ -0.2453279 -0.42636055 -0.52841532 -0.63897181]
[ -0.75671118 -0.7169919 -0.82515067 -1.17307651]
[ -0.96557409 -0.9307059 -1.03437531 -1.36865413]
[ -1.08291376 -1.28496742 -1.37371349 -1.44269586]]

[[ -0.2445658 -0.42509246 -0.52721226 -0.63792503]
[ -0.75604206 -0.71580309 -0.82402509 -1.17214358]
[ -0.96510863 -0.9297061 -1.03346813 -1.36791492]
[ -1.08263409 -1.28413677 -1.37294734 -1.44208062]]

[[ -0.24634758 -0.42725337 -0.52933705 -0.64002675]
[ -0.75833076 -0.71846634 -0.82661229 -1.17464745]
[ -0.9673087 -0.93222517 -1.03590178 -1.37026465]
[ -1.08475745 -1.28656888 -1.37528908 -1.44430864]]]

...,

[[[-83.92314148 -85.89565277 -86.49584961 -86.15866852]
[-86.43471527 -88.2796402 -88.8900528 -88.71788788]
[-87.11362457 -88.95469666 -89.52527618 -89.28121185]
[-86.79399109 -88.77192688 -89.2594986 -88.69790649]]

[[-83.90159607 -85.87146759 -86.47241974 -86.13829803]
[-86.41383362 -88.25655365 -88.86827087 -88.69919586]
[-87.09313202 -88.93185425 -89.50371552 -89.26304626]
[-86.77391815 -88.74938965 -89.23816681 -88.67989349]]

[[-84.18785858 -86.16223145 -86.7639389 -86.43247986]
[-86.70091248 -88.54869843 -89.16176605 -88.99497223]
[-87.38189697 -89.2256546 -89.79877472 -89.56040192]
[-87.06370544 -89.04449463 -89.53528595 -88.9778595 ]]]

The data blob

print net.blobs['data'].data

[[[[ 0.02745098 0.00784314 0.02745098 ..., 0.19607843 0.14509805
0.11764706]
[ 0. 0.13725491 0.66666669 ..., 0.93725491 0.89803922
0.90980393]
[ 0.3019608 0.87058824 0.99607843 ..., 0.9137255 0.89411765
0.90980393]
...,
[ 0.03921569 0.01960784 0.01176471 ..., 0.08235294 0.07843138
0.07843138]
[ 0. 0. 0. ..., 0.08235294 0.07843138
0.07450981]
[ 0.00392157 0. 0.00392157 ..., 0.07450981 0.07058824
0.07058824]]

[[ 0.03137255 0.01176471 0.03137255 ..., 0.21568628 0.16470589
0.13725491]
[ 0.00392157 0.14509805 0.67450982 ..., 0.95686275 0.91764706
0.92941177]
[ 0.30980393 0.87843138 1. ..., 0.93725491 0.91764706
0.93333334]
...,
[ 0.06666667 0.04705882 0.03921569 ..., 0.1254902 0.12156863
0.12156863]
[ 0.02352941 0.01960784 0.01568628 ..., 0.1254902 0.12156863
0.11764706]
[ 0.03137255 0.01568628 0.02352941 ..., 0.11764706 0.11372549
0.11372549]]

[[ 0.01176471 0. 0.01176471 ..., 0.19215687 0.14117648
0.11372549]
[ 0. 0.1254902 0.65490198 ..., 0.93333334 0.89411765
0.90588236]
[ 0.29803923 0.86666667 0.99215686 ..., 0.92156863 0.90196079
0.91764706]
...,
[ 0.03921569 0.01960784 0.01176471 ..., 0.10196079 0.09803922
0.09803922]
[ 0. 0. 0. ..., 0.10196079 0.09803922
0.09411765]
[ 0.00392157 0. 0. ..., 0.09411765 0.09019608
0.09019608]]]]

The conv1_1 blob

print net.blobs['conv1_1'].data

[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]

The pool1 blob

print net.blobs['pool1'].data

[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]

The upsample2 blob

print net.blobs['upsample2'].data

[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]

Some queries

As can be seen above, the outputs of conv1_1, pool1 and upsample2 are all filled with zeros. It seems like the net is learning to output a blank image irrespective of the input. Also, the weights learned by the FCN contain many large values. I am unable to understand what is causing these issues. Should I change some parameters to solve this problem? How should I solve the problem of large weights? Should I include a large weight decay?
I have run the training in both CPU mode and GPU mode, and in both cases I get the same result, so the problem does not seem to be GPU-related.
Also I wanted to ask: which is the better weight filler for a convolutional layer, xavier or gaussian? And with gaussian, how should the value of std be chosen?
In the next reply I have included the various prototxt files. Could you please help me understand how to solve these issues? Thanks a lot.
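For anyone debugging the same symptom, a quick diagnostic sketch: scan the blobs and weights after a forward pass (net here is the pycaffe net already loaded above):

import numpy as np

for name, blob in net.blobs.items():
    frac = np.count_nonzero(blob.data) / float(blob.data.size)
    print '%-12s nonzero fraction: %.4f' % (name, frac)
for name, params in net.params.items():
    print '%-12s max |weight|: %.2f' % (name, abs(params[0].data).max())

If conv1_1 is already all zeros, a dead ReLU (large negative pre-activations from oversized weights or a too-high learning rate) is the usual suspect.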

@rahman-mdatiqur

Dear @kashefy,
I cloned your caffe version just as @aalok1969 did (git clone https://github.com/kashefy/caffe) and then compiled it successfully. But while fine-tuning the alexnet 32-stride model, I got an error saying that the crop layer is not defined. Please note that after cloning your caffe I did not run future.sh as provided in longjon's caffe, since doing so creates conflicts.

Could you please advise how to make this PR work?

@aalok1993

Dear @kashefy,
By setting the weight decay parameter to be large, I was able to make the weights smaller, but I still get zeros as the output of all the layers. I am not able to understand what is really causing this problem.

@rahman-mdatiqur

Dear @aalok1969 ,

could you please explain how you got @kashefy's caffe working using git clone https://github.com/kashefy/caffe? I cloned and compiled it, but then the deconvolution layer was not found by caffe while running the fine-tuning. I didn't run the future.sh script after cloning @kashefy's caffe, as trying to do so caused PR conflicts.

Thanks.

@aalok1993

Dear @atique81 ,

I had just performed git clone https://github.com/kashefy/caffe
and then compiled caffe following the instructions in the following tutorial.

This worked perfectly fine for me. (NOTE: I didn't run future.sh.)
A sample deploy.prototxt for defining a deconvolution layer can be seen here: deploy.prototxt and train_val.prototxt

@rahman-mdatiqur

Dear @aalok1969,
thank you so much for your reply. I mistakenly mentioned the Deconvolution layer in my last reply, whereas the error generated while running @kashefy's caffe (after doing git clone https://github.com/kashefy/caffe without running future.sh) is a missing crop layer, which is actually PR BVLC#1976, the first merge listed in future.sh.

Did you also use the crop layer as mentioned here: https://gist.github.com/shelhamer/80667189b218ad570e82#file-train_val-prototxt-L559? If so, then I wonder how you could run the fcn fine-tuning from @kashefy's caffe?

@aalok1993

I won't be able to answer that, as I am working on a regression problem and not on segmentation; I didn't require the crop layer for my task.
@kashefy mentioned earlier that he was able to run the code for segmentation, so he would be able to answer that.

@rahman-mdatiqur

Thanks a lot @aalok1969. Waiting for @kashefy to reply...

@rahman-mdatiqur

Dear @kashefy,
could you please explain how I can make all the merges work without any conflicts, so as to run the FCN semantic segmentation as given here (https://github.com/longjon/caffe/tree/future)? I have gone through your detailed post regarding this here (https://groups.google.com/forum/#!msg/caffe-users/3eIMYV0OlY8/zXrCDI3OBAAJ), but I didn't understand step 1. My problem is exactly what @aalok1969 was facing while running future.sh.

I would highly appreciate it if you would kindly reply.

@kashefy commented Aug 12, 2015

@atique81, did you do 'git checkout with_crop' after cloning my fork, or did you merge my PR? Without doing either of these you won't have the CropLayer class defined.

@peiyunh commented Aug 13, 2015

Hi @kashefy, I was trying to reproduce fcn-8s-pascal-deploy.txt. I followed this thread and checked out the with_crop branch of your fork, but caffe still does not recognize the CROP layer.

Part of my error message says:

[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 96:21: Unknown enumeration value of "CROP" for field "type".
F0812 19:45:36.754561  6873 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: fcn-8s-pascal.prototxt

I'm wondering if you still use the CROP layer, or whether you got away with other options. Thanks!

Edit: I made it work and I will come back later with more details.

@rahman-mdatiqur

Dear @kashefy ,

I highly appreciate your feedback. I have just done the following:

git clone https://github.com/kashefy/caffe
git checkout with_fork

But while running the fcn-32s-alexnet prototxt, it generated an error saying something like "reshape not set", which I managed to overcome by following this guideline (BVLC#2834).

Now it's running fine, but the loss seems to be jumping around a lot, though only 2000 iterations have passed (I am training on 1112 images from pascal voc2011).

I will let you know the update once more iterations are finished.
Please let me know if I am still missing anything from your caffe version.

Thanks again for your wonderful support.

@kashefy commented Aug 13, 2015

@atique81, glad to hear you're making progress. I didn't run into the reshape error. So far, I've only trained fcn-32s on the PASCAL-Context dataset by fine-tuning VGG-16 after making it fully convolutional.
Re: loss: this confused me for a while; eventually I was able to see the loss drop after 200 iterations, so pretty early on in the training (from 600K to >100K). It dropped less drastically after that (taking several 10k iterations to drop by 10k). It might be worth running the eval.py script that comes with the pretrained models but plugging in the weights from your snapshots, to gauge how well the network is doing through visual inspection.
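Something along these lines (the paths and the score blob name are placeholders, not the actual eval.py contents):

import caffe
# plug snapshot weights into the eval-time net definition
net = caffe.Net('deploy.prototxt',
                'snapshots/train_iter_2000.caffemodel',
                caffe.TEST)
net.forward()
# then visualize e.g. net.blobs['score'].data[0].argmax(0) as eval.py does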


@rahman-mdatiqur

Dear @kashefy ,

It's now at 6000 iterations, and the loss is jumping between 0.15 and 0.8. I hope it will become more stable once more iterations have passed.

Thanks for all your cordial help.

@peiyunh commented Aug 14, 2015

Hi @kashefy and @atique81, are you training with a single image every iteration or in mini-batches? If you are training in mini-batches, since the aspect ratios differ across images, did you guys write some code for data preparation? Right now I pad all images to 500x500 to make sure they are the same size before processing them in batches, but I am wondering if there are any built-in functions in Caffe for this. I'm kind of new to Caffe and still learning the basics. Thanks!

@rahman-mdatiqur

Hi @Eric-Phu,
as per the guidelines provided for FCN semantic segmentation, I am training in mini-batches of size 1; that's why it doesn't require resizing the inputs. I am also very new to Caffe, but I guess if you have a look at the imagenet tutorial (http://caffe.berkeleyvision.org/gathered/examples/imagenet.html), you will get to know how to feed caffe with resized inputs, or how input data layers can resize inputs automatically.
Thanks

@peiyunh commented Aug 14, 2015

Hi @atique81, thanks for your reply! Yeah, right now I'm sticking with the explicit resizing strategy as in the imagenet tutorial. Still, it's weird that I cannot use a batch size like 20 as in the original FCN paper. I posted the memory issue in the Google group; check it out if you'd like. BTW, what kind of speed do you get when training FCN-32s?

@rahman-mdatiqur

Dear @Eric-Phu ,

I am running FCN-32s on an Nvidia GeForce GTX 980 GPU with 4GB memory. It's taking approximately 1.04 sec for one complete forward+backward pass.

Just curious to know how you resized your ground truth images, since unlike the training images, a simple interpolation method won't work for ground truth labels (it would create new class numbers).

Could you please elaborate on this?

@peiyunh commented Aug 14, 2015

Hi @atique81 , thanks for your reply!

Your speed is pretty nice. Mine is 5 sec per image on a Tesla K40c, which makes me wonder if I did something wrong. Did you set the 'group' for deconvolution? I saw people posting about it, but whenever I set the group to 60 by adding group: 60 (the same number as num_output), Caffe crashed. The error message is

F0813 21:56:23.929918 11210 blob.cpp:455] Check failed: ShapeEquals(proto) shape mismatch (reshape not set)

You mentioned this above; I wonder if it's caused by adding group.

Actually, you do not need to resize the images, since the images in Pascal VOC have their longer side equal to 500. So all you have to do is pad the other side with mean RGB values, which at least is what I did. For ground truth labels, you may want to pad with zero, which represents the background.
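A sketch of that padding in numpy (the mean color argument and the zero-for-background convention are assumptions to adapt to your setup):

import numpy as np

def pad_to_500(im, label, mean_rgb):
    # pad an HxWx3 image with the mean color, and its HxW label map with 0
    h, w = im.shape[:2]
    im_pad = np.empty((500, 500, 3), dtype=im.dtype)
    im_pad[...] = mean_rgb  # fill with the dataset mean RGB
    im_pad[:h, :w] = im
    lab_pad = np.zeros((500, 500), dtype=label.dtype)  # 0 == background
    lab_pad[:h, :w] = label
    return im_pad, lab_pad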

@rahman-mdatiqur

Yes, I did. As far as I know (from this source: http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1BilinearFiller.html), the deconvolution layer weight dimensions should be Cx1xKxK, where C is both num_output and the group value. Just check that link.
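With group == num_output == C, each input map gets its own 1xKxK kernel, so the whole blob can be filled with a single bilinear kernel by broadcasting; a sketch, reusing an upsample_filt helper like the one posted earlier in this thread ('upsample2' is a placeholder layer name):

k = net.params['upsample2'][0].data.shape[-1]            # kernel size K
net.params['upsample2'][0].data[...] = upsample_filt(k)  # broadcasts over (C, 1, K, K)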

I am not sure whether padding parts of the images with mean RGB values, and the corresponding parts of the labels with background (class 0), will work or not. I once needed to resize the pascal images and labels, and someone advised me to resize the labels based on a voting principle instead of interpolation, which I was not sure about. That's why I went for single-image batch training.

@peiyunh commented Aug 14, 2015

Thanks for pointing out the link; I checked it out. It's just weird that even though I used the exact same protobuf snippet and replaced factor with 64, it still does not work. I know this is too much to ask, but would you care to share your network (e.g. the train_val.prototxt) via a gist or something? Thanks!

About padding vs. interpolation, I'm actually not sure which is the right way to go, since the original paper did not talk about it either.

@peiyunh commented Aug 14, 2015

Hi @atique81, I found out why I could not make the group thing work: I did not re-run the net surgery code after making changes to the train_val.prototxt. By using group, I save a little bit of memory and it is much faster than before, although I still cannot use a batch size of 20.

Another gotcha from Caffe. Anyway, problem solved. Thanks!

@rahman-mdatiqur

Hi @Eric-Phu, nice to know that you solved it.

@rahman-mdatiqur

Hi @kashefy ,

now that I have been able to train the fcn 32-stride model on pascal voc2011, I am trying to test the net on the pascal voc2011 validation data. But unfortunately caffe exits with insufficient memory. Please note that I trained the model with batch size 1, and during testing the batch size is also fixed at 1.

Could you please advise what is going wrong?

Thanks

@aalok1993

Hi @kashefy

I am facing a problem while training.
All the outputs are coming out as zeros, and the testing error doesn't decrease at all. Do you know what might be causing this problem? I posted the detailed problem and the outputs of all the layers earlier in this thread. Thanks.

@shelhamer (Collaborator)

longjon/caffe:future has been rebased on BVLC/caffe:master, so the merge conflicts that have been brought up should be settled.

@lolz0r commented Aug 25, 2015

@shelhamer Thanks for updating this! Question: does future.sh still need to be run? If so, it is still causing issues with merging of the vision layer.

@shelhamer (Collaborator)

No, just check out longjon/caffe:future and use it as-is.

@aivision2020

@aalok1993 when you modified the model, did you set a weight_filler? (The default is to set the weights to zero, a stupid default if I know one.) Check out gaussian, xavier...
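For instance, with the python net spec (the layer and all its parameters here are purely illustrative):

import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.DummyData(shape=dict(dim=[1, 3, 256, 256]))
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1,
                        weight_filler=dict(type='xavier'),  # or dict(type='gaussian', std=0.01)
                        bias_filler=dict(type='constant', value=0))
print str(n.to_proto())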

@aivision2020

@atique81 @kashefy I really think setting the batch size to 1 is a big mistake. Remember, no image contains examples of all the classes, so the gradient will be skewed. If memory is the problem (and it is), you can use iter_size; check out
https://groups.google.com/forum/#!topic/caffe-users/PMbycfbpKcY

I'm using iter_size = 20 and batch_size = 1.
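In the solver that is a single extra field; a sketch via the protobuf bindings, showing only the relevant setting:

from caffe.proto import caffe_pb2

s = caffe_pb2.SolverParameter()
s.iter_size = 20  # accumulate gradients over 20 forward/backward passes
# with batch_size 1 in the net, each update then sees an effective batch of 20
print str(s)

The equivalent solver.prototxt line is simply iter_size: 20.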

@aivision2020

Quick question: does the loss layer treat "background" differently, or is it just another class?

@shelhamer (Collaborator)

@aivision2020 actually we've found batch_size == 1 to be effective when paired with high momentum. See the PASCAL-Context FCN in the model zoo: https://gist.github.com/shelhamer/80667189b218ad570e82#file-readme-md

Eventually the arXiv paper will be updated with more comments on this.

@FishermanZzhang

Dear @aalok1993,
I want to see your .prototxt, but when I click the link it goes wrong. Could you send me the two files, solver.prototxt and train_val.prototxt, to 14120452@bjtu.edu.cn? Thank you very much.

@aalok1993

Hi, I sent you the 3 files by mail.

@FishermanZzhang

OK, I got it. Thank you very much @aalok1993

@bruceko commented Nov 3, 2015

Hi, I am wondering which step I did wrong such that I cannot run eval.py from FCN-32s (https://gist.github.com/shelhamer/80667189b218ad570e82#file-readme-md).

I did git clone https://github.com/longjon/caffe/tree/future
git checkout future
make
...
Then I had the image paths put and set, but I still get an error saying unknown layer Crop when trying to run eval.py.
I checked caffe.proto under caffe_root/src/caffe/proto and I saw the crop setting in it.
Could anyone tell me which step I did wrong and how to fix it?
Thanks.

@ghost (Author) commented Nov 5, 2015

@bruceko Hi, I have met the same problem as you. I also checked caffe.proto but I cannot find a crop_param like the ones other layers have, such as sigmoid_param and softmax_param. So what do you see in caffe.proto that is related to the CropLayer?

I wonder if I have mis-installed the future release. I only downloaded and unzipped caffe-future.zip, then used the Makefile.config as in other Caffe branches and ran make all.

Well, have you got your problem fixed? Do you have any suggestions? Thanks!

@bruceko commented Nov 5, 2015

@Jianchao-ICT I haven't figured out how to solve the problem yet.
Jon posted how he added a new layer to Caffe in BVLC#684, and now you can find it at https://github.com/BVLC/caffe/wiki/Development.
I think there is no parameter for the crop layer, so you cannot find one.
(You might only be able to find the ID for it in caffe.proto.)
However, you can find the crop layer defined in vision_layers.hpp.

@ghost (Author) commented Nov 5, 2015

@bruceko Thanks! Well, I have noticed that CropLayer has no parameters, and I also see the CROP type in caffe.proto now. Just hoping to get the problem fixed. Thank you for the nice links 👍

@ghost (Author) commented Nov 5, 2015

@bruceko In fact, I wonder whether I have installed caffe-future correctly. I just downloaded, unzipped and made caffe-future.zip without using future.sh (someone seemed to mention that it should be used).

@bruceko commented Nov 5, 2015

@Jianchao-ICT I only tried the steps I posted to install caffe-future.
I skipped running future.sh because I got into some trouble with it last time.
Since I had already checked the development guide, I just ignored that.

@ghost (Author) commented Nov 5, 2015

@bruceko I tried to run git clone https://github.com/longjon/caffe/tree/future in the Linux terminal, but the following error appears. Have you run into it?

Cloning into 'future'...
p11-kit: invalid config filename, will be ignored in the future: /etc/pkcs11/modules/gnome-keyring-module
fatal: repository 'https://github.com/longjon/caffe/tree/future/' not found

@bruceko commented Nov 5, 2015

@Jianchao-ICT I'm using Ubuntu and I don't have that problem.
I think you are using another Linux system; you might want to look at this: http://forum.mepiscommunity.org/viewtopic.php?f=94&t=36357

@ghost (Author) commented Nov 5, 2015

@bruceko Well, it seems that my Linux is also Ubuntu?

lijianchao@cuda-server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.5 LTS
Release:    12.04
Codename:   precise

@ghost (Author) commented Nov 5, 2015

@kashefy Hi, I have read your detailed comments above. Now I am just trying to run the eval.py script of FCN-32s, and I encounter a problem which says that Crop is an unknown layer to Caffe. I checked the files related to the CropLayer and found nothing wrong. My problem is posted in this issue. Could you help me with it? Thanks!

@ghost (Author) commented Nov 5, 2015

@bruceko Hi, I have found the reason why Caffe reports the CropLayer as unknown on my machine: I have another compiled caffe-master checkout on my machine. In eval.py, when import caffe is executed, the caffe module of caffe-master is imported, so it cannot recognize the CropLayer. You may verify this by printing help(caffe) and checking the path information. Anyway, I have just noticed this and am still trying to fix it.
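A quick way to check which build is actually being imported:

import caffe
print caffe.__file__  # should point into caffe-future/python, not caffe-master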

@bruceko commented Nov 5, 2015

@Jianchao-ICT Thanks for the information; I do have the same problem. I installed several repos of Caffe and didn't make the distribution for each of them. I could run eval.py by changing the path in .bashrc:

export CAFFE_HOME=${HOME}/caffe

Change caffe to caffe-future or whichever folder you have. Hope the solution I used helps you too. I do get some warnings with it, though:

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 597011289

I am working on other stuff, so I might not be able to solve this problem with you. Hope you get your results soon, and then you might be able to help me.

@ghost (Author) commented Nov 6, 2015

@bruceko Yes, I changed PYTHONPATH and eval.py works now. BTW, I think the warning message is simply because the FCN-32s model is so large; there is nothing wrong with your code.

longjon pushed a commit that referenced this issue Dec 28, 2015
@bhack commented Jan 11, 2016

@longjon @shelhamer Any plan to merge to master with a PR?

@thuanvh commented Jan 20, 2016

Hi all and @aalok1993,
I have the same problem: my training output is always zero and the training loss does not decrease.
Do you have any suggestions?
Thank you all,
Thuan

@CarrieHui

@shelhamer Hi, I encountered conflicts when I merged PR BVLC#2016; it says "Automatic merge failed". Should I fix the conflicts manually? Thanks in advance.

@ahundt commented Apr 5, 2016

Equivalent code has already been merged to master in github.com/BVLC/caffe, in case people here weren't aware.

@shelhamer (Collaborator)

Hey all,

Check out the fcn.berkeleyvision.org repo for master editions of the reference networks, weights, and code for learning, inference, and scoring.

Closing this issue since the future branch is now deprecated.
