Unrolled recurrent layers (RNN, LSTM) #2033
Firstly, thanks for the fantastic code. I had been playing with my own LSTM when I found this PR, and it is above and beyond any of my own attempts. Really nice job. There seems to be a bug in the ReshapeLayer of this PR: in some cases, the ReshapeLayer will produce all zeros instead of actually copying the data. I've created a minimal test case that shows this failure for this PR:
The model above loads a random dataset, ToyData_1. It then reshapes it to the exact same size (an identity reshape) to create ToyData_2. We would expect that || ToyData_1 - ToyData_2 ||_2 == 0. However, if you train with the above model on this branch, you will see that the Euclidean loss between ToyData_1 and ToyData_2 is non-zero. Moreover, the loss between ToyData_2 and a blob of all zeros is zero. Note that, as expected, the loss between ToyData_1 and all zeros is non-zero.
It seems there is a bug in Reshape. I've fixed it here by copying an older version of Reshape into this branch: https://github.com/cvondrick/caffe/commit/3e1a0ff73fef23b8cb8adc8223e0bb2c8900e56b
Unfortunately, I didn't have time to write a real unit test for this, but I hope this bug report helps. The same issue occurs in #2088.
Carl |
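For readers who want to see the expected invariant spelled out, here is a minimal sketch in plain Python. It is not Caffe's ReshapeLayer and not the prototxt from the report; `reshape` and `l2_distance` are hypothetical stand-ins, and the blob shape is an arbitrary assumption. It only illustrates the three comparisons described in the report: an identity reshape should give a distance of exactly zero, while the distance to an all-zeros blob should be non-zero.

```python
import math
import random


def reshape(flat, shape):
    """Hypothetical stand-in for a reshape: copy the data, check the count."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(flat):
        raise ValueError("reshape must preserve the element count")
    return list(flat), tuple(shape)


def l2_distance(a, b):
    """Euclidean distance between two flat buffers of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


random.seed(0)
shape = (4, 3, 2)  # arbitrary shape chosen for the illustration
toy_data_1 = [random.gauss(0.0, 1.0) for _ in range(4 * 3 * 2)]
toy_data_2, new_shape = reshape(toy_data_1, shape)  # identity reshape
zeros = [0.0] * len(toy_data_1)

identity_loss = l2_distance(toy_data_1, toy_data_2)  # should be exactly 0
zero_loss = l2_distance(toy_data_2, zeros)           # should be non-zero
```

The bug described in the report corresponds to the opposite outcome: a non-zero `identity_loss` and a zero `zero_loss`.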
Well that's disturbing... I don't have time to look into it now but thanks
|
Oops, failed to read to the end and see that you already had a fix. Thanks
|
Thanks Jeff -- yeah, we fixed it by copying a ReshapeLayer from somewhere else. Unfortunately, we have lost track of exactly where that layer came from, but I'm sure somebody here (maybe even you) wrote it at some point. |
When is this feature going to be ready? Is there something to be done? |
For the captioning model, can anyone show me how to generate captions after the training is done? Current LSTM layers process the whole input sequence (20 words in the coco example) across time, but we need to generate one by one at each time step (current time step is the input to the next). |
I've just tried to run train_lrcn.sh (after running coco_to_hdf5_data.py and other scripts) and I get a "dimensions don't match" error:
The stack-trace and log are here: http://pastebin.com/fWUxsSmv I've uncommented line 471 in net.cpp to find the faulty layer (the only modification). It seems it happens in lstm2 which blends input from the language model and from the image CNN. train_language_model.sh runs fine without errors. Ideas? |
By the way, does Caffe's recurrent layer support bi-directional RNN? |
Both the factored and unfactored setups are affected. There seem to be dimension problems when blending the CNN input with the embedded text input. |
I have the same question as @thuyen. My understanding is that the current unrolled architecture slices an input sentence and feeds the resulting words to all time steps at once. So, for both train and test nets, the ground truth sentences are fed to the unrolled net. However, when captioning an image, there is no sentence to give to the net, and I don't think it is correct to give the start symbol to each layer. Did I miss anything? |
The dimension check fails for the static input (the image feature) with size 100 x 4000 vs. 1 x 100 x 4000. It seems to be caused by the Reshape layer; @cvondrick's fix seems to solve this. |
Yes, as noted by @cvondrick, this works with the older version of the ReshapeLayer which puts everything in |
You can create a bi-directional RNN using 2 RNN layers and feeding one the input in forward order and the other the input in backward order, and fusing their per-timestep outputs however you like (e.g. eltwise sum or concatenation). |
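A rough sketch of the recipe above in plain Python, under loud assumptions: `run_rnn` is a toy scalar linear recurrence standing in for an RNN layer, not Caffe's RNNLayer, and the weights are made up. The point it illustrates is that the second RNN's outputs come back in reversed time order and have to be flipped again before the per-timestep fusion (here an elementwise sum).

```python
def run_rnn(inputs, weight, recurrent_weight):
    """Toy recurrence h_t = weight * x_t + recurrent_weight * h_{t-1}."""
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = weight * x + recurrent_weight * hidden
        outputs.append(hidden)
    return outputs


def bidirectional(inputs):
    """Fuse a forward pass and a backward pass by per-timestep summation."""
    forward = run_rnn(inputs, weight=1.0, recurrent_weight=0.5)
    # Feed the second RNN the sequence in reverse order ...
    backward = run_rnn(inputs[::-1], weight=1.0, recurrent_weight=0.5)
    # ... and reverse its outputs so index t lines up with timestep t again.
    backward = backward[::-1]
    return [f + b for f, b in zip(forward, backward)]


fused = bidirectional([1.0, 2.0, 3.0])
```

In this toy setting the whole sequence is visible at once, so the reversal is a simple slice; a network layer would need the sequence boundaries to do the same per stream.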
Thanks @jeffdonahue, training LRCN now works! Same question as @thuyen and @ritsu1228: does anyone have an idea how to hook into the point where the first word after the start symbol is produced, and put the next symbol into the input_sentence tensor in memory before the next round of the unrolled net runs? |
As @jeffdonahue mentioned, a bidirectional RNN can be built with two RNNs. It's easy to prepare the reversed input sequence, but how do you reverse the output of one RNN when fusing the two RNN outputs in Caffe? It seems no layer does the reversal. |
True; one would need to implement an additional layer to do the reversal. You'd also need to be careful to ensure that your instances do not cross batch boundaries (as is allowed by my implementation as it works fine for unidirectional architectures) since inference at each timestep is dependent on all other timesteps in a bidirectional RNN.
In the not-too-distant future I'll add code for evaluation, including using the model's own predictions as input in future timesteps as you mention. |
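Until that evaluation code exists, the idea asked about in this thread (feeding the model's own prediction back in as the next input) can be sketched as a greedy loop in plain Python. Everything here is an assumption for illustration: `step` is a hypothetical stand-in for a single-timestep (T == 1) forward pass, `NEXT_WORD_SCORES` is a made-up lookup table rather than a trained model, and the word indices are arbitrary. The continuation indicator is 0 on the first step and 1 afterwards, following the convention in the PR description.

```python
START, END = 0, 3  # hypothetical vocabulary indices

# Made-up scores over a 4-word vocabulary, keyed by the previous word.
NEXT_WORD_SCORES = {
    0: [0.0, 0.9, 0.1, 0.0],
    1: [0.0, 0.0, 0.8, 0.2],
    2: [0.0, 0.1, 0.0, 0.9],
}


def step(prev_word, cont, state):
    """Stand-in for one timestep of the net; cont == 0 resets the state."""
    state = (state if cont else []) + [prev_word]
    return NEXT_WORD_SCORES[prev_word], state


def greedy_caption(max_length=20):
    """Generate words one at a time, feeding each prediction back in."""
    word, state, caption = START, [], []
    for t in range(max_length):
        cont = 0 if t == 0 else 1  # new sequence at t == 0, continue after
        scores, state = step(word, cont, state)
        word = max(range(len(scores)), key=scores.__getitem__)
        if word == END:
            break
        caption.append(word)
    return caption


caption = greedy_caption()
```

Beam search or sampling would replace the `max` call; the feedback structure of the loop stays the same.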
I've also gotten a number of questions on the optional third input to RecurrentLayer -- I've added some clarification in the original post:
|
Thanks for the fantastic code. However, the Reshape function in RecurrentLayer confuses me. When passing data from output_blobs_ to the top blobs, why is it
output_blobs_[i]->ShareData(*top[i]);
output_blobs_[i]->ShareDiff(*top[i]);
rather than
top[i]->ShareData(*output_blobs_[i]);
top[i]->ShareDiff(*output_blobs_[i]);
It seems that the top blobs are just reshaped and left empty. The original code is here:
template <typename Dtype>
void RecurrentLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
CHECK_EQ(top.size(), output_blobs_.size());
for (int i = 0; i < top.size(); ++i) {
top[i]->ReshapeLike(*output_blobs_[i]);
output_blobs_[i]->ShareData(*top[i]);
output_blobs_[i]->ShareDiff(*top[i]);
}
x_input_blob_->ShareData(*bottom[0]);
x_input_blob_->ShareDiff(*bottom[0]);
cont_input_blob_->ShareData(*bottom[1]);
if (static_input_) {
x_static_input_blob_->ShareData(*bottom[2]);
x_static_input_blob_->ShareDiff(*bottom[2]);
}
} |
Ah, I didn't know the HDF5OutputLayer worked that way, I see... sounds a little scary, but might work... good luck! |
@shaibagon Thanks for the highlight, but I struggle to see how to handle signals with different lengths (i.e. numbers of timesteps) during training using NetSpec. I can't change my unrolled net architecture during training... |
@fl2o AFAIK, if you want exact backprop for recurrent nets in caffe, there's no way around explicitly unrolling the net across ALL time steps. Regarding working with very long sequences:
Can you afford all these blobs in memory at once? |
@jeffdonahue BTW, is there a reason why this PR is not merged into |
@shaibagon I am gonna try padding shorter sequences with some "null" data/label (Should I use a special term or just 0 ?) in order to avoid the gradient estimation problem, but I am not sure yet about the memory issue..! (maxT will be around 400! while minT ~50) |
@fl2o I'm not certain just using 0 is enough. You want no gradients to be computed from these padded time steps. You might need to have an "ignore_label" and implement your loss layer to support "ignore_label". |
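To make the padding suggestion concrete, here is a small plain-Python sketch of a loss that skips padded timesteps. It is an illustration under assumptions, not a Caffe loss layer: the sentinel value `IGNORE_LABEL = -1`, the squared-error loss, and the helper names are all hypothetical. What it shows is that padded positions contribute neither to the loss nor to the gradient, and that normalization counts only the real timesteps.

```python
IGNORE_LABEL = -1  # hypothetical sentinel marking padded timesteps


def pad(labels, max_t):
    """Pad a label sequence up to max_t timesteps with the sentinel."""
    return labels + [IGNORE_LABEL] * (max_t - len(labels))


def masked_squared_loss(predictions, labels):
    """Return (loss, gradients), ignoring timesteps labelled IGNORE_LABEL."""
    total, count, grads = 0.0, 0, []
    for p, y in zip(predictions, labels):
        if y == IGNORE_LABEL:
            grads.append(0.0)  # no gradient flows from padded timesteps
            continue
        total += 0.5 * (p - y) ** 2
        grads.append(p - y)
        count += 1
    norm = max(count, 1)  # avoid dividing by zero on all-padding input
    return total / norm, [g / norm for g in grads]


labels = pad([1, 0], max_t=4)
loss, grads = masked_squared_loss([0.5, 0.5, 0.9, 0.1], labels)
```

Padding the labels with a plain 0 instead would make the last two predictions above contribute to both the loss and the gradient, which is the problem raised in the comment.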
That's what I was wondering .... |
@fl2o In the future, I think it would be best to keep this GitHub issue thread for PR-related comments only. For more general inquiries and questions about LSTM in Caffe, it might be better to ask a question on Stack Overflow. |
@shaibagon Cheers for all the helpful comments. |
Hi, I used the LRCN code to generate captions from an image. I replaced AlexNet with GoogLeNet. The result looks like this: |
print ('Exhausted all data; cutting off batch at timestep %d ' +
       'with %d streams completed') % (t, num_completed_streams)
for name in self.substream_names:
  batch[name] = batch[name][:t, :]
words at timestep t might not be deleted:
batch[name] = batch[name][:(t+1), :]
Could anyone tell me what's the difference between C_diff and C_term_diff in the backward_cpu function? I'm trying to understand the code and write a GRU version. Thanks in advance! const int num = bottom[0]->shape(1); |
@jeffdonahue When captioner.py generates a sentence, does it use only the single previous word to generate the current word, rather than all the previous words? |
Does anyone know if there is a pre-trained image captioning LRCN model out there? I'd greatly appreciate it if this were included in the Model Zoo. @jeffdonahue: would you be able to release the model from your CVPR'15 paper? |
Has this branch landed in master? The layers are in master, but it seems the examples are not there. Could anyone point me to the right way to get this branch? I did git pull #2033, but it just showed Already up-to-date. |
@anteagle it seems like the PR only contained the LSTM RNN layers and not the examples (too much to review). You'll have to go to Jeff Donahue's "recurrent" branch. |
@shaibagon Thanks, I got it from Jeff's repo, though it has not been updated for a while. |
Closing with the merge of #3948 -- though this PR still contains examples that PR lacked, and I should eventually restore and rebase those on the now merged version. In the meantime I'll keep my |
Hello, I have a question. When I read the file 'lstm_layer.cpp', I find a lot of 'add_top', 'add_bottom', and 'add_dim' calls, but I can't find their definitions in the caffe folder. Could you tell me where I can find them, and what code such as 'add_bottom("c_" + tm1s);' means? |
The methods you refer to are all automatically generated by protobuf. See |
Oh, thank you very much. I did not find this file (caffe.pb.h) because I hadn't compiled it before! |
Hi, is there any working example of the layer in caffe? |
The same question, is there any working example of the layer in caffe? |
@cuixing158 @soulslicer See jeffdonahue's example for the COCO image captioning task. Check out his Caffe branch and you will find the example. |
(Replaces #1873)
Based on #2032 (adds EmbedLayer -- not needed for, but often used with RNNs in practice, and is needed for my examples), which in turn is based on #1977.
This adds an abstract class RecurrentLayer intended to support recurrent architectures (RNNs, LSTMs, etc.) using an internal network unrolled in time. RecurrentLayer implementations (here, just RNNLayer and LSTMLayer) specify the recurrent architecture by filling in a NetParameter with appropriate layers.
RecurrentLayer requires 2 input (bottom) Blobs. The first -- the input data itself -- has shape T x N x ... and the second -- the "sequence continuation indicators" delta -- has shape T x N, each holding T timesteps of N independent "streams". delta_{t,n} should be a binary indicator (i.e., value in {0, 1}), where a value of 0 means that timestep t of stream n is the beginning of a new sequence, and a value of 1 means that timestep t of stream n is continuing the sequence from timestep t-1 of stream n. Under the hood, the previous timestep's hidden state is multiplied by these delta values. The fact that these indicators are specified on a per-timestep and per-stream basis allows for streams of arbitrary different lengths without any padding or truncation. At the beginning of the forward pass, the final hidden state from the previous forward pass (h_T) is copied into the initial hidden state for the new forward pass (h_0), allowing for exact inference across arbitrarily long sequences, even if T == 1. However, if any sequences cross batch boundaries, backpropagation through time is approximate -- it is truncated along the batch boundaries.
Note that the T x N arrangement in memory, used for computational efficiency, is somewhat counterintuitive, as it requires one to "interleave" the data streams.
There is also an optional third input whose dimensions are simply N x ... (i.e. the first axis must have dimension N and the others can be anything) which is a "static" input to the LSTM. It's equivalent to (but more efficient than) copying the input across the T timesteps and concatenating it with the "dynamic" first input (I was using my TileLayer -- #2083 -- for this purpose at some point before adding the static input). It's used in my captioning experiments to input the image features as they don't change over time. For most problems there will be no such "static" input and you should simply ignore it and just specify the first two input blobs.
I've added scripts to download COCO2014 (and splits), and prototxts for training a language model and LRCN captioning model on the data. From the Caffe root directory, you should be able to download and parse the data by doing:
Then, you can train a language model using ./examples/coco_caption/train_language_model.sh, or train LRCN for captioning using ./examples/coco_caption/train_lrcn.sh (assuming you have downloaded models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel).
Still on the TODO list: upload a pretrained model to the zoo; add a tool to preview generated image captions and compute retrieval & generation scores.
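Since the T x N layout and the continuation indicators described above tend to cause confusion, here is a minimal plain-Python sketch of how variable-length sequences might be packed into N streams. It is an illustration under assumptions, not the data-preparation code from this PR: `pack_streams` and `run_stream` are hypothetical helpers, the shortest-stream packing policy is arbitrary, sequences are assumed non-empty, and the recurrence is a toy running sum. It shows delta being 0 at the first timestep of each sequence and 1 elsewhere, the interleaved memory order, and the carried state being multiplied by delta.

```python
def pack_streams(sequences, num_streams, num_timesteps):
    """Lay sequences out as T x N data plus T x N continuation indicators."""
    streams = [[] for _ in range(num_streams)]
    conts = [[] for _ in range(num_streams)]
    for seq in sequences:
        # Arbitrary policy: append to whichever stream is currently shortest.
        n = min(range(num_streams), key=lambda i: len(streams[i]))
        streams[n].extend(seq)
        conts[n].extend([0] + [1] * (len(seq) - 1))  # 0 marks a new sequence
    rows = range(num_timesteps)
    data = [[streams[n][t] for n in range(num_streams)] for t in rows]
    delta = [[conts[n][t] for n in range(num_streams)] for t in rows]
    return data, delta


def run_stream(xs, deltas):
    """Toy recurrence h_t = x_t + delta_t * h_{t-1} for a single stream."""
    h, out = 0.0, []
    for x, d in zip(xs, deltas):
        h = x + d * h  # delta == 0 wipes the state carried from t-1
        out.append(h)
    return out


sequences = [[1, 2, 3], [4, 5], [6, 7], [8]]
data, delta = pack_streams(sequences, num_streams=2, num_timesteps=4)
flat = [x for row in data for x in row]  # row-major T x N memory order
stream_0 = run_stream([row[0] for row in data], [row[0] for row in delta])
```

Here stream 0 holds the sequences [1, 2, 3] and [8] back to back, and stream 1 holds [4, 5] and [6, 7], so neither stream needs padding even though the sequence lengths differ.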