Add memory data layer to pass data directly into the network #196
Conversation
Passing input in memory is conveniently general purpose, and I could see its use for camera input or even realtime operation. A memory input path should be helpful in combination with the data processing layers of #148 and the batch size adapting in #195. If we adopt this data layer, should we not drop the `input` fields? The only question is whether the current…
Realtime operations usually use multi-threading and therefore need a thread-safe data buffer behind the scenes. Intel Threading Building Blocks has suitable concurrent containers, but it is a heavyweight dependency. The `input` field is not very straightforward to use. The different look of the deployment prototxts makes users wonder why they don't have a data layer like the train and test model definitions. But it should be kept for backward compatibility; it is up to the users to choose which method to use.
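To make the thread-safety concern concrete, here is a minimal sketch of a mutex-guarded blocking queue built only on the standard library, as a lightweight alternative to TBB's concurrent containers. This class is illustrative only and is not part of this PR:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical thread-safe buffer for realtime data feeding: a bounded,
// mutex-guarded queue of data chunks. Producers block when full,
// consumers block when empty.
class BlockingQueue {
 public:
  explicit BlockingQueue(size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is full, then enqueues one chunk.
  void Push(std::vector<float> chunk) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
    queue_.push_back(std::move(chunk));
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty, then dequeues one chunk.
  std::vector<float> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty(); });
    std::vector<float> chunk = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return chunk;
  }

 private:
  const size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
  std::deque<std::vector<float>> queue_;
};
```

A camera thread could `Push` frames while the network thread `Pop`s batches, avoiding any shared mutable blob.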
Agreed about the confusion over the `input` fields. Having data layers throughout, for every phase and purpose, will be clearer. Please update the example model definitions accordingly (imagenet.prototxt, lenet.prototxt, etc.). I'll look forward to merging if this is done in a way that is simpler than the `input` fields. Realtime operation and its input needs can naturally wait for their own PR.
template <typename Dtype>
void MemoryDataLayer<Dtype>::SetUp(const vector<Blob<Dtype>*>& bottom,
    vector<Blob<Dtype>*>* top) {
  CHECK_EQ(bottom.size(), num_data_blobs_) <<
@kloudkl The idea behind data layers is that they don't take any bottom blobs; they just provide top blobs of data for the next layers.
Your current implementation only copies a bottom blob into the top blob, but which layer defines the bottom blob? Which part of the code will be in charge of copying the data in there?
@kloudkl I think having a memory data layer is a good idea, but the current design and implementation are flawed.
Seconding @sguada: it will be nice to replace the potentially confusing `input` fields with a memory data layer, but this layer must keep compatibility with the wrappers / update the wrappers to use the new layer.
In general this seems to be coercing data preprocessing into a layer, which I do not quite support: the memory data layer expects a blob anyway, so it does not reduce complexity at all. I think a better way is to write utility functions under util/ that enable different kinds of data preprocessing, such as OpenCVImageToBlob() and RawMemoryChunkToBlob().
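A sketch of what such a `util/` conversion function might look like. The `Blob` struct below is a deliberately minimal stand-in for `caffe::Blob` (the real one manages CPU/GPU memory via SyncedMemory), and the signature is hypothetical, not an API this PR defines:

```cpp
#include <cstring>
#include <vector>

// Stand-in for caffe::Blob, just enough to illustrate the utility-function
// idea proposed above.
struct Blob {
  int num, channels, height, width;
  std::vector<float> data;
  void Reshape(int n, int c, int h, int w) {
    num = n; channels = c; height = h; width = w;
    data.resize(static_cast<size_t>(n) * c * h * w);
  }
};

// Hypothetical util/ function: wrap a raw memory chunk (assumed to be
// row-major N x C x H x W floats) into a Blob by copying.
void RawMemoryChunkToBlob(const float* chunk, int n, int c, int h, int w,
                          Blob* blob) {
  blob->Reshape(n, c, h, w);
  std::memcpy(blob->data.data(), chunk, blob->data.size() * sizeof(float));
}
```

Keeping conversion in free functions like this means any caller, wrapper or layer, can produce blobs without a dedicated layer type.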
@Yangqing I also think data preprocessing should be separated from the memory data layer; it could be a new type of layer, or part of all data layers. I think it will be cleaner as a separate layer, but that will require some extra memory.
@sguada I agree. In my mind a layer does two things: (1) it takes in a set of blobs, and (2) it spits out a set of other blobs (or no blobs). A memory data layer, as I expect it, would load a dataset into memory (the dataset still specified as some sort of on-disk file) and then spit out blobs directly from memory. In this sense it would not be much different from DataLayer; the only difference is where the data comes from (our other working direction of separating data reading and data preprocessing fits this purpose nicely). I feel that having the `input` fields is fine: they tell us what data size the net expects, which needs to be specified somewhere anyway, and putting it at the top of the protobuf makes it easy to examine.
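The top-only data layer described in this thread can be sketched as follows. All names and the `Reset`/`Forward` interface here are illustrative assumptions, not the PR's actual code: the caller points the layer at in-memory samples, and each `Forward` emits one batch as the top with no bottom blobs involved:

```cpp
#include <algorithm>
#include <vector>

// Sketch of a top-only memory data layer: no bottom blobs; the caller
// hands the layer a pointer to in-memory data, and Forward copies the
// next batch into the top, wrapping around like a DataLayer epoch.
class MemoryDataLayerSketch {
 public:
  MemoryDataLayerSketch(int batch_size, int dim)
      : batch_size_(batch_size), dim_(dim), data_(nullptr), pos_(0), n_(0) {}

  // Point the layer at a chunk of n samples living in user memory.
  void Reset(const float* data, int n) { data_ = data; n_ = n; pos_ = 0; }

  // Produce one batch as the "top" blob (flattened here for brevity).
  void Forward(std::vector<float>* top) {
    top->resize(static_cast<size_t>(batch_size_) * dim_);
    for (int i = 0; i < batch_size_; ++i) {
      const float* src = data_ + static_cast<size_t>(pos_) * dim_;
      std::copy(src, src + dim_, top->begin() + static_cast<size_t>(i) * dim_);
      pos_ = (pos_ + 1) % n_;  // wrap around the in-memory dataset
    }
  }

 private:
  int batch_size_, dim_;
  const float* data_;
  int pos_, n_;
};
```

Note the layer never reads a bottom blob, matching the data-layer contract @sguada describes above.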
@Yangqing, I think an input data layer that only provides a top but exposes
On Fri, Mar 14, 2014 at 4:51 PM, Yangqing Jia notifications@github.com wrote:
This is exactly an attempt to implement the simplest possible data layer consistent with #148, whose discussion is hard to push forward without a concrete example. Yes, the intention is to separate reusable data I/O, format handling, and preprocessing from the data layers. The input blobs come from some type of DataSource, which does not necessarily have to be a layer, and there also needs to be a DataSink for the output blobs (#213). The updated example protos show that the data layers can also specify the data sizes they expect. It is tempting to directly assign data to the top blobs, which are not constant, but then you take the risk of conflicts caused by multiple writing threads or processes.
Closing this, which is continued by #251.
As stated in the big umbrella issue #148, the current version of DataLayer is bound to a specific data source and data format. Ideally, data from any external source and in any format should be able to stream through the network. This PR provides an in-memory gateway for that purpose.
Although the memory data layer could be straightforwardly integrated with #120 to enable feeding raw images in memory into the network, it simply copies memory blobs at the moment. If anyone is interested in such a design, I will further refine and extend it.