Add a layer for in-memory datasets, and expose it to Python #294
Conversation
I can't wait for this data layer!
The raw pointer is fine. […]
Nice.
To me, the ideal interface would be to simply assign to […]
Accept both. The 4d will be important for multilabel and regression inputs and we should sort it out now instead of bolting it on later. However, a 1d array is convenient, like you said, so why not accept a 1d arg and upgrade it to a 4d array?
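A minimal sketch of the 1d-to-4d upgrade being proposed, in numpy; the helper name and exact policy here are illustrative, not code from the PR:

```python
import numpy as np

def upgrade_labels(labels):
    """Upgrade a 1d label array to the 4d N x 1 x 1 x 1 shape Caffe blobs expect.

    Hypothetical helper; the PR performs an equivalent upgrade inside pycaffe.
    """
    labels = np.asarray(labels, dtype=np.float32)
    if labels.ndim == 1:
        # One label per item: reshape to N x 1 x 1 x 1 without copying data.
        labels = labels.reshape(labels.shape[0], 1, 1, 1)
    if labels.ndim != 4:
        raise ValueError('labels must be 1d or 4d')
    return labels
```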
Tests are a must.
Don't worry about using a Blob to store all the data. The original SyncedMem and thus Blob do not allocate any memory until it is used. #250 also strives to be as lazy as possible.
Re: data length, everything's fine. No copy is worth it, and don't worry.
Needs rebase + tests, then I'll review and merge. Thanks @longjon.
Wow, this is awesome, I was looking to implement something similar to this for my project, but I'm glad I saw this first! Thanks @longjon! In my project my outputs aren't labels, so they don't have the required Nx1x1x1 dimensions. There's nothing inherently preventing me from getting this to work with non-label outputs, correct? If I modified the […]? I only recently started getting familiar with the caffe source, so forgive me for my ignorance. Just wanted to make sure I wasn't missing anything fundamental!
@jlreyes yes, everything should work the way you describe. You'll also need to modify the checks in the Python wrapper if you'll be using that. We could consider generalizing this layer to take an arbitrary specification of any number of input blobs of various sizes. That actually feels more natural to me than the way it's done now, but I'll probably not make those changes for this PR. (This PR is almost ready for review, I just need to chase down a possible crash.)
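For reference, a hypothetical sketch of the kind of dimension checks the Python wrapper performs on input arrays; the function name and exact conditions are illustrative, not lifted from the PR:

```python
import numpy as np

def check_input_arrays(data, labels):
    """Illustrative version of pycaffe's input checks (not the PR's exact code)."""
    if data.dtype != np.float32 or labels.dtype != np.float32:
        raise TypeError('data and labels must be float32')
    if data.ndim != 4:
        raise ValueError('data must be 4d (N x C x H x W)')
    if labels.ndim != 4 or labels.shape[1:] != (1, 1, 1):
        # This is the Nx1x1x1 requirement that would need relaxing
        # for non-label outputs like @jlreyes describes.
        raise ValueError('labels must be N x 1 x 1 x 1')
    if data.shape[0] != labels.shape[0]:
        raise ValueError('data and labels must have equal numbers of items')
```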
Bug fixed, code rebased, history rewritten, basic tests added, 1d label arrays accepted, all tests and lint pass, ready for review and merge.
For some reason these changes expose a lint error that […]
```diff
@@ -325,6 +325,40 @@ class EltwiseProductLayer : public Layer<Dtype> {
 };

 template <typename Dtype>
 class MemoryDataLayer : public Layer<Dtype> {
```
sorry for the nitpick, but can you put this class in the right place in the file alphabetically?
Fixed. All nitpicks are welcome. I missed this because of the double-alphabetical order of the file.
Please rebase for a clean merge. I'd love to include this soon!
This allows a blob to be updated, without a copy, to use already-existing memory (and will support MemoryDataLayer).
This will facilitate input size checking for pycaffe (and potentially others).
Doing this, rather than constructing the CaffeNet wrapper every time, will allow the wrapper to hold references that last at least as long as SGDSolver (which will be necessary to ensure that data used by MemoryDataLayer doesn't get freed).
This requires a net whose first layer is a MemoryDataLayer.
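As context, a sketch of a minimal net definition whose first layer is a MemoryDataLayer, written out from Python. The field names follow MemoryDataParameter in caffe.proto, but the exact syntax and the file name are illustrative, not taken from this PR:

```python
# Illustrative: a minimal net whose first layer is a MEMORY_DATA layer.
net_def = """name: "memory_net"
layers {
  name: "data"
  type: MEMORY_DATA
  top: "data"
  top: "label"
  memory_data_param {
    batch_size: 64
    channels: 3
    height: 32
    width: 32
  }
}
"""
with open('memory_net.prototxt', 'w') as f:
    f.write(net_def)
```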
Rebased and ready to go.
Add a layer for in-memory data, and expose it to Python
Thanks Jon!
(This PR is related to but distinct from #196. Together with #286 it addresses #135.)
This PR adds a new layer called `MemoryDataLayer` that accepts two contiguous blocks of memory (for data and labels) and (in `Forward`) updates the top blobs to walk along the provided memory.

Notes:

- Raw pointers are given to `MemoryDataLayer`; should `shared_ptr`s be used instead?
- `Blob` and `SyncedMemory` both get a `set_cpu_data` method to allow them to point to memory owned by someone else.
- `Net.set_input_arrays` is added to pycaffe for telling the `MemoryDataLayer` to point to ndarray data.
- Tests don't yet cover `MemoryDataLayer` (perhaps merging should wait until they do?).

When combined with #286, training in Python with ndarray data is now possible, e.g., as sketched below.