Add a layer for in-memory datasets, and expose it to Python #294

Merged
merged 9 commits into BVLC:dev from longjon:memory-data-layer on May 2, 2014

Conversation

longjon
Contributor

@longjon longjon commented Apr 5, 2014

(This PR is related to but distinct from #196. Together with #286 it addresses #135.)

This PR adds a new layer called MemoryDataLayer that accepts two contiguous blocks of memory (for data and labels) and (in Forward) updates the top blobs to walk along the provided memory.

  • Raw pointers are used by MemoryDataLayer; should shared_ptrs be used instead?
  • Blob and SyncedMemory both get a set_cpu_data method to allow them to point to memory owned by someone else
  • As a result, no copying is performed
  • As a result, input data length must be a multiple of batch size
  • A special method, Net.set_input_arrays, is added to pycaffe for telling the MemoryDataLayer to point to ndarray data
  • Labels are passed into this method as a 4d array for uniformity (but could be a 1d array for convenience?)
  • Tests do not yet exist for MemoryDataLayer (perhaps merging should wait until they do?)

When combined with #286, training in Python with ndarray data is now possible, e.g.,

solver = caffe.SGDSolver('solver.prototxt')
solver.net.set_input_arrays(data, labels)
solver.solve()
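
For concreteness, here is a minimal sketch of how the input arrays might be prepared (the float32 dtype, C-contiguity, and the Nx1x1x1 label shape are assumptions about what the pycaffe checks will expect; the shapes and filenames are placeholders):

import numpy as np
import caffe

batch_size = 64        # must match the MemoryDataLayer's batch size
n = 10 * batch_size    # total number of examples: a multiple of the batch size

# data is a contiguous NxCxHxW float32 array; labels are Nx1x1x1
data = np.random.randn(n, 3, 32, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=(n, 1, 1, 1)).astype(np.float32)

solver = caffe.SGDSolver('solver.prototxt')
solver.net.set_input_arrays(data, labels)
solver.solve()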

@shelhamer
Member

I can't wait for this data layer!

  • Raw pointers are used by MemoryDataLayer; should shared_ptrs be used instead?

The raw pointer is fine. A shared_ptr would only be needed if the ndarray or MemoryDataLayer wanted to keep ownership after one or the other went away... and that shouldn't come up.

  • Blob and SyncedMemory both get a set_cpu_data method to allow them to point to memory owned by someone else
  • As a result, no copying is performed

Nice.

  • As a result, input data length must be a multiple of batch size

It'd be nice to explicitly check this, but better to pad with zeros if need be. One would want to politely return only the non-zero outputs on the last Forward(), but perhaps I'll take care of that in my #291.

  • A special method, Net.set_input_arrays is added to pycaffe for telling the MemoryDataLayer to point to ndarray data

To me, the ideal interface would be to simply assign to net.blobs['input'].data, but I like how you've done it here and it's worth a special-case method.

  • Labels are passed into this method as a 4d array for uniformity (but could be a 1d array for convenience?)

Accept both. The 4d will be important for multilabel and regression inputs and we should sort it out now instead of bolting it on later. However, a 1d array is convenient, like you said, so why not accept a 1d arg and upgrade it to a 4d array?
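
A minimal sketch of that kind of upgrade on the Python side (the helper name is hypothetical; the real check/reshape would live in the wrapper):

import numpy as np

def as_4d_labels(labels):
    # Accept a 1d label vector and upgrade it to the Nx1x1x1 layout.
    labels = np.asarray(labels, dtype=np.float32)
    if labels.ndim == 1:
        labels = labels.reshape(-1, 1, 1, 1)
    return np.ascontiguousarray(labels)

labels_4d = as_4d_labels(np.array([3, 1, 4, 1]))   # shape (4, 1, 1, 1)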

  • Tests do not yet exist for MemoryDataLayer (perhaps merging should wait until they do?)

Tests are a must.

@longjon
Contributor Author

longjon commented Apr 5, 2014

  • Right, shared_ptr is not a Python concern; on the Python side ownership is handled by boost's reference counting, by storing boost::python::objects. One might in the future want to use the MemoryDataLayer from C++, and then shared_ptr might be the way to go; but we can (and probably should) postpone that decision until the case arises.
  • Data length is checked to be a multiple of the batch size. Zero padding is not so straightforward; one does not want to train on zero inputs with zero labels! Perhaps partial-batch training could be worked out, but that's not so straightforward either; Backward passes compute gradients per batch. Maybe #250 (making SyncedMem and Blob able to increase capacity without deallocation and reallocation) will help, maybe not. Or we can give up on no-copy.
  • I considered just having an input blob that stores all the data. But this double-allocates for the diff, which isn't needed.
  • I'll add accepting 1d labels. (There's actually no extra work to do, since the memory layout is the same; see the sketch after this list.)
  • Tests will come in the next couple days.
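
A quick numpy illustration of the "same memory layout" point (independent of Caffe itself): reshaping a 1d label vector to Nx1x1x1 yields a view over the same contiguous buffer, so no copy is involved.

import numpy as np

labels_1d = np.arange(8, dtype=np.float32)
labels_4d = labels_1d.reshape(-1, 1, 1, 1)

# The reshape is a view: both arrays share one contiguous buffer,
# so accepting 1d labels costs nothing extra.
assert labels_4d.base is labels_1d
assert labels_1d.ctypes.data == labels_4d.ctypes.data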

@kloudkl
Contributor

kloudkl commented Apr 5, 2014

There is no reason to fix the batch size. #250 is paving the way for #195.

@kloudkl
Contributor

kloudkl commented Apr 5, 2014

Don't worry about using a Blob to store all the data. The original SyncedMem and thus Blob do not allocate any memory until it is used. #250 also strives to be as lazy as possible.

@shelhamer
Member

Re: data length, everything's fine. No copy is worth it, and don't worry about partial batches. I wasn't thinking. (Sorry I didn't check to see it was already checked.)


@shelhamer shelhamer self-assigned this Apr 9, 2014
@shelhamer
Member

Needs rebase + tests, then I'll review and merge. Thanks @longjon.

@jlreyes

jlreyes commented Apr 17, 2014

Wow, this is awesome. I was looking to implement something similar for my project, but I'm glad I saw this first! Thanks @longjon!

In my project my outputs aren't labels, so they don't have the required Nx1x1x1 dimensions. There's nothing inherently preventing me from getting this to work with non-label outputs, correct?

If I modified the MemoryDataParameter to take in both input and output dimension information, modified the label dimension checks, and then modified line 42 in memory_data_layer.cpp appropriately, everything should work?

I only recently started getting familiar with the caffe source, so forgive me for my ignorance. Just wanted to make sure I wasn't missing anything fundamental!

@longjon
Contributor Author

longjon commented Apr 18, 2014

@jlreyes yes, everything should work the way you describe. You'll also need to modify the checks in the Python wrapper if you'll be using that.

We could consider generalizing this layer to take an arbitrary specification of any number of input blobs of various sizes. That actually feels more natural to me than the way it's done now, but I'll probably not make those changes for this PR.

(This PR is almost ready for review, I just need to chase down a possible crash.)

@longjon
Contributor Author

longjon commented Apr 25, 2014

Bug fixed, code rebased, history rewritten, basic tests added, 1d label arrays accepted, all tests and lint pass, ready for review and merge.

@longjon
Contributor Author

longjon commented Apr 25, 2014

For some reason these changes expose a lint error that make lint was silent about on dev. The last commit fixes this.

@@ -325,6 +325,40 @@ class EltwiseProductLayer : public Layer<Dtype> {
};

template <typename Dtype>
class MemoryDataLayer : public Layer<Dtype> {
Contributor

Sorry for the nitpick, but can you put this class in the right place in the file alphabetically?

Contributor Author

Fixed. All nitpicks are welcome. I missed this because of the double-alphabetical order of the file.

@shelhamer shelhamer assigned jeffdonahue and unassigned shelhamer Apr 25, 2014
@shelhamer
Member

Please rebase for a clean merge. I'd love to include this soon!

  • This allows a blob to be updated without copy to use already existing memory (and will support MemoryDataLayer).
  • This will facilitate input size checking for pycaffe (and potentially others).
  • Doing this, rather than constructing the CaffeNet wrapper every time, will allow the wrapper to hold references that last at least as long as SGDSolver (which will be necessary to ensure that data used by MemoryDataLayer doesn't get freed).
  • This requires a net whose first layer is a MemoryDataLayer.
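
For reference, a net whose first layer is a MemoryDataLayer might be declared roughly as follows in the prototxt syntax of that era (a hedged sketch: the dimensions are placeholders, and the field names are assumed from MemoryDataParameter):

layers {
  name: "data"
  type: MEMORY_DATA
  top: "data"
  top: "label"
  memory_data_param {
    batch_size: 64
    channels: 3
    height: 32
    width: 32
  }
}
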
@longjon
Contributor Author

longjon commented May 2, 2014

Rebased and ready-to-go.

@shelhamer shelhamer assigned shelhamer and unassigned jeffdonahue May 2, 2014
shelhamer added a commit that referenced this pull request May 2, 2014
Add a layer for in-memory data, and expose it to Python
@shelhamer shelhamer merged commit c4146fe into BVLC:dev May 2, 2014
@shelhamer
Member

Thanks Jon!

@shelhamer shelhamer mentioned this pull request May 20, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Add a layer for in-memory data, and expose it to Python
@longjon longjon deleted the memory-data-layer branch December 30, 2014 04:59