
boost::python vs. cython and Python interface preprocessing profiling and improvement #941

Closed
kloudkl opened this issue Aug 17, 2014 · 27 comments

Comments

@kloudkl
Contributor

kloudkl commented Aug 17, 2014

Boost.Python is too slow for a practical application system. It takes several hours to extract features for only ten thousand images using the current pycaffe interface. There shouldn't be such a huge performance gap between the C++ version and the other languages. According to some benchmark results, Cython can be much faster than Boost.Python. If Caffe wants users to frequently use the concise Python interface in their daily experiments, the binding technology should be changed to Cython.

Simple benchmark between Cython and Boost.Python
http://blog.chrischou.org/2010/02/28/simple-benchmark-between-cython-and-boost-python/

C++ wrapper benchmark: Cython, PyBindGen, Boost
https://groups.google.com/forum/#!topic/cython-users/lQO9lGj5JEc

[Stackless] [C++-sig] [Boost] Trouble optimizing Boost.Python integration for game development (it seems too slow)
http://www.stackless.com/pipermail/stackless/2009-August/004249.html

Python vs. Cython vs. D (PyD) vs. C++ (SWIG)
http://prabhuramachandran.blogspot.com/2008/09/python-vs-cython-vs-d-pyd-vs-c-swig.html

https://github.com/cython/cython/wiki/SWIG

@shelhamer
Member

Regarding the performance of the Python interface I'm much more suspicious
of the preprocessing code (which I am guilty for) than anything having to
do with Boost Python. It's been lackluster in my profiling.

The wrapper preprocessing is not only slow but error-prone, since
everything has to match the training configuration. I think the solution is
to add an input layer to replace the oddball "input" proto fields and
encapsulate the preprocessing into a DataTransformer class that all the
data layers can invoke during prefetch. It wouldn't be a separate layer to
avoid time and memory overhead but at least the redundant code would be cut
out. It could be configured in the prototxt in a similar way to FillerParam.

To the point about Cython vs. Boost Python, we would need a benchmark.
Perhaps a simple start of loading a net and calling forward would let us
compare.

I've worked with Cython in the past but actually thought Boost Python would
be faster since it doesn't autogenerate so much code. I thought the
trade-off was that Boost Python is slower to compile but runs faster. Only
profiling can tell.


@kloudkl
Contributor Author

kloudkl commented Aug 17, 2014

If the trouble is indeed caused by the data preprocessing, a temporary solution could be a wrapper over the ImageDataLayer that directly transforms the inputs from memory (#251).

@kloudkl
Contributor Author

kloudkl commented Aug 17, 2014

@GeenuX, you did excellent work in #710. Are you interested in getting rid of the redundant data transformations in the data layers?

@bhack
Contributor

bhack commented Aug 17, 2014

I don't know if we want to explore numba

@bhack
Contributor

bhack commented Aug 17, 2014

I think if we want to handle matrices directly in Python, numba typed arrays and Cython memoryviews are the fastest way to do it.
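To illustrate the point, here is a small, hedged sketch (not Caffe code) contrasting element-by-element access through Python objects with a single operation on contiguous typed memory via NumPy; numba and Cython memoryviews get their speed from the same kind of typed, contiguous access.

```python
import time
import numpy as np

def subtract_mean_python(img, mean):
    """Element-by-element mean subtraction over nested Python lists."""
    out = [[0.0] * len(img[0]) for _ in range(len(img))]
    for i in range(len(img)):
        for j in range(len(img[0])):
            out[i][j] = img[i][j] - mean
    return out

def subtract_mean_numpy(img, mean):
    """Same operation on a contiguous typed array."""
    return img - mean

img = np.random.rand(256, 256).astype(np.float32)
img_list = img.tolist()

t0 = time.perf_counter()
subtract_mean_python(img_list, 0.5)
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
subtract_mean_numpy(img, 0.5)
t_np = time.perf_counter() - t0

print(f"python loop: {t_py:.4f}s, numpy: {t_np:.6f}s")
```

The gap is typically one to two orders of magnitude; typed compilation (numba, Cython) closes the loop-based version toward the vectorized one.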

@shelhamer
Member

The interfaces to the library should be as thin as possible as long as the
wrapper code stays concise. We have optimized and continue to optimize
Caffe, and ideally we'll not repeat all our efforts in our interfaces but
rely on the core library instead.

However, if some part of the Python wrapper is found to be too slow then
numba etc. should be considered.


@longjon
Contributor

longjon commented Aug 18, 2014

Boost::Python is not a bottleneck in the wrapper.

(That aside, we can debate whether Boost::Python or Cython are better wrapper languages for Caffe; I've personally found Cython to be a bit nicer to work with in the past.)

In Python:

$ ipython --no-banner

In [1]: import caffe

In [2]: net = caffe.Net('examples/imagenet/imagenet_deploy.prototxt')
[...]

In [3]: net.set_mode_gpu()

In [4]: %timeit net.forward()
10 loops, best of 3: 32.2 ms per loop

In C++:

$ build/tools/caffe time -model examples/imagenet/imagenet_deploy.prototxt -gpu 0
[...]
I0817 21:29:02.468904  8182 caffe.cpp:175] *** Benchmark begins ***
I0817 21:29:02.468911  8182 caffe.cpp:176] Testing for 50 iterations.
[...]
I0817 21:29:04.144522  8182 caffe.cpp:191] Forward pass: 1644.23 milli seconds.
[...]
I0817 21:29:05.670099  8182 caffe.cpp:209] *** Benchmark ends ***

Note that 1644.23 / 50 = 32.88. There is no noticeable cost to using the Python wrapper, at least through the Net.forward interface, and therefore no noticeable cost from Boost::Python.

These tests were done on a GTX 770. Note that these speeds correspond to ~3ms/image, so feature extraction for ten thousand images should take less than a minute.

Any additional cost comes from preprocessing, which can be slow (especially image resizing). Where possible, you may want to externally resize images ahead of time; note that this is the standard procedure for ImageNet training, for example.
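A hedged sketch of that advice: resize every image once, up front, rather than inside the per-image extraction loop. The nearest-neighbor resize below is an illustrative stand-in for whatever PIL/OpenCV resizing a real pipeline would use.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize; a stand-in for PIL/OpenCV resizing."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

# Resize the whole batch once, ahead of time (names here are illustrative),
# so the per-image feature-extraction loop pays no resizing cost.
images = [np.random.rand(480, 640).astype(np.float32) for _ in range(4)]
resized = [resize_nearest(im, 256, 256) for im in images]
print(resized[0].shape)  # (256, 256)
```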

It might be informative to look at a profile for your application, if it's using the latest Python wrapper. Recent changes should have improved preprocessing speed, but if an unreasonable bottleneck still exists, it would be nice to know where it is (even though it may go away following the plan described by @shelhamer above).
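For anyone wanting to produce such a profile, a minimal sketch with the standard library's cProfile follows; `preprocess` here is a hypothetical stand-in for pycaffe's actual preprocessing, not Caffe code.

```python
import cProfile
import io
import pstats
import numpy as np

def preprocess(img, mean):
    """Stand-in preprocessing: mean subtraction plus a 227x227 crop."""
    img = img.astype(np.float32)
    img -= mean
    return img[14:241, 14:241]

def run():
    mean = np.zeros((256, 256), dtype=np.float32)
    for _ in range(100):
        preprocess(np.random.rand(256, 256), mean)

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Pointing the same pattern at a real extraction loop would show whether time goes to the forward pass, resizing, or Python-side copying.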

@arntanguy
Contributor

@kloudkl @shelhamer I might have a bit of time to spare on taking care of the data redundancy in the data layers. You guys seem to have a pretty good idea about what should be done, so if you could share more details about how it ought to be approached, that'd be nice. I suppose the DataTransformer would have to take in the raw Datum, and apply the necessary transformations as described by a new structure DataTransformerParameter in the prototxt, and write the transformed data directly to the top blob (to avoid needless copies).

This would at least remove redundancy in Image and Data layers (and also my own input layer for Siamese networks). I haven't had a proper look at the other data layers so far.

Also, I have no experience whatsoever with the Python wrappers, so I'm not sure how it would all fit together. Let me know if there are specific points I should be wary of.
As I am not sure in what context this new class would have to be used, besides the already existing data layers, I'll wait for your input before starting any development.

@bhack
Contributor

bhack commented Aug 18, 2014

How is transformation related to augmentation? See #701

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

@GeenuX, your design is very straightforward. Although it is nice to wrap the DataTransformer to avoid data preprocessing in Python, it is more important to eliminate redundancy in the existing data layers.

There is no clear boundary between data augmentation and data preprocessing transformations. Common transformations are cropping, subtracting the mean image/value and resizing. Augmentations include rotation, translation, zooming (re-scaling), mirroring (flipping) and color perturbation etc.
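A minimal sketch of the preprocessing side (center crop plus mean subtraction); the DataTransformer class discussed in this thread would encapsulate logic like this, configured from the prototxt. Names and shapes here are illustrative.

```python
import numpy as np

def transform(img, mean_img, crop_size):
    """Center-crop `img` and subtract the matching region of the mean image."""
    h, w = img.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    cropped = img[top:top + crop_size, left:left + crop_size]
    mean_crop = mean_img[top:top + crop_size, left:left + crop_size]
    return cropped.astype(np.float32) - mean_crop

img = np.random.rand(256, 256).astype(np.float32)
mean_img = np.full((256, 256), 0.5, dtype=np.float32)
out = transform(img, mean_img, 227)
print(out.shape)  # (227, 227)
```

Augmentations (rotation, flipping, color jitter) would be extra, optionally random steps layered on the same pipeline.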

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

The real problem that I face is that the convolutional layers are extremely slow on the CPU. Surprisingly, this type of layer was in fact sped up 4.5 times several months ago [1].

[1] Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014.

@bhack
Contributor

bhack commented Aug 19, 2014

@kloudkl It seems that this work was done on Caffe. Are diffs/patches/forks available?

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

There is no publicly available implementation of the CPU algorithm proposed in the paper. But their optimization is orthogonal to hardware-specific acceleration methods. The sample of separable convolution provided by the CUDA SDK might help [1].

[1] Victor Podlozhnyuk. Image Convolution with CUDA. Nvidia Corporation, 2007.

@kloudkl kloudkl changed the title Boost.Python is slow, Cython is several times faster 2D convolution is slow, 1D approximations are several times faster Aug 19, 2014
@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

It seems that not too many changes are needed to implement the two-stage 1D convolutions. In the forward propagation, the original 2D kernel is replaced by two 1D kernels as shown below.

// Current 2D forward pass:
// First, im2col
im2col_cpu(bottom_data + bottom[i]->offset(n), channels_, height_,
    width_, kernel_h_, kernel_w_, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Second, inner product with groups
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., top_data + (*top)[i]->offset(n) + top_offset * g);
}
// Third, add bias
if (bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      N_, 1, (Dtype)1., this->blobs_[1]->cpu_data(),
      bias_multiplier_.cpu_data(),
      (Dtype)1., top_data + (*top)[i]->offset(n));
}

// Proposed two-stage 1D forward pass:
// First, im2col using the vertical kernel
im2col_cpu(bottom_data + bottom[i]->offset(n), channels_, height_,
    width_, kernel_h_, 1, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Second, inner product with groups using the vertical kernel
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, MID_N_, VERTICAL_K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., mid_data + (*mid)[i]->offset(n) + mid_offset * g);
}
// Third, add the vertical bias (over the intermediate output, hence MID_N_)
if (vertical_bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      MID_N_, 1, (Dtype)1., this->blobs_[1]->cpu_data(),
      vertical_bias_multiplier_.cpu_data(),
      (Dtype)1., mid_data + (*mid)[i]->offset(n));
}

// Fourth, im2col again using the horizontal kernel (maybe unnecessary)
im2col_cpu(mid_data + (*mid)[i]->offset(n), channels_, height_,
    width_, 1, kernel_w_, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Fifth, inner product with groups using the horizontal kernel
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, MID_M_, N_, HORIZONTAL_K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., top_data + (*top)[i]->offset(n) + top_offset * g);
}
// Sixth, add the horizontal bias (assuming it is stored in its own blob,
// e.g. this->blobs_[3])
if (horizontal_bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      N_, 1, (Dtype)1., this->blobs_[3]->cpu_data(),
      horizontal_bias_multiplier_.cpu_data(),
      (Dtype)1., top_data + (*top)[i]->offset(n));
}

The case for the backward propagation is very similar.

@shelhamer shelhamer changed the title 2D convolution is slow, 1D approximations are several times faster boost::python vs. cython and Python interface preprocessing profiling and improvement Aug 20, 2014
@shelhamer
Member

@kloudkl please don't change the subject in the middle of the thread. I've reverted the title to the Python interface and preprocessing speed.

For separable convolutions and the low-rank speedup conversation started in #941 (comment) please open another issue if it's about further development or on the mailing list if it's general discussion.

Note that Caffe can already do separable spatial filters as of #505. See the Sobel operator convolution test for an example: https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_convolution_layer.cpp#L174-L268. You can accomplish it without further coding by composing convolution layers: one for the horizontal and the other for the vertical.
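A hedged sketch of what that composition could look like in the prototxt (field names here follow the caffe.proto of that era, with the kernel_h/kernel_w fields from #505; exact names should be checked against the test linked above):

```
layers {
  name: "conv_vertical"
  type: CONVOLUTION
  bottom: "data"
  top: "conv_v"
  convolution_param {
    num_output: 1
    kernel_h: 3   # 3x1 vertical kernel
    kernel_w: 1
  }
}
layers {
  name: "conv_horizontal"
  type: CONVOLUTION
  bottom: "conv_v"
  top: "conv_h"
  convolution_param {
    num_output: 1
    kernel_h: 1   # 1x3 horizontal kernel
    kernel_w: 3
  }
}
```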

@shelhamer
Member

@GeenuX a PR to introduce the DataTransformer class, add a TransformationParameter in caffe.proto, and refactor the DataLayer and ImageDataLayer to work this way would be excellent!

Do not worry about the Python interface integration and the rest of the data layers. Their refactoring can follow once the infrastructure is in place and DataLayer has been rewritten to serve as an example. The DataTransformer class is intended for use with the current data layers. Further generalizations for augmentations are out of the scope of the initial development and can follow later.

@kloudkl
Contributor Author

kloudkl commented Aug 20, 2014

@shelhamer, thank you very much for the new powerful convolution layer!

Compared with the abundant features added in each release, there is far less API and tutorial documentation announcing all the possibilities and how to exploit them. Not all users are able to, or have the time to, follow all the code changes. Many of them just want to solve specific use cases quickly with the help of the docs. I will benchmark the separable convolution. If it is much faster than the 2D convolution, I will also write a tutorial.

@shelhamer
Member

@kloudkl thank you for the benchmarking and potential tutorial. I agree that there could be much more thorough API and tutorial documentation. Now that our recent improvements are in the latest release we will be turning our attention to catching up with documentation soon.

@bhack
Contributor

bhack commented Aug 20, 2014

How will this interact with #569?

@shelhamer
Member

Closing since Boost::Python incurs no overhead as seen in #941 (comment) and the preprocessing is on the way to simplification and efficiency with #954 and #963 and a further follow-up to expose data transformations to Python by the MemoryDataLayer through the DataTransformer class.

@nicodjimenez

@kloudkl you say there is no clear boundary between data augmentation and data preprocessing, but I think one should be careful when saying this. I have implemented on-the-fly data augmentation within Caffe by modifying datum attributes. I have found it very useful to perform different data augmentations that depend on the labels themselves. For example, for MNIST, one will want to be more careful about rotating "1"s, which may be confused with "7"s, than "3"s, whose labels are more invariant to rotation. I also don't want to augment the test data, so I add a datum attribute that tells me whether the data is test or train, and check this attribute before modifying the datum's attributes. Not pretty, but very fast and effective. This can probably be done within a data layer, but one might have to have separate data layers for testing and training data, unless I'm missing something.

This is just food for thought to anyone thinking of implementing on the fly augmentation as part of a data layer.
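The label-dependent augmentation idea above can be sketched as follows. The per-label angle table, the labels chosen, and the function names are all illustrative, not Caffe code; the one firm rule is that test data is never augmented.

```python
import numpy as np

# Hypothetical per-label rotation budgets: rotate "1"s and "7"s cautiously
# (they are easily confused), allow wider rotation for "3"s.
MAX_ROTATION_DEG = {1: 5.0, 7: 5.0, 3: 25.0}
DEFAULT_MAX_DEG = 15.0

def sample_rotation(label, is_train, rng):
    """Sample a rotation angle for one datum, keyed by its label."""
    if not is_train:
        return 0.0  # never augment test data
    limit = MAX_ROTATION_DEG.get(label, DEFAULT_MAX_DEG)
    return rng.uniform(-limit, limit)

rng = np.random.default_rng(0)
print(sample_rotation(1, True, rng), sample_rotation(3, False, rng))
```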

@bhack
Contributor

bhack commented Aug 28, 2014

@nicodjimenez @kloudkl This is closed. We could continue data augmentation discussion at #701

@alfredox10

I'm trying to implement an object detection python program that uses Caffe, and I'm trying to make it faster but I have not found any information on how to make it compatible with PyPy. Does anyone know how to achieve this?

@SlimeQ

SlimeQ commented Jul 22, 2015

bruh. it's just not compatible. if you're looking for speed, writing custom C++ functions and adding them to the wrapper is probably your best bet.

i'm really not trying to be dismissive here, i just spent a lot of time looking into pypy/caffe last week only to find that it's more complicated than it's worth. if you can find a way to pass data to caffe without using numpy, you might be able to get something going, but i'm not totally sure how that'd be done.

@alfredox10

Hey Slime! lol

I can pass the data to caffe without using numpy, but you said caffe isn't compatible with pypy either, so I'm not sure that would help though?

@SlimeQ

SlimeQ commented Jul 22, 2015

nah, you're thinking pycaffe. caffe is c++, it doesn't even know that pypy exists. the problem is that caffe needs a contiguous array of floats, and python lists are basically arrays of pointers to floats (or more accurately, pointers to doubles) so they need to be converted to contiguous memory to be operated on efficiently. typically numpy is used for this, but numPyPy is still in development and as such it is missing some functionality.
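The conversion being described can be sketched in two lines; under CPython this is the role numpy plays, and it is exactly what numPyPy would need to replicate.

```python
import numpy as np

# A Python list of floats is an array of pointers to boxed doubles;
# Caffe needs one contiguous block of float32 values.
pylist = [0.1, 0.2, 0.3, 0.4]

arr = np.ascontiguousarray(pylist, dtype=np.float32)
assert arr.flags['C_CONTIGUOUS']
print(arr.dtype, arr.nbytes)  # float32 16
```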

you can certainly try the latest numPyPy implementation but there's no guarantee that it will be faster or even work at all. that being said, please do let us know how that goes as i'm sure there are a lot of people around here wanting more speed out of caffe without getting down and dirty with C++.

in any case, this probably isn't the proper thread to discuss numPyPy ;)
