
boost::python vs. cython and Python interface preprocessing profiling and improvement #941

Closed
kloudkl opened this issue Aug 17, 2014 · 27 comments

Comments

@kloudkl
Contributor

kloudkl commented Aug 17, 2014

Boost.Python is too slow for a practical application system. It takes several hours to extract features for only ten thousand images using the current pycaffe interface. There shouldn't be such a huge performance gap between the C++ version and the other languages. According to some benchmark results, Cython can be much faster than Boost.Python. If Caffe wants users to frequently use the concise Python interface in their daily experiments, the binding technology should be changed to Cython.

Simple benchmark between Cython and Boost.Python
http://blog.chrischou.org/2010/02/28/simple-benchmark-between-cython-and-boost-python/

C++ wrapper benchmark: Cython, PyBindGen, Boost
https://groups.google.com/forum/#!topic/cython-users/lQO9lGj5JEc

[Stackless] [C++-sig] [Boost] Trouble optimizing Boost.Python integration for game development (it seems too slow)
http://www.stackless.com/pipermail/stackless/2009-August/004249.html

Python vs. Cython vs. D (PyD) vs. C++ (SWIG)
http://prabhuramachandran.blogspot.com/2008/09/python-vs-cython-vs-d-pyd-vs-c-swig.html

https://github.com/cython/cython/wiki/SWIG

@shelhamer
Member

Regarding the performance of the Python interface I'm much more suspicious
of the preprocessing code (which I am guilty for) than anything having to
do with Boost Python. It's been lackluster in my profiling.

The wrapper preprocessing is not only slow but error-prone, since
everything has to match the training configuration. I think the solution is
to add an input layer to replace the oddball "input" proto fields and
encapsulate the preprocessing into a DataTransformer class that all the
data layers can invoke during prefetch. It wouldn't be a separate layer to
avoid time and memory overhead but at least the redundant code would be cut
out. It could be configured in the prototxt in a similar way to FillerParam.

To the point about Cython vs. Boost Python, we would need a benchmark.
Perhaps a simple start of loading a net and calling forward would let us
compare.

I've worked with Cython in the past but actually thought Boost Python would
be faster since it doesn't autogenerate so much code. I thought the
trade-off was that Boost Python is slower to compile but runs faster. Only
profiling can tell.


@kloudkl
Contributor Author

kloudkl commented Aug 17, 2014

If the trouble is indeed caused by the data preprocessing, a temporary solution could be a wrapper over the ImageDataLayer that directly transforms the inputs from memory (#251).

@kloudkl
Contributor Author

kloudkl commented Aug 17, 2014

@GeenuX, you did excellent work in #710. Are you interested in getting rid of the redundant data transformations in the data layers?

@bhack
Contributor

bhack commented Aug 17, 2014

I don't know if we want to explore numba

@bhack
Contributor

bhack commented Aug 17, 2014

I think if we want to handle matrices directly in Python, numba typed arrays and Cython memoryviews are the fastest way to do it.
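To illustrate the point, here is a small, hedged sketch (not Caffe code) contrasting element-by-element access through Python objects with a single operation on contiguous typed memory via NumPy; numba and Cython memoryviews get their speed from the same kind of typed, contiguous access.

```python
import time
import numpy as np

def subtract_mean_python(img, mean):
    """Element-by-element mean subtraction over nested Python lists."""
    out = [[0.0] * len(img[0]) for _ in range(len(img))]
    for i in range(len(img)):
        for j in range(len(img[0])):
            out[i][j] = img[i][j] - mean
    return out

def subtract_mean_numpy(img, mean):
    """Same operation on a contiguous typed array."""
    return img - mean

img = np.random.rand(256, 256).astype(np.float32)
img_list = img.tolist()

t0 = time.perf_counter()
subtract_mean_python(img_list, 0.5)
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
subtract_mean_numpy(img, 0.5)
t_np = time.perf_counter() - t0

print(f"python loop: {t_py:.4f}s, numpy: {t_np:.6f}s")
```

The gap is typically one to two orders of magnitude; typed compilation (numba, Cython) closes the loop-based version toward the vectorized one.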

@shelhamer
Member

The interfaces to the library should be as thin as possible as long as the
wrapper code stays concise. We have optimized and continue to optimize
Caffe, and ideally we'll not repeat all our efforts in our interfaces but
rely on the core library instead.

However, if some part of the Python wrapper is found to be too slow then
numba etc. should be considered.


@longjon
Contributor

longjon commented Aug 18, 2014

Boost::Python is not a bottleneck in the wrapper.

(That aside, we can debate whether Boost::Python or Cython are better wrapper languages for Caffe; I've personally found Cython to be a bit nicer to work with in the past.)

In Python:

$ ipython --no-banner

In [1]: import caffe

In [2]: net = caffe.Net('examples/imagenet/imagenet_deploy.prototxt')
[...]

In [3]: net.set_mode_gpu()

In [4]: %timeit net.forward()
10 loops, best of 3: 32.2 ms per loop

In C++:

$ build/tools/caffe time -model examples/imagenet/imagenet_deploy.prototxt -gpu 0
[...]
I0817 21:29:02.468904  8182 caffe.cpp:175] *** Benchmark begins ***
I0817 21:29:02.468911  8182 caffe.cpp:176] Testing for 50 iterations.
[...]
I0817 21:29:04.144522  8182 caffe.cpp:191] Forward pass: 1644.23 milli seconds.
[...]
I0817 21:29:05.670099  8182 caffe.cpp:209] *** Benchmark ends ***

Note that 1644.23 / 50 = 32.88. There is no noticeable cost to using the Python wrapper, at least through the Net.forward interface, and therefore no noticeable cost from Boost::Python.

These tests were done on a GTX 770. Note that these speeds correspond to ~3ms/image, so feature extraction for ten thousand images should take less than a minute.

Any additional cost comes from preprocessing, which can be slow (especially image resizing). Where possible, you may want to externally resize images ahead of time; note that this is the standard procedure for ImageNet training, for example.
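A hedged sketch of that advice: resize every image once, up front, rather than inside the per-image extraction loop. The nearest-neighbor resize below is an illustrative stand-in for whatever PIL/OpenCV resizing a real pipeline would use.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize; a stand-in for PIL/OpenCV resizing."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

# Resize the whole batch once, ahead of time (names here are illustrative),
# so the per-image feature-extraction loop pays no resizing cost.
images = [np.random.rand(480, 640).astype(np.float32) for _ in range(4)]
resized = [resize_nearest(im, 256, 256) for im in images]
print(resized[0].shape)  # (256, 256)
```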

It might be informative to look at a profile for your application, if it's using the latest Python wrapper. Recent changes should have improved preprocessing speed, but if an unreasonable bottleneck still exists, it would be nice to know where it is (even though it may go away following the plan described by @shelhamer above).
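For anyone wanting to produce such a profile, a minimal sketch with the standard library's cProfile follows; `preprocess` here is a hypothetical stand-in for pycaffe's actual preprocessing, not Caffe code.

```python
import cProfile
import io
import pstats
import numpy as np

def preprocess(img, mean):
    """Stand-in preprocessing: mean subtraction plus a 227x227 crop."""
    img = img.astype(np.float32)
    img -= mean
    return img[14:241, 14:241]

def run():
    mean = np.zeros((256, 256), dtype=np.float32)
    for _ in range(100):
        preprocess(np.random.rand(256, 256), mean)

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Pointing the same pattern at a real extraction loop would show whether time goes to the forward pass, resizing, or Python-side copying.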

@arntanguy
Contributor

@kloudkl @shelhamer I might have a bit of time to spare on taking care of the data redundancy in the data layers. You guys seem to have a pretty good idea about what should be done, so if you could share more details about how it ought to be approached, that'd be nice. I suppose the DataTransformer would have to take in the raw Datum, and apply the necessary transformations as described by a new structure DataTransformerParameter in the prototxt, and write the transformed data directly to the top blob (to avoid needless copies).

This would at least remove redundancy in Image and Data layers (and also my own input layer for Siamese networks). I haven't had a proper look at the other data layers so far.

Also, I have no experience whatsoever with the Python wrappers, so I'm not sure how it would all fit together. Let me know if there are specific points I should be wary of.
As I am not sure in what context this new class would have to be used, besides the already existing data layers, I'll wait for your input before starting any development.

@bhack
Contributor

bhack commented Aug 18, 2014

How is transformation related to augmentation? See #701

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

@GeenuX, your design is very straightforward. Although it is nice to wrap the DataTransformer to avoid data preprocessing in Python, it is more important to eliminate redundancy in the existing data layers.

There is no clear boundary between data augmentation and data preprocessing transformations. Common transformations are cropping, subtracting the mean image/value and resizing. Augmentations include rotation, translation, zooming (re-scaling), mirroring (flipping) and color perturbation etc.
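A minimal sketch of the preprocessing side (center crop plus mean subtraction); the DataTransformer class discussed in this thread would encapsulate logic like this, configured from the prototxt. Names and shapes here are illustrative.

```python
import numpy as np

def transform(img, mean_img, crop_size):
    """Center-crop `img` and subtract the matching region of the mean image."""
    h, w = img.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    cropped = img[top:top + crop_size, left:left + crop_size]
    mean_crop = mean_img[top:top + crop_size, left:left + crop_size]
    return cropped.astype(np.float32) - mean_crop

img = np.random.rand(256, 256).astype(np.float32)
mean_img = np.full((256, 256), 0.5, dtype=np.float32)
out = transform(img, mean_img, 227)
print(out.shape)  # (227, 227)
```

Augmentations (rotation, flipping, color jitter) would be extra, optionally random steps layered on the same pipeline.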

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

The real problem that I face is that the convolutional layers are extremely slow on the CPU. Surprisingly, this type of layer was in fact sped up 4.5 times several months ago [1].

[1] Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014.

@bhack
Contributor

bhack commented Aug 19, 2014

@kloudkl It seems that this work was done on Caffe. Are diffs/patches/forks available?

@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

There is no publicly available implementation of the CPU algorithm proposed in the paper. But their optimization is orthogonal to hardware-specific acceleration methods. The sample of separable convolution provided by the CUDA SDK might help [1].

[1] Victor Podlozhnyuk. Image Convolution with CUDA. Nvidia Corporation, 2007.

@kloudkl kloudkl changed the title Boost.Python is slow, Cython is several times faster 2D convolution is slow, 1D approximations are several times faster Aug 19, 2014
@kloudkl
Contributor Author

kloudkl commented Aug 19, 2014

It seems that not too many changes are needed to implement the two-stage 1D convolutions. In the forward propagation, the original 2D kernel is replaced by two 1D kernels as shown below.

// Current 2D forward pass:
// First, im2col
im2col_cpu(bottom_data + bottom[i]->offset(n), channels_, height_,
    width_, kernel_h_, kernel_w_, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Second, inner product with groups
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., top_data + (*top)[i]->offset(n) + top_offset * g);
}
// Third, add bias
if (bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      N_, 1, (Dtype)1., this->blobs_[1]->cpu_data(),
      bias_multiplier_.cpu_data(),
      (Dtype)1., top_data + (*top)[i]->offset(n));
}

// Proposed two-stage 1D forward pass:
// First, im2col using the vertical kernel
im2col_cpu(bottom_data + bottom[i]->offset(n), channels_, height_,
    width_, kernel_h_, 1, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Second, inner product with groups using the vertical kernel
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, MID_N_, VERTICAL_K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., mid_data + (*mid)[i]->offset(n) + mid_offset * g);
}
// Third, add the vertical bias (over the intermediate output, hence MID_N_)
if (vertical_bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      MID_N_, 1, (Dtype)1., this->blobs_[1]->cpu_data(),
      vertical_bias_multiplier_.cpu_data(),
      (Dtype)1., mid_data + (*mid)[i]->offset(n));
}

// Fourth, im2col again using the horizontal kernel (maybe unnecessary)
im2col_cpu(mid_data + (*mid)[i]->offset(n), channels_, height_,
    width_, 1, kernel_w_, pad_h_, pad_w_, stride_h_, stride_w_,
    col_data);
// Fifth, inner product with groups using the horizontal kernel
for (int g = 0; g < group_; ++g) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, MID_M_, N_, HORIZONTAL_K_,
      (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
      (Dtype)0., top_data + (*top)[i]->offset(n) + top_offset * g);
}
// Sixth, add the horizontal bias (assuming it is stored in its own blob,
// e.g. this->blobs_[3])
if (horizontal_bias_term_) {
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_,
      N_, 1, (Dtype)1., this->blobs_[3]->cpu_data(),
      horizontal_bias_multiplier_.cpu_data(),
      (Dtype)1., top_data + (*top)[i]->offset(n));
}

The case for the backward propagation is very similar.

@shelhamer shelhamer changed the title 2D convolution is slow, 1D approximations are several times faster boost::python vs. cython and Python interface preprocessing profiling and improvement Aug 20, 2014
@shelhamer
Member

@kloudkl please don't change the subject in the middle of the thread. I've reverted the title to the Python interface and preprocessing speed.

For separable convolutions and the low-rank speedup conversation started in #941 (comment) please open another issue if it's about further development or on the mailing list if it's general discussion.

Note that Caffe can already do separable spatial filters as of #505. See the Sobel operator convolution test for an example: https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_convolution_layer.cpp#L174-L268. You can accomplish it without further coding by composing convolution layers: one for the horizontal and the other for the vertical.
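A hedged sketch of what that composition could look like in the prototxt (field names here follow the caffe.proto of that era, with the kernel_h/kernel_w fields from #505; exact names should be checked against the test linked above):

```
layers {
  name: "conv_vertical"
  type: CONVOLUTION
  bottom: "data"
  top: "conv_v"
  convolution_param {
    num_output: 1
    kernel_h: 3   # 3x1 vertical kernel
    kernel_w: 1
  }
}
layers {
  name: "conv_horizontal"
  type: CONVOLUTION
  bottom: "conv_v"
  top: "conv_h"
  convolution_param {
    num_output: 1
    kernel_h: 1   # 1x3 horizontal kernel
    kernel_w: 3
  }
}
```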

@shelhamer
Member

@GeenuX a PR to introduce the DataTransformer class, add a TransformationParameter in caffe.proto, and refactor the DataLayer and ImageDataLayer to work this way would be excellent!

Do not worry about the Python interface integration and the rest of the data layers. Their refactoring can follow once the infrastructure is in place and DataLayer has been rewritten to serve as an example. The DataTransformer class is intended for use with the current data layers. Further generalizations for augmentations are out of the scope of the initial development and can follow later.

@kloudkl
Contributor Author

kloudkl commented Aug 20, 2014

@shelhamer, thank you very much for the new powerful convolution layer!

Compared with the abundant features added in each release, there is far less API and tutorial documentation announcing all the possibilities and how to exploit them. Not all users are able to, or have the time to, follow all the code changes. Many of them just want to solve specific use cases quickly with the help of the docs. I will benchmark the separable convolution. If it is much faster than the 2D convolution, I will also write a tutorial.

@shelhamer
Member

@kloudkl thank you for the benchmarking and potential tutorial. I agree that there could be much more thorough API and tutorial documentation. Now that our recent improvements are in the latest release we will be turning our attention to catching up with documentation soon.

@bhack
Contributor

bhack commented Aug 20, 2014

How will this interact with #569?

@shelhamer
Member

Closing since Boost::Python incurs no overhead as seen in #941 (comment) and the preprocessing is on the way to simplification and efficiency with #954 and #963 and a further follow-up to expose data transformations to Python by the MemoryDataLayer through the DataTransformer class.

@nicodjimenez

@kloudkl you say there is no clear boundary between data augmentation and data preprocessing, but I think one should be careful when saying this. I have implemented on-the-fly data augmentation within Caffe by modifying datum attributes. I have found it very useful to perform different data augmentations that depend on the labels themselves. For example, for MNIST, one will want to be more careful about rotating "1"s, which may be confused with "7"s, than "3"s, whose labels are more invariant to rotation. I also don't want to augment the test data, so I add a datum attribute that tells me whether the data is test or train, and check this attribute before modifying the datum's attributes. Not pretty, but very fast and effective. This can probably be done within a data layer, but one might have to have separate data layers for testing and training data, unless I'm missing something.

This is just food for thought to anyone thinking of implementing on the fly augmentation as part of a data layer.
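The label-dependent augmentation idea above can be sketched as follows. The per-label angle table, the labels chosen, and the function names are all illustrative, not Caffe code; the one firm rule is that test data is never augmented.

```python
import numpy as np

# Hypothetical per-label rotation budgets: rotate "1"s and "7"s cautiously
# (they are easily confused), allow wider rotation for "3"s.
MAX_ROTATION_DEG = {1: 5.0, 7: 5.0, 3: 25.0}
DEFAULT_MAX_DEG = 15.0

def sample_rotation(label, is_train, rng):
    """Sample a rotation angle for one datum, keyed by its label."""
    if not is_train:
        return 0.0  # never augment test data
    limit = MAX_ROTATION_DEG.get(label, DEFAULT_MAX_DEG)
    return rng.uniform(-limit, limit)

rng = np.random.default_rng(0)
print(sample_rotation(1, True, rng), sample_rotation(3, False, rng))
```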

@bhack
Contributor

bhack commented Aug 28, 2014

@nicodjimenez @kloudkl This is closed. We could continue data augmentation discussion at #701

@alfredox10

I'm trying to implement an object detection python program that uses Caffe, and I'm trying to make it faster but I have not found any information on how to make it compatible with PyPy. Does anyone know how to achieve this?

@SlimeQ

SlimeQ commented Jul 22, 2015

bruh. it's just not compatible. if you're looking for speed, writing custom C++ functions and adding them to the wrapper is probably your best bet.

i'm really not trying to be dismissive here, i just spent a lot of time looking into pypy/caffe last week only to find that it's more complicated than it's worth. if you can find a way to pass data to caffe without using numpy, you might be able to get something going, but i'm not totally sure how that'd be done.

@alfredox10

Hey Slime! lol

I can pass the data to caffe without using numpy, but you said caffe isn't compatible with pypy either, so I'm not sure that would help though?

@SlimeQ

SlimeQ commented Jul 22, 2015

nah, you're thinking pycaffe. caffe is c++, it doesn't even know that pypy exists. the problem is that caffe needs a contiguous array of floats, and python lists are basically arrays of pointers to floats (or more accurately, pointers to doubles) so they need to be converted to contiguous memory to be operated on efficiently. typically numpy is used for this, but numPyPy is still in development and as such it is missing some functionality.
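The conversion being described can be sketched in two lines; under CPython this is the role numpy plays, and it is exactly what numPyPy would need to replicate.

```python
import numpy as np

# A Python list of floats is an array of pointers to boxed doubles;
# Caffe needs one contiguous block of float32 values.
pylist = [0.1, 0.2, 0.3, 0.4]

arr = np.ascontiguousarray(pylist, dtype=np.float32)
assert arr.flags['C_CONTIGUOUS']
print(arr.dtype, arr.nbytes)  # float32 16
```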

you can certainly try the latest numPyPy implementation but there's no guarantee that it will be faster or even work at all. that being said, please do let us know how that goes as i'm sure there are a lot of people around here wanting more speed out of caffe without getting down and dirty with C++.

in any case, this probably isn't the proper thread to discuss numPyPy ;)
