boost::python vs. cython and Python interface preprocessing profiling and improvement #941
Regarding the performance of the Python interface, I'm much more suspicious of the preprocessing than of the binding itself. The wrapper preprocessing is not only slow but error-prone. To the point about Cython vs. Boost Python, we would need a benchmark. I've worked with Cython in the past but actually thought Boost Python would be fine for our purposes.

Evan Shelhamer
PyPy claims to be faster than Cython; see "Accelerating your Python application: Cython and PyPy". Unfortunately, it is only suitable for pure Python code.
If the trouble is indeed caused by the data preprocessing, a temporary solution could be a wrapper over the ImageDataLayer that directly transforms the inputs from memory (#251).
I don't know if we want to explore Numba.

I think if we want to handle matrices directly in Python, Numba's typed arrays and Cython memoryviews are the fastest ways to do it.
The interfaces to the library should be as thin as possible as long as the core functionality stays in C++. However, if some part of the Python wrapper is found to be too slow, then that part can be optimized.
Boost::Python is not a bottleneck in the wrapper. (That aside, we can debate whether Boost::Python or Cython are better wrapper languages for Caffe; I've personally found Cython to be a bit nicer to work with in the past.) In Python: $ ipython --no-banner
In [1]: import caffe
In [2]: net = caffe.Net('examples/imagenet/imagenet_deploy.prototxt')
[...]
In [3]: net.set_mode_gpu()
In [4]: %timeit net.forward()
10 loops, best of 3: 32.2 ms per loop

In C++, the corresponding benchmark measured 1644.23 ms over 50 iterations.

Note that 1644.23 / 50 = 32.88, so there is no noticeable cost to using the Python wrapper, at least through the forward pass. These tests were done on a GTX 770. Note that these speeds correspond to ~3 ms/image, so feature extraction for ten thousand images should take less than a minute. Any additional cost comes from preprocessing, which can be slow (especially image resizing). Where possible, you may want to externally resize images ahead of time; note that this is the standard procedure for ImageNet training, for example. It might be informative to look at a profile for your application, if it's using the latest Python wrapper. Recent changes should have improved preprocessing speed, but if an unreasonable bottleneck still exists, it would be nice to know where it is (even though it may go away following the plan described by @shelhamer above).
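The suggestion above to profile the application can be sketched with the standard-library profiler. The `preprocess_batch` function here is a hypothetical stand-in; in a real run you would profile your own load/resize/transform pipeline:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for a real preprocessing pipeline.
def preprocess_batch(n):
    return [sum(range(1000)) for _ in range(n)]

pr = cProfile.Profile()
pr.enable()
preprocess_batch(100)
pr.disable()

# Print the five most expensive calls by cumulative time.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print('function calls' in s.getvalue())  # True
```

Sorting by cumulative time usually surfaces the slow stage (e.g. image resizing) immediately, without instrumenting the code by hand.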
@kloudkl @shelhamer I might have a bit of time to spare on taking care of the data redundancy in the data layers. You guys seem to have a pretty good idea about what should be done, so if you could share more details about how it ought to be approached, that'd be nice.

I suppose the DataTransformer would have to take in the raw Datum and apply the necessary transformations as described by a new structure DataTransformerParameter in the prototxt, and write the transformed data directly to the top blob (to avoid needless copies). This would at least remove redundancy in the Image and Data layers (and also my own input layer for Siamese networks). I haven't had a proper look at the other data layers so far.

Also, I have no experience whatsoever with the Python wrappers, so I'm not sure how it would all fit together. Let me know if there are specific points I should be wary of.
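A rough sketch of what such a transformation parameter might look like in prototxt. The field and layer names here are illustrative of the design being discussed, not the exact message that existed at the time (the feature eventually landed in caffe.proto as `transform_param`):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
    mirror: true
  }
}
```

Keeping the transformation description in the prototxt lets every data layer share one DataTransformer implementation instead of duplicating cropping and mean-subtraction code.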
How is transformation related to augmentation? See #701.
@GeenuX, your design is very straightforward. Although it is nice to wrap the DataTransformer to avoid data preprocessing in Python, it is more important to eliminate redundancy in the existing data layers. There is no clear boundary between data augmentation and data preprocessing transformations. Common transformations are cropping, subtracting the mean image/value, and resizing. Augmentations include rotation, translation, zooming (re-scaling), mirroring (flipping), color perturbation, etc.
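A minimal NumPy sketch of the common preprocessing transformations mentioned above (center crop, mean subtraction, mirroring). The function names are illustrative, not pycaffe API:

```python
import numpy as np

def center_crop(img, size):
    """Crop an (H, W, C) image to (size, size, C) around the center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def transform(img, crop_size, mean, mirror=False):
    """Apply crop, mean subtraction, and optional horizontal flip."""
    out = center_crop(img, crop_size).astype(np.float32)
    out -= mean                 # mean image or scalar mean value
    if mirror:
        out = out[:, ::-1]      # horizontal flip
    return out

img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
x = transform(img, crop_size=6, mean=127.0, mirror=True)
print(x.shape)  # (6, 6, 3)
```

A C++ DataTransformer would do the same work per Datum, writing directly into the top blob to avoid the extra copies mentioned earlier in the thread.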
The real problem that I face is that the convolutional layers are extremely slow on CPU. Notably, this type of layer was in fact sped up 4.5x several months ago [1].

[1] Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014.
@kloudkl It seems that this work was done on Caffe. Are diffs/patches/a fork available?
There is no publicly available implementation of the CPU algorithm proposed in the paper. But their optimization is orthogonal to hardware-specific acceleration methods. The separable convolution sample provided by the CUDA SDK might help [1].

[1] Victor Podlozhnyuk. Image Convolution with CUDA. Nvidia Corporation, 2007.
It seems that not many changes are needed to implement the two-stage 1D convolutions. In the forward propagation, the original 2D kernel is replaced by two 1D kernels. The case for the backward propagation is very similar.
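The two-stage idea can be checked numerically: a rank-1 2D kernel is the outer product of two 1D filters, and applying the two 1D filters in sequence matches the full 2D pass. A NumPy sketch (cross-correlation, "valid" padding, as in Caffe's convolution):

```python
import numpy as np

def conv2d_valid(img, k):
    """Direct 2D cross-correlation with 'valid' output size."""
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# A separable (rank-1) kernel is the outer product of two 1D filters.
v = np.array([1.0, 2.0, 1.0])    # vertical smoothing
h = np.array([-1.0, 0.0, 1.0])   # horizontal difference (Sobel-like)
K = np.outer(v, h)

img = np.random.rand(16, 16)
full = conv2d_valid(img, K)

# Two-stage version: 1D vertical pass, then 1D horizontal pass.
stage1 = conv2d_valid(img, v[:, None])
stage2 = conv2d_valid(stage1, h[None, :])
print(np.allclose(full, stage2))  # True
```

For a KxK kernel this trades O(K^2) multiplies per output pixel for O(2K), which is where the speedup comes from.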
@kloudkl please don't change the subject in the middle of the thread. I've reverted the title to the Python interface and preprocessing speed. For separable convolutions and the low-rank speedup conversation started in #941 (comment), please open another issue if it's about further development, or take it to the mailing list if it's general discussion. Note that Caffe can already do separable spatial filters as of #505. See the Sobel operator convolution test for an example: https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_convolution_layer.cpp#L174-L268. You can accomplish it without further coding by composing convolution layers: one for the horizontal direction and the other for the vertical.
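Composing the two convolution layers as described might look like this in prototxt. The layer and blob names are illustrative; `kernel_h`/`kernel_w` are the rectangular-kernel fields exercised by the Sobel test linked above:

```
layer {
  name: "conv_h"
  type: "Convolution"
  bottom: "data"
  top: "conv_h"
  convolution_param { num_output: 1 kernel_h: 1 kernel_w: 3 }
}
layer {
  name: "conv_v"
  type: "Convolution"
  bottom: "conv_h"
  top: "conv_v"
  convolution_param { num_output: 1 kernel_h: 3 kernel_w: 1 }
}
```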
@GeenuX a PR to introduce the DataTransformer class, add a TransformationParameter in caffe.proto, and refactor the DataLayer and ImageDataLayer to work this way would be excellent! Do not worry about the Python interface integration and the rest of the data layers. Their refactoring can follow once the infrastructure is in place and DataLayer has been rewritten to serve as an example. The DataTransformer class is intended for use with the current data layers. Further generalizations for augmentations are out of the scope of the initial development and can follow later.
@shelhamer, thank you very much for the new powerful convolution layer! Compared with the abundant features added in each release, there is far less API and tutorial documentation to announce all the possibilities that can be achieved and how to exploit them. Not all users are able to, or have the time to, follow all the code changes. Many of them just want to solve specific use cases quickly with the help of the docs. I will benchmark the separable convolution. If it is much faster than the 2D convolution, I will also write a tutorial.
@kloudkl thank you for the benchmarking and potential tutorial. I agree that there could be much more thorough API and tutorial documentation. Now that our recent improvements are in the latest release, we will be turning our attention to catching up with documentation soon.
How will this interact with #569?
Closing, since Boost::Python incurs no overhead as seen in #941 (comment), and the preprocessing is on the way to simplification and efficiency with #954 and #963, with a further follow-up to expose data transformations to Python.
@kloudkl you say there is no clear boundary between data augmentation and data preprocessing, but I think one should be careful when saying this.

I have currently implemented on-the-fly data augmentation within Caffe by modifying Datum attributes. I have found it very useful to perform different data augmentations that depend on the labels themselves. For example, for MNIST, one will want to be more careful about rotating "1"s, which may be confused with "7"s, than "3"s, whose labels are more invariant to rotation.

I also don't want to augment the test data, so I add a Datum attribute that tells me whether the data is test or train, and check this attribute before modifying the Datum. Not pretty, but very fast and effective. This can probably be done within a data layer, but one might have to have separate data layers for testing and training data, unless I'm missing something. This is just food for thought for anyone thinking of implementing on-the-fly augmentation as part of a data layer.
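The label-dependent policy described above can be sketched as a small helper that picks augmentation parameters per example. All names and limits here are hypothetical, not Caffe API:

```python
import numpy as np

# Hypothetical per-label rotation limits (degrees): be conservative
# with digits that become ambiguous when rotated.
MAX_ROT = {1: 5.0, 7: 5.0}
DEFAULT_MAX_ROT = 15.0

def sample_rotation(label, phase, rng):
    """Pick a rotation angle for one example, honoring label and phase."""
    if phase == 'test':            # never augment held-out data
        return 0.0
    limit = MAX_ROT.get(label, DEFAULT_MAX_ROT)
    return rng.uniform(-limit, limit)

rng = np.random.default_rng(0)
angle_3 = sample_rotation(3, 'train', rng)   # within +/- 15 degrees
angle_1 = sample_rotation(1, 'train', rng)   # within +/- 5 degrees
angle_t = sample_rotation(1, 'test', rng)
print(angle_t)  # 0.0
```

Gating on the phase inside the sampler is one way around needing separate train/test data layers, though a per-Datum flag as described above works too.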
@nicodjimenez @kloudkl This is closed. We could continue the data augmentation discussion at #701.
I'm trying to implement an object detection python program that uses Caffe, and I'm trying to make it faster but I have not found any information on how to make it compatible with PyPy. Does anyone know how to achieve this? |
bruh. it's just not compatible. if you're looking for speed, writing custom C++ functions and adding them to the wrapper is probably your best bet. i'm really not trying to be dismissive here, i just spent a lot of time looking into pypy/caffe last week only to find that it's more complicated than it's worth. if you can find a way to pass data to caffe without using numpy, you might be able to get something going, but i'm not totally sure how that'd be done. |
Hey Slime! lol I can pass the data to caffe without using numpy, but you said caffe isn't compatible with pypy either, so I'm not sure that would help though? |
nah, you're thinking of pycaffe. caffe is C++; it doesn't even know that PyPy exists. the problem is that caffe needs a contiguous array of floats, and python lists are basically arrays of pointers to boxed float objects (which wrap C doubles), so they need to be converted to contiguous memory to be operated on efficiently. typically numpy is used for this, but numPyPy is still in development and as such it is missing some functionality. you can certainly try the latest numPyPy implementation but there's no guarantee that it will be faster or even work at all.

that being said, please do let us know how that goes as i'm sure there are a lot of people around here wanting more speed out of caffe without getting down and dirty with C++. in any case, this probably isn't the proper thread to discuss numPyPy ;)
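The contiguity point can be illustrated with NumPy (under CPython; numPyPy behavior may differ):

```python
import numpy as np

# A Python list holds pointers to boxed float objects; converting it
# to a NumPy array produces one contiguous block of C floats.
pylist = [0.1, 0.2, 0.3, 0.4]
arr = np.asarray(pylist, dtype=np.float32)
print(arr.flags['C_CONTIGUOUS'])    # True

# Slicing with a stride breaks contiguity, which C++ consumers like
# pycaffe cannot accept directly.
view = arr[::2]
print(view.flags['C_CONTIGUOUS'])   # False

# np.ascontiguousarray copies the data back into contiguous memory.
fixed = np.ascontiguousarray(view)
print(fixed.flags['C_CONTIGUOUS'])  # True
```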
Boost.Python is too slow for a practical application system. It takes several hours to extract features for only ten thousand images using the current pycaffe interface. There shouldn't be such a huge performance gap between the C++ version and the other languages. According to some benchmark results, Cython can be much faster than Boost.Python. If Caffe wants users to frequently use the concise Python interface in their daily experiments, the binding technology should be changed to Cython.
Simple benchmark between Cython and Boost.Python
http://blog.chrischou.org/2010/02/28/simple-benchmark-between-cython-and-boost-python/
C++ wrapper benchmark: Cython, PyBindGen, Boost
https://groups.google.com/forum/#!topic/cython-users/lQO9lGj5JEc
[Stackless] [C++-sig] [Boost] Trouble optimizing Boost.Python integration for game development (it seems too slow)
http://www.stackless.com/pipermail/stackless/2009-August/004249.html
Python vs. Cython vs. D (PyD) vs. C++ (SWIG)
http://prabhuramachandran.blogspot.com/2008/09/python-vs-cython-vs-d-pyd-vs-c-swig.html
https://github.com/cython/cython/wiki/SWIG