Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getImageData and putImageData could be much faster #909

Open
pcordes opened this issue Apr 26, 2017 · 1 comment
Open

getImageData and putImageData could be much faster #909

pcordes opened this issue Apr 26, 2017 · 1 comment

Comments

@pcordes
Copy link

pcordes commented Apr 26, 2017

NAN_METHOD(Context2d::GetImageData) and PutImageData could be much faster (at least a factor of 2, or much more with SIMD) by loading/storeing whole pixels as uint32_t. Use integer rotates and stuff to do the format conversions. Use an endian-conversion function that can inline (to a single bswap instruction for x86, for example), since compilers don't do a good job with single-byte loads/stores.


gcc doesn't merge the single-byte loads and stores into 32-bit stores (even on x86 where unaligned loads/stores never fault and are fast), so that's a major bottleneck. (If you look at the asm, it's not pretty).

Transforming ARGB to/from RGBA can be done with a single 32-bit rotate by 8 bits. See this SO post for a portable + safe rotate idiom that reliably compiles to a single rotate instruction on most compilers.

To handle the portable-to-native endian conversion, use something like uint32_t ntohl(uint32_t) from <arpa/inet.h> if available. (On any decent implementation, these functions will inline. e.g. on x86 gcc, it inlines to a single bswap instruction). I'm not sure about Windows, or if we'd need an #ifdef to get a different endian-conversion function / intrinsic.

In summary, the get and put image data functions should both be doing 32-bit loads and stores, and doing an endian-conversion + a rotate. The current GetImageData should run at about 1 byte per clock cycle on recent Intel CPUs (as compiled by gcc6.3.1 for x86-64), for ranges of pixels that have alpha == 0 or 255. Maybe even slower on AMD Bulldozer-family CPUs, since they have fewer integer execution units.

A 32-bit-at-a-time bswap+rotate version should be at least twice as fast. It probably can't manage 1 pixel per clock since it still has to check alpha. (Although you can do that without unpacking the alpha with a shift, by checking if pix >= (0xffUL << 24) || pix < (1UL << 24) in a format where we know alpha is in the MSB 8 bits.)

The floating-point stuff for alpha looks ridiculously slow. Probably an integer multiply and shift would be a win. (Or integer division by 255, which compilers can implement with multiply and shift.) At the very least, GetImageData should do float alpha = 255.0f / (float)a and then three multiplies, instead of the other way around and three divides. (Multiply is much faster than divide).

I guess I should just write the code and submit a pull request.


Of course, the alpha stuff could be done much faster with SIMD (floating-point or just unpack to 16-bit integer like most SIMD graphics code does). You don't even need to do runtime CPU detection to get a big speedup on x86-64, because SSE2 is baseline.

SSSE3 would be extremely useful though (for a byte-shuffle), and could speed up the easy-alpha case by maybe another factor of 4 (16 bytes at a time with a fallback if any of the 4 pixels have a non-easy alpha). Without pshufb (_mm_shuffle_epi8), you can use SSE2 bit-shifts and boolean (AND/OR/XOR) ops to exchange R and B for 4 pixels in a 16-byte vector. (Which is what a little-endian platform needs).

@pcordes
Copy link
Author

pcordes commented Apr 28, 2017

Working on the code now. I have scalar loops which should be about twice as fast, and SSE2 / SSSE3 versions (compile-time detection only) which should be maybe 4 times as fast. (On Intel Haswell. Timing estimated from looking at the asm output and http://agner.org/optimize/ instruction timings and microarch details. These loops are definitely not going to be memory-bandwidth bound, even from main memory :/ Maybe an AVX2 version could be.)

I need to test, and probably debug, but then I'll submit a merge request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants