getImageData and putImageData could be much faster #909

pcordes · 2017-04-26T04:48:14Z

NAN_METHOD(Context2d::GetImageData) and PutImageData could be much faster (at least a factor of 2, or much more with SIMD) by loading/storeing whole pixels as uint32_t. Use integer rotates and stuff to do the format conversions. Use an endian-conversion function that can inline (to a single bswap instruction for x86, for example), since compilers don't do a good job with single-byte loads/stores.

gcc doesn't merge the single-byte loads and stores into 32-bit stores (even on x86 where unaligned loads/stores never fault and are fast), so that's a major bottleneck. (If you look at the asm, it's not pretty).

Transforming ARGB to/from RGBA can be done with a single 32-bit rotate by 8 bits. See this SO post for a portable + safe rotate idiom that reliably compiles to a single rotate instruction on most compilers.

To handle the portable-to-native endian conversion, use something like uint32_t ntohl(uint32_t) from <arpa/inet.h> if available. (On any decent implementation, these functions will inline. e.g. on x86 gcc, it inlines to a single bswap instruction). I'm not sure about Windows, or if we'd need an #ifdef to get a different endian-conversion function / intrinsic.

In summary, the get and put image data functions should both be doing 32-bit loads and stores, and doing an endian-conversion + a rotate. The current GetImageData should run at about 1 byte per clock cycle on recent Intel CPUs (as compiled by gcc6.3.1 for x86-64), for ranges of pixels that have alpha == 0 or 255. Maybe even slower on AMD Bulldozer-family CPUs, since they have fewer integer execution units.

A 32-bit-at-a-time bswap+rotate version should be at least twice as fast. It probably can't manage 1 pixel per clock since it still has to check alpha. (Although you can do that without unpacking the alpha with a shift, by checking if pix >= (0xffUL << 24) || pix < (1UL << 24) in a format where we know alpha is in the MSB 8 bits.)

The floating-point stuff for alpha looks ridiculously slow. Probably an integer multiply and shift would be a win. (Or integer division by 255, which compilers can implement with multiply and shift.) At the very least, GetImageData should do float alpha = 255.0f / (float)a and then three multiplies, instead of the other way around and three divides. (Multiply is much faster than divide).

I guess I should just write the code and submit a pull request.

Of course, the alpha stuff could be done much faster with SIMD (floating-point or just unpack to 16-bit integer like most SIMD graphics code does). You don't even need to do runtime CPU detection to get a big speedup on x86-64, because SSE2 is baseline.

SSSE3 would be extremely useful though (for a byte-shuffle), and could speed up the easy-alpha case by maybe another factor of 4 (16 bytes at a time with a fallback if any of the 4 pixels have a non-easy alpha). Without pshufb (_mm_shuffle_epi8), you can use SSE2 bit-shifts and boolean (AND/OR/XOR) ops to exchange R and B for 4 pixels in a 16-byte vector. (Which is what a little-endian platform needs).

The text was updated successfully, but these errors were encountered:

pcordes · 2017-04-28T00:03:23Z

Working on the code now. I have scalar loops which should be about twice as fast, and SSE2 / SSSE3 versions (compile-time detection only) which should be maybe 4 times as fast. (On Intel Haswell. Timing estimated from looking at the asm output and http://agner.org/optimize/ instruction timings and microarch details. These loops are definitely not going to be memory-bandwidth bound, even from main memory :/ Maybe an AVX2 version could be.)

I need to test, and probably debug, but then I'll submit a merge request.

This was referenced Apr 26, 2017

Context2d::PutImageData doesn't appear to respect endian-dependent byte ordering of the canvas #908

Open

zero-copy API for reading or writing pixels directly on the canvas surface? #910

Open

zbjornson added the performance label Nov 5, 2017

zbjornson mentioned this issue Mar 8, 2018

Add prebuilds #992

Merged

zbjornson mentioned this issue Jun 1, 2018

Allow a format and explicit Buffer as backing store #1172

Open

1 task

zbjornson mentioned this issue Jun 13, 2020

Speed up putImageData for RGBA canvases #1606

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getImageData and putImageData could be much faster #909

getImageData and putImageData could be much faster #909

pcordes commented Apr 26, 2017 •

edited

Loading

pcordes commented Apr 28, 2017 •

edited

Loading

getImageData and putImageData could be much faster #909

getImageData and putImageData could be much faster #909

Comments

pcordes commented Apr 26, 2017 • edited Loading

pcordes commented Apr 28, 2017 • edited Loading

pcordes commented Apr 26, 2017 •

edited

Loading

pcordes commented Apr 28, 2017 •

edited

Loading