Vectorize Paeth filtering on stable #511
I also tried a 1:1 port of the SIMD filtering algorithm, but this is as far as I got before things started falling apart: […]
I got it to vectorize the comparisons too, so only the final value selection is still scalar: https://godbolt.org/z/TPdoWPPMd
But the direct translation of Portable SIMD code is still pretty gnarly and doesn't look any faster: https://godbolt.org/z/b7G3xnsj8
I've wired up my most successful attempt; let's see if it beats the autovectorization we already have in place. Not sure how to benchmark it, though: the filters are already cheap compared to the rest of encoding.
Well, it turns out the solution was right in front of me all along. The filtering code currently in use already vectorizes perfectly, resulting in code identical to the explicit SIMD version: https://godbolt.org/z/P8WWsTs6Y

Let me double-check whether explicit SIMD still gets us any gains in unfiltering. If so, we can adapt this trivial approach to get the same benefits on stable.
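For context, the scalar form the compiler is autovectorizing here is the Paeth predictor from the PNG specification. A minimal sketch (illustrative, not the crate's exact source):

```rust
/// Paeth predictor from the PNG specification: choose whichever of
/// `a` (left), `b` (above), or `c` (upper-left) is closest to the
/// linear estimate p = a + b - c.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    // Widen to i16 so the intermediate sum and differences cannot overflow.
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```

Applied to fixed-size byte chunks, this is exactly the kind of straight-line integer code LLVM can lower to the same shuffles and blends as the hand-written SIMD version.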
Okay, I have a branch where I've simply replaced the handwritten SIMD implementations with autovectorized ones, and the results are pretty surprising, at least on my x86_64 machine. [Full benchmarks]
@fintelia how would you like me to proceed? Should I switch bpp 3 and 6 over to autovectorization, and remove the portable SIMD variant entirely? Or would you like me to keep the portable SIMD codepath for bpp 3 and 6 even though it's apparently worse?
Looks like the autovectorization for bpp 3 and bpp 6 still relies on […]
In this prototype, yes. However, I am confident that I will be able to convert the 3 and 6 bpp codepaths to use […]
Well, that confidence was misplaced. Autovectorization fails here, and for a really interesting reason.

Here's a godbolt link illustrating the perfect assembly we're looking for: https://godbolt.org/z/Wqoerq98T

Here's the assembly we actually get, with no vectorization whatsoever: https://godbolt.org/z/h3sj8dWPh

The only change is that the function is instantiated with array length 4 rather than 8! Our inputs are too short for the compiler to consider vectorization. My attempts at making the function operate on […]
Note to future self: try converting the intermediate values to wider types to make the whole vector wider and coax the compiler into vectorizing it. |
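One way to sketch that idea (hypothetical code, not from the crate): keep the data in 8 lanes of i16 regardless of how many lanes carry real pixel data, so the loop is a full 128-bit vector wide and looks worthwhile to the vectorizer's cost model.

```rust
/// Hypothetical widening sketch: always compute 8 lanes of i16
/// (one full 128-bit vector), even if only `bpp` lanes hold real data.
fn paeth_lanes(a: [i16; 8], b: [i16; 8], c: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        let p = a[i] + b[i] - c[i];
        let pa = (p - a[i]).abs();
        let pb = (p - b[i]).abs();
        let pc = (p - c[i]).abs();
        out[i] = if pa <= pb && pa <= pc {
            a[i]
        } else if pb <= pc {
            b[i]
        } else {
            c[i]
        };
    }
    out
}
```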
i16 is wide enough for […]. I can force inlining, but then vectorization falls apart again, possibly because the compiler sees it only needs to compute 6 lanes instead of 8.
Okay, I've finally figured out the 3 and 6 bpp case, PR up: #513. Edit: never mind; it turns out the original was already vectorized. My changes actually scalarized it, and that slightly improved performance on high-ILP CPUs.
I measured the performance with and without […].

8-bit RGBA doesn't seem to benefit at all, or at least not on my machine. That makes sense: I've recently replaced the portable SIMD impl with autovectorization in #512 because autovectorization produced better assembly. We should probably just delete that codepath now.

On a large RGB (no alpha) 8-bit image made with imagemagick via […]

As I've found in #513, the current code for RGB Paeth unfiltering does actually emit vector instructions. The portable SIMD codepath doesn't differ just in the vector instructions it emits; it uses completely different code for the main loop, and I suspect that is the part responsible for the performance improvement:

Lines 169 to 197 in e87c685
If we could fit the autovectorized code for the bpp=3 case that @okaneco came up with into this loop scaffolding, we should get the best of both worlds and be able to get rid of the […]
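For illustration, the overall shape of that main loop (with the left and upper-left pixels carried across iterations) is roughly the following; the function and variable names are made up for this sketch:

```rust
/// Sketch of Paeth unfiltering for one row: `a` holds the previously
/// reconstructed pixel and `c` the pixel above it, both carried across
/// iterations; only the inner per-byte loop needs to vectorize.
fn unfilter_paeth_row(row: &mut [u8], prev: &[u8], bpp: usize) {
    fn paeth(a: u8, b: u8, c: u8) -> u8 {
        let p = a as i16 + b as i16 - c as i16;
        let (pa, pb, pc) = ((p - a as i16).abs(), (p - b as i16).abs(), (p - c as i16).abs());
        if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
    }
    let mut a = vec![0u8; bpp]; // left neighbor; zero before the first pixel
    let mut c = vec![0u8; bpp]; // upper-left neighbor
    for (cur, b) in row.chunks_exact_mut(bpp).zip(prev.chunks_exact(bpp)) {
        for i in 0..bpp {
            cur[i] = cur[i].wrapping_add(paeth(a[i], b[i], c[i]));
        }
        a.copy_from_slice(cur);
        c.copy_from_slice(b);
    }
}
```

The point of the scaffolding is that the loop-carried dependency (each pixel needs the one just reconstructed to its left) stays outside the small per-byte loop that the compiler is asked to vectorize.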
I couldn't get the straightforward adaptation of the 3bpp algorithm to autovectorize: https://godbolt.org/z/en417qP3b
If I switch from i16 to i32 as the type on which all operations are performed, it vectorizes fine on modern CPUs: https://godbolt.org/z/T67Gn74ze

But in this form it completely fails to vectorize on base x86_64 and performance plummets: https://godbolt.org/z/8zGYbr5bb

Also, when measured against […]

I think I'm just going to give up here, because while 5% is noticeable, only some images are bpp=3 and Paeth, so the average speedup across all images is going to be only about 2%, which is a lot less impressive.
After #539 there is very little benefit to using the […]. All the actual Paeth filtering now uses autovectorization instead of […]
Paeth filtering was remarkably resistant to autovectorization, and it is the only instance where we had to resort to the nightly-only portable SIMD API.
Now that we're looking to make adaptive filtering the default, Paeth filter performance is going to become more important.
I've taken a stab at getting it to autovectorize on latest stable, and the results are promising!
Portable SIMD version (what we're looking to match): https://godbolt.org/z/Pdhx4Kdd8
What I've got so far: https://godbolt.org/z/63av9hbbY
You can see that it vectorized everything except the final loop that selects values; that got unrolled into lots and lots of conditional moves, so at least it's branchless and might benefit from ILP.
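One formulation of the selection step that tends to lower to conditional moves rather than branches is to reduce the three-way choice to two nested two-way selects. A hedged sketch, not the exact code behind the godbolt link:

```rust
/// Branch-friendly lane select: pick between `b` and `c` first, then
/// compare the winner's distance against `a`'s. Equivalent to the
/// standard Paeth tie-breaking order (a, then b, then c), because
/// `pa <= pb && pa <= pc` is the same as `pa <= min(pb, pc)`.
fn select_paeth(a: i16, b: i16, c: i16) -> i16 {
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    let bc = if pb <= pc { b } else { c };    // first two-way select
    let pbc = if pb <= pc { pb } else { pc }; // min(pb, pc)
    if pa <= pbc { a } else { bc }            // second two-way select
}
```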
Both versions also benefit from `-C target-cpu=x86-64-v3` over the default, but the difference isn't dramatic.