Vectorize Paeth filtering on stable #511
I also tried a 1:1 port of the SIMD filtering algorithm, but this is as far as I got before things started falling apart: […]
I got it to vectorize the comparisons too, so only the final value selection is still scalar: https://godbolt.org/z/TPdoWPPMd
But the direct translation of Portable SIMD code is still pretty gnarly and doesn't look any faster: https://godbolt.org/z/b7G3xnsj8
I've wired up my most successful attempt; let's see if it beats the autovectorization we already have in place. Not sure how to benchmark it, though: the filters are already cheap compared to the rest of encoding.
Well, it turns out the solution was right in front of me all along. The filtering code currently in use already vectorizes perfectly, resulting in code identical to the explicit SIMD version: https://godbolt.org/z/P8WWsTs6Y

Let me double-check whether explicit SIMD still gets us any gains in unfiltering. If so, we can adapt this trivial approach to get the same benefits on stable.
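For context, the scalar form the compiler is autovectorizing here is the Paeth predictor from the PNG specification. A minimal sketch (illustrative, not the crate's exact source):

```rust
/// Paeth predictor from the PNG specification: choose whichever of
/// `a` (left), `b` (above), or `c` (upper-left) is closest to the
/// linear estimate p = a + b - c.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    // Widen to i16 so the intermediate sum and differences cannot overflow.
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```

Applied to fixed-size byte chunks, this is exactly the kind of straight-line integer code LLVM can lower to the same shuffles and blends as the hand-written SIMD version.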
Okay, I have a branch where I've simply replaced the handwritten SIMD implementations with autovectorized ones, and the results are pretty surprising, at least on my x86_64 machine. [Full benchmarks]
@fintelia how would you like me to proceed? Should I switch bpp 3 and 6 over to autovectorization, and remove the portable SIMD variant entirely? Or would you like me to keep the portable SIMD codepath for bpp 3 and 6 even though it's apparently worse?
Looks like the autovectorization for bpp 3 and bpp 6 still relies on […]
In this prototype, yes. However, I am confident that I will be able to convert the 3 and 6 bpp codepaths to use […]
Well, that confidence was misplaced. Autovectorization fails here, and for a really interesting reason.

Here's a godbolt link illustrating the perfect assembly we're looking for: https://godbolt.org/z/Wqoerq98T

Here's the assembly we actually get, with no vectorization whatsoever: https://godbolt.org/z/h3sj8dWPh

The only change is that the function is instantiated with array length 4 rather than 8! Our inputs are too short for the compiler to consider vectorization. My attempts at making the function operate on […]
Note to future self: try converting the intermediate values to wider types to make the whole vector wider and coax the compiler into vectorizing it. |
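One way to sketch that idea (hypothetical code, not from the crate): keep the data in 8 lanes of i16 regardless of how many lanes carry real pixel data, so the loop is a full 128-bit vector wide and looks worthwhile to the vectorizer's cost model.

```rust
/// Hypothetical widening sketch: always compute 8 lanes of i16
/// (one full 128-bit vector), even if only `bpp` lanes hold real data.
fn paeth_lanes(a: [i16; 8], b: [i16; 8], c: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        let p = a[i] + b[i] - c[i];
        let pa = (p - a[i]).abs();
        let pb = (p - b[i]).abs();
        let pc = (p - c[i]).abs();
        out[i] = if pa <= pb && pa <= pc {
            a[i]
        } else if pb <= pc {
            b[i]
        } else {
            c[i]
        };
    }
    out
}
```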
i16 is wide enough for […]. I can force inlining, but then vectorization falls apart again, possibly because the compiler sees it only needs to compute 6 lanes instead of 8.
Okay, I've finally figured out the 3 and 6 bpp case, PR up: #513. Edit: never mind; it turns out the original was already vectorized. My changes actually scalarized it, and that slightly improved performance on high-ILP CPUs.
I measured the performance with and without […].

8-bit RGBA doesn't seem to benefit at all, or at least not on my machine. That makes sense: I've recently replaced the portable SIMD impl with autovectorization in #512 because autovectorization produced better assembly. We should probably just delete that codepath now.

On a large RGB (no alpha) 8-bit image made with imagemagick via […]

As I've found in #513, the current code for RGB Paeth unfiltering does actually emit vector instructions. The portable SIMD codepath doesn't differ just in the vector instructions it emits; it uses completely different code for the main loop, and I suspect that is the part responsible for the performance improvement:

Lines 169 to 197 in e87c685
If we could fit the autovectorized code for the bpp=3 case that @okaneco came up with into this loop scaffolding, we should get the best of both worlds and be able to get rid of the […]
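For illustration, the overall shape of that main loop (with the left and upper-left pixels carried across iterations) is roughly the following; the function and variable names are made up for this sketch:

```rust
/// Sketch of Paeth unfiltering for one row: `a` holds the previously
/// reconstructed pixel and `c` the pixel above it, both carried across
/// iterations; only the inner per-byte loop needs to vectorize.
fn unfilter_paeth_row(row: &mut [u8], prev: &[u8], bpp: usize) {
    fn paeth(a: u8, b: u8, c: u8) -> u8 {
        let p = a as i16 + b as i16 - c as i16;
        let (pa, pb, pc) = ((p - a as i16).abs(), (p - b as i16).abs(), (p - c as i16).abs());
        if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
    }
    let mut a = vec![0u8; bpp]; // left neighbor; zero before the first pixel
    let mut c = vec![0u8; bpp]; // upper-left neighbor
    for (cur, b) in row.chunks_exact_mut(bpp).zip(prev.chunks_exact(bpp)) {
        for i in 0..bpp {
            cur[i] = cur[i].wrapping_add(paeth(a[i], b[i], c[i]));
        }
        a.copy_from_slice(cur);
        c.copy_from_slice(b);
    }
}
```

The point of the scaffolding is that the loop-carried dependency (each pixel needs the one just reconstructed to its left) stays outside the small per-byte loop that the compiler is asked to vectorize.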
I couldn't get the straightforward adaptation of the 3bpp algorithm to autovectorize: https://godbolt.org/z/en417qP3b
If I switch from i16 to i32 as the type on which all operations are performed, it vectorizes fine on modern CPUs: https://godbolt.org/z/T67Gn74ze

But in this form it completely fails to vectorize on base x86_64 and performance plummets: https://godbolt.org/z/8zGYbr5bb

Also, when measured against […]

I think I'm just going to give up here, because while 5% is noticeable, only some images are bpp=3 and Paeth, so the average speedup across all images is going to be only about 2%, which is a lot less impressive.
After #539 there is very little benefit to using the […]. All the actual Paeth filtering now uses autovectorization instead of […]
Paeth filtering was remarkably resistant to autovectorization, and it is the only instance where we had to resort to the nightly-only portable SIMD API.
Now that we're looking to make adaptive filtering the default, Paeth filter performance is going to become more important.
I've taken a stab at getting it to autovectorize on latest stable, and the results are promising!
Portable SIMD version (what we're looking to match): https://godbolt.org/z/Pdhx4Kdd8
What I've got so far: https://godbolt.org/z/63av9hbbY
You can see that it vectorized everything except the final loop that selects values; that got unrolled into lots and lots of conditional moves, so at least it's branchless and might benefit from ILP.
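One formulation of the selection step that tends to lower to conditional moves rather than branches is to reduce the three-way choice to two nested two-way selects. A hedged sketch, not the exact code behind the godbolt link:

```rust
/// Branch-friendly lane select: pick between `b` and `c` first, then
/// compare the winner's distance against `a`'s. Equivalent to the
/// standard Paeth tie-breaking order (a, then b, then c), because
/// `pa <= pb && pa <= pc` is the same as `pa <= min(pb, pc)`.
fn select_paeth(a: i16, b: i16, c: i16) -> i16 {
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    let bc = if pb <= pc { b } else { c };    // first two-way select
    let pbc = if pb <= pc { pb } else { pc }; // min(pb, pc)
    if pa <= pbc { a } else { bc }            // second two-way select
}
```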
Both versions also benefit from `-C target-cpu=x86-64-v3` over the default, but the difference isn't dramatic.