Replace handwritten SIMD implementation with autovectorization for +10% perf #512
Conversation
…surprising performance gain
There's some background on this in #511. I'm still hoping to coax autovectorization into working without any …
I support the switch. Using …
To clarify, this PR still requires nightly. I'm still trying (and failing) to get it to work on stable. For the purposes of this review, let's assume that nightly is still going to be required, even for the autovectorized implementation.
This performs better in benchmarks, and produces shorter but fully vectorized assembly on both x86_64 and AArch64. Since this seems to be an unambiguous improvement, I'm going to go ahead and merge this so that the PR doesn't stall.
/// - RGBA => 4 lanes of `i16x4` contain R, G, B, A
/// - RGB => 4 lanes of `i16x4` contain R, G, B, and an ignored 4th value
///
/// The SIMD algorithm below is based on [`libpng`](https://github.com/glennrp/libpng/blob/f8e5fa92b0e37ab597616f554bee254157998227/intel/filter_sse2_intrinsics.c#L261-L280).
This comment isn't correct.
As it states on L72, the logic is a translation of how `pa`, `pb`, and `pc` are calculated in `filter::filter_paeth`:
Lines 362 to 401 in 3fbbbb1
```rust
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // This is an optimized version of the paeth filter from the PNG specification, proposed by
    // Luca Versari for [FPNGE](https://www.lucaversari.it/FJXL_and_FPNGE.pdf). It operates
    // entirely on unsigned 8-bit quantities, making it more conducive to vectorization.
    //
    // p = a + b - c
    // pa = |p - a| = |a + b - c - a| = |b - c| = max(b, c) - min(b, c)
    // pb = |p - b| = |a + b - c - b| = |a - c| = max(a, c) - min(a, c)
    // pc = |p - c| = |a + b - c - c| = |(b - c) + (a - c)| = ...
    //
    // Further optimizing the calculation of `pc` is a bit trickier. However, notice that:
    //
    //     a > c && b > c
    //     ==> (a - c) > 0 && (b - c) > 0
    //     ==> pc > (a - c) && pc > (b - c)
    //     ==> pc > |a - c| && pc > |b - c|
    //     ==> pc > pb && pc > pa
    //
    // Meaning that if `c` is smaller than `a` and `b`, the value of `pc` is irrelevant. Similar
    // reasoning applies if `c` is larger than the other two inputs. Assuming that `a <= c` and
    // `c <= b`, or vice versa:
    //
    //     pc = ||b - c| - |a - c|| = |pa - pb| = max(pa, pb) - min(pa, pb)
    //
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        255
    };
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```
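Since the merged change leans on this scalar formulation, one quick way to build confidence in it is an exhaustive equivalence check against the predictor as written in the PNG specification: the whole 8-bit input space is only 2^24 triples. This is a sketch, not part of the PR; `paeth_spec` is a hypothetical name for the straightforward spec translation.

```rust
/// Paeth predictor written directly from the PNG specification, using
/// signed 16-bit intermediates so `a + b - c` cannot overflow.
fn paeth_spec(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

/// The optimized, unsigned-only version from the snippet above.
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        // `c` is not between `a` and `b`, so `pc` can never be the unique
        // minimum; any value >= max(pa, pb) preserves the decision.
        255
    };
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

fn main() {
    // Exhaustive check over all 256^3 input triples.
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            for c in 0..=255u8 {
                assert_eq!(filter_paeth(a, b, c), paeth_spec(a, b, c));
            }
        }
    }
    println!("optimized Paeth matches the spec for all inputs");
}
```

The two agree even in the `pc = 255` branch: when `c` is not between `a` and `b`, the spec's `pc` equals `pa + pb`, which is never strictly smaller than `pa` or `pb`, so the tie-breaking order (`a`, then `b`, then `c`) produces the same result either way.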
Thanks for pointing this out!
I didn't actually add this comment; it's an artifact of how GitHub calculated the diff. You can see the same lines removed further up.
I'd be happy to merge a PR correcting it, though!
This gives a +10% performance boost on the 3bpp and 6bpp unfiltering benchmarks.
cc @anforowicz who wrote the explicit SIMD code
The downside of autovectorization is that it can be finicky and is not guaranteed at all, so this may be more fragile than explicit SIMD.
On the flip side, we've been relying on autovectorization for performance for ages, and once we got a loop vectorizing it has kept vectorizing. I've checked this sample on Godbolt: it didn't vectorize on Rust 1.60, but every version starting from 1.61 does vectorize it.
ARM seems to benefit as well, resulting in shorter, vectorized code on Godbolt. I don't have an AArch64 machine to measure it on, though.
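To illustrate what the compiler is being asked to do here: Paeth unfiltering is inherently serial across pixels (each pixel depends on the reconstructed pixel to its left), so autovectorization happens across the channels within one pixel, where the work is straight-line `u8` arithmetic over a fixed-width array. A minimal sketch of such a loop, with hypothetical names (`unfilter_paeth3`), not the crate's actual code:

```rust
// Optimized Paeth predictor (same formulation as the snippet in this PR).
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        255
    };
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

/// Hypothetical sketch: undo Paeth filtering for one row of 3-byte pixels,
/// in place. `prev` is the already-unfiltered previous row; for the first
/// row of an image, pass a zeroed slice.
fn unfilter_paeth3(row: &mut [u8], prev: &[u8]) {
    let mut a = [0u8; 3]; // reconstructed pixel to the left
    let mut c = [0u8; 3]; // pixel above-left
    for (cur, up) in row.chunks_exact_mut(3).zip(prev.chunks_exact(3)) {
        // The outer pixel loop is serial, but this fixed-width inner loop
        // over the channels is branch-free u8 math the compiler can turn
        // into SIMD lanes.
        for i in 0..3 {
            cur[i] = cur[i].wrapping_add(filter_paeth(a[i], up[i], c[i]));
        }
        a.copy_from_slice(cur);
        c.copy_from_slice(up);
    }
}

fn main() {
    // First pixel predicts from zeros (predictor picks 0), so it stays
    // as-is; the second pixel predicts from the first.
    let mut row = vec![10u8, 20, 30, 30, 30, 30];
    let prev = [0u8; 6];
    unfilter_paeth3(&mut row, &prev);
    assert_eq!(row, vec![10, 20, 30, 40, 50, 60]);
}
```

Whether a loop like this actually vectorizes depends on the compiler version and target, which is exactly the fragility discussed above; checking the emitted assembly on Godbolt is the practical way to verify it.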