Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace handwritten SIMD implementation with autovectorization for +10% perf #512

Merged
merged 1 commit into from
Oct 5, 2024

Conversation

Shnatsel
Copy link
Contributor

@Shnatsel Shnatsel commented Sep 28, 2024

This gives a +10% performance boost on 3bbp and 6bbp unfiltering benchmarks.

cc @anforowicz who wrote the explicit SIMD code

The downside of autovectorization is that it can be finicky and is not at all guarateed. So this may potentially be more fragile than explicit SIMD.

On the flip side, we've been relying on autovectorization for performance for ages and it has always worked once we got it working once. I've checked this sample on Godbolt and it didn't vectorize on Rust v1.60, but every single version starting from 1.61 does vectorize it.

ARM seems to benefit as well, resulting in vectorized and shorter code on godbolt. I don't have an aarch64 machine to measure it though.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Sep 28, 2024

There's some background on this in #511

I'm still hoping to coax autovectorization into working without any std::simd usage at all, but I haven't managed it so far. If I succeed, that will probably supersede this PR - but I figured it's a good idea to start the discussion about replacing std::simd with autovectorization early, in case anyone is opposed to this approach.

@fintelia
Copy link
Contributor

figured it's a good idea to start the discussion about replacing std::simd with autovectorization early, in case anyone is opposed to this approach.

I support the switch. Using std::simd was only begrudgingly accepted when we couldn't figure out a way to get the relevant code to autovectorize.

@Shnatsel
Copy link
Contributor Author

To clarify, this PR still requires std::simd types in function signatures, and thus still requires nightly compiler.

I'm still trying (and failing) to get it to work on stable. For the purposes of this review let's assume that nightly is still going to be required, even for the autovec implementation.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Oct 5, 2024

This performs better in benchmarks, and produces shorter but fully vectorized assembly on both x86_64 and Aarch64.

Since this seems to be an unambiguous improvement, I'm going to go ahead and merge this so that the PR doesn't stall.

@Shnatsel Shnatsel merged commit 3fbbbb1 into image-rs:master Oct 5, 2024
19 checks passed
/// - RGBA => 4 lanes of `i16x4` contain R, G, B, A
/// - RGB => 4 lanes of `i16x4` contain R, G, B, and a ignored 4th value
///
/// The SIMD algorithm below is based on [`libpng`](https://github.com/glennrp/libpng/blob/f8e5fa92b0e37ab597616f554bee254157998227/intel/filter_sse2_intrinsics.c#L261-L280).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't correct.

As it states on L72, the logic is a translation of how pa, pb, and pc are calculated in filter::filter_paeth

image-png/src/filter.rs

Lines 362 to 401 in 3fbbbb1

fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
// This is an optimized version of the paeth filter from the PNG specification, proposed by
// Luca Versari for [FPNGE](https://www.lucaversari.it/FJXL_and_FPNGE.pdf). It operates
// entirely on unsigned 8-bit quantities, making it more conducive to vectorization.
//
// p = a + b - c
// pa = |p - a| = |a + b - c - a| = |b - c| = max(b, c) - min(b, c)
// pb = |p - b| = |a + b - c - b| = |a - c| = max(a, c) - min(a, c)
// pc = |p - c| = |a + b - c - c| = |(b - c) + (a - c)| = ...
//
// Further optimizing the calculation of `pc` a bit tricker. However, notice that:
//
// a > c && b > c
// ==> (a - c) > 0 && (b - c) > 0
// ==> pc > (a - c) && pc > (b - c)
// ==> pc > |a - c| && pc > |b - c|
// ==> pc > pb && pc > pa
//
// Meaning that if `c` is smaller than `a` and `b`, the value of `pc` is irrelevant. Similar
// reasoning applies if `c` is larger than the other two inputs. Assuming that `c >= b` and
// `c <= b` or vice versa:
//
// pc = ||b - c| - |a - c|| = |pa - pb| = max(pa, pb) - min(pa, pb)
//
let pa = b.max(c) - c.min(b);
let pb = a.max(c) - c.min(a);
let pc = if (a < c) == (c < b) {
pa.max(pb) - pa.min(pb)
} else {
255
};
if pa <= pb && pa <= pc {
a
} else if pb <= pc {
b
} else {
c
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!

I didn't actually add this comment, it's an artifact of how Github calculated the diff. You can see the same lines removed further up.

I'd be happy to merge a PR correcting it, though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants