Replace handwritten SIMD implementation with autovectorization for +10% perf #512
Conversation
…surprising performance gain
There's some background on this in #511. I'm still hoping to coax autovectorization into working without any …
I support the switch. Using …
To clarify, this PR still requires nightly. I'm still trying (and failing) to get it to work on stable. For the purposes of this review, let's assume that nightly is still going to be required, even for the autovectorized implementation.
This performs better in benchmarks, and produces shorter but fully vectorized assembly on both x86_64 and AArch64. Since this seems to be an unambiguous improvement, I'm going to go ahead and merge this so that the PR doesn't stall.
/// - RGBA => 4 lanes of `i16x4` contain R, G, B, A
/// - RGB => 4 lanes of `i16x4` contain R, G, B, and an ignored 4th value
///
/// The SIMD algorithm below is based on [`libpng`](https://github.com/glennrp/libpng/blob/f8e5fa92b0e37ab597616f554bee254157998227/intel/filter_sse2_intrinsics.c#L261-L280).
This comment isn't correct.
As it states on L72, the logic is a translation of how `pa`, `pb`, and `pc` are calculated in `filter::filter_paeth`:
Lines 362 to 401 in 3fbbbb1
```rust
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // This is an optimized version of the paeth filter from the PNG specification, proposed by
    // Luca Versari for [FPNGE](https://www.lucaversari.it/FJXL_and_FPNGE.pdf). It operates
    // entirely on unsigned 8-bit quantities, making it more conducive to vectorization.
    //
    // p = a + b - c
    // pa = |p - a| = |a + b - c - a| = |b - c| = max(b, c) - min(b, c)
    // pb = |p - b| = |a + b - c - b| = |a - c| = max(a, c) - min(a, c)
    // pc = |p - c| = |a + b - c - c| = |(b - c) + (a - c)| = ...
    //
    // Further optimizing the calculation of `pc` is a bit trickier. However, notice that:
    //
    //     a > c && b > c
    //     ==> (a - c) > 0 && (b - c) > 0
    //     ==> pc > (a - c) && pc > (b - c)
    //     ==> pc > |a - c| && pc > |b - c|
    //     ==> pc > pb && pc > pa
    //
    // Meaning that if `c` is smaller than `a` and `b`, the value of `pc` is irrelevant. Similar
    // reasoning applies if `c` is larger than the other two inputs. Assuming that `a <= c` and
    // `c <= b`, or vice versa:
    //
    //     pc = ||b - c| - |a - c|| = |pa - pb| = max(pa, pb) - min(pa, pb)
    //
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        255
    };
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```
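Since the merged change leans on this scalar formulation, one quick way to build confidence in it is an exhaustive equivalence check against the predictor as written in the PNG specification: the whole 8-bit input space is only 2^24 triples. This is a sketch, not part of the PR; `paeth_spec` is a hypothetical name for the straightforward spec translation.

```rust
/// Paeth predictor written directly from the PNG specification, using
/// signed 16-bit intermediates so `a + b - c` cannot overflow.
fn paeth_spec(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

/// The optimized, unsigned-only version from the snippet above.
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        // `c` is not between `a` and `b`, so `pc` can never be the unique
        // minimum; any value >= max(pa, pb) preserves the decision.
        255
    };
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

fn main() {
    // Exhaustive check over all 256^3 input triples.
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            for c in 0..=255u8 {
                assert_eq!(filter_paeth(a, b, c), paeth_spec(a, b, c));
            }
        }
    }
    println!("optimized Paeth matches the spec for all inputs");
}
```

The two agree even in the `pc = 255` branch: when `c` is not between `a` and `b`, the spec's `pc` equals `pa + pb`, which is never strictly smaller than `pa` or `pb`, so the tie-breaking order (`a`, then `b`, then `c`) produces the same result either way.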
Thanks for pointing this out!
I didn't actually add this comment; it's an artifact of how GitHub calculated the diff. You can see the same lines removed further up.
I'd be happy to merge a PR correcting it, though!
This gives a +10% performance boost on the 3bpp and 6bpp unfiltering benchmarks.
cc @anforowicz who wrote the explicit SIMD code
The downside of autovectorization is that it can be finicky and is not guaranteed at all, so this may be more fragile than explicit SIMD.
On the flip side, we've been relying on autovectorization for performance for ages, and once we got a loop vectorizing it has kept vectorizing. I've checked this sample on Godbolt: it didn't vectorize on Rust 1.60, but every version starting from 1.61 does vectorize it.
ARM seems to benefit as well, resulting in shorter, vectorized code on Godbolt. I don't have an AArch64 machine to measure it on, though.
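To illustrate what the compiler is being asked to do here: Paeth unfiltering is inherently serial across pixels (each pixel depends on the reconstructed pixel to its left), so autovectorization happens across the channels within one pixel, where the work is straight-line `u8` arithmetic over a fixed-width array. A minimal sketch of such a loop, with hypothetical names (`unfilter_paeth3`), not the crate's actual code:

```rust
// Optimized Paeth predictor (same formulation as the snippet in this PR).
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    let pa = b.max(c) - c.min(b);
    let pb = a.max(c) - c.min(a);
    let pc = if (a < c) == (c < b) {
        pa.max(pb) - pa.min(pb)
    } else {
        255
    };
    if pa <= pb && pa <= pc { a } else if pb <= pc { b } else { c }
}

/// Hypothetical sketch: undo Paeth filtering for one row of 3-byte pixels,
/// in place. `prev` is the already-unfiltered previous row; for the first
/// row of an image, pass a zeroed slice.
fn unfilter_paeth3(row: &mut [u8], prev: &[u8]) {
    let mut a = [0u8; 3]; // reconstructed pixel to the left
    let mut c = [0u8; 3]; // pixel above-left
    for (cur, up) in row.chunks_exact_mut(3).zip(prev.chunks_exact(3)) {
        // The outer pixel loop is serial, but this fixed-width inner loop
        // over the channels is branch-free u8 math the compiler can turn
        // into SIMD lanes.
        for i in 0..3 {
            cur[i] = cur[i].wrapping_add(filter_paeth(a[i], up[i], c[i]));
        }
        a.copy_from_slice(cur);
        c.copy_from_slice(up);
    }
}

fn main() {
    // First pixel predicts from zeros (predictor picks 0), so it stays
    // as-is; the second pixel predicts from the first.
    let mut row = vec![10u8, 20, 30, 30, 30, 30];
    let prev = [0u8; 6];
    unfilter_paeth3(&mut row, &prev);
    assert_eq!(row, vec![10, 20, 30, 40, 50, 60]);
}
```

Whether a loop like this actually vectorizes depends on the compiler version and target, which is exactly the fragility discussed above; checking the emitted assembly on Godbolt is the practical way to verify it.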