Performance improvements for benchmarks #115

GabrielMajeri · 2018-09-01T10:13:31Z

This pull request contains improvements to the performance of the example crates in this repository.

fannkuch_redux

I noticed a small performance reduction because LLVM was emitting memory accesses to the self.odd field, insted of storing it in a register.
Not sure why this happens, LLVM should know we borrow self mutably & have exclusive access to self and be able to cache its value locally.
Overall, about 1.5% better performance.

stencil

See issue #95 for the full discussion.

Now we're using mul_adde (FMA) where possible.
Improved performance of Data::reinit by using unsafe memory accesses. Not part of the benchmark, but it was slow and showing up a lot in the profiler.
The parallel code is slower than the SIMD code now:

simd: 175 ms
simd+par: 279 ms
ispc: 176 ms
ispc+tasks: 373 ms

I guess that's... good?

Failing tests

I've updated the testing code to use approximative float comparisons, with an error no larger than 10^-4.

aobench

Added mul_adde to most places, matching ISPC's assembly.
Inlined some functions which weren't getting inlined before.
Overall, when using the system math library (instead of comparing against ISPC's built-in one) Rust performance is better than with ISPC.

gnzlbg · 2018-09-03T13:58:15Z

I think this can be updated to use the new fmae function right?

GabrielMajeri · 2018-09-03T14:28:43Z

@gnzlbg I was going to wait for #123 to land so I can use mul_adde .

I'm not planning to get this merged however, not until we manage to beat ISPC in aobench at least.

LLVM would otherwise generate a lot of memory accesses.

While it's not part of the benchmark per se, it was still slow and showing up too much in the profiler.

GabrielMajeri · 2018-09-04T08:32:06Z

I'd decided to compare oranges to oranges and use libm for both ISPC and Rust, to avoid comparing using libsleef vs. ISPC's inlined math functions.

The results are in:

Rust: 4170 ms
ISPC: 4532 ms

So Rust is faster than ISPC. I don't know if it's due to a lack of inlining (#126), or some issue with sleef, but it's clear there's not much more we can do to optimize the Rust code itself. We're only limited by the speed of math functions.

examples/aobench/src/vector.rs

examples/stencil/src/lib.rs

gnzlbg · 2018-09-04T08:48:31Z

examples/stencil/src/simd.rs

@@ -158,7 +170,8 @@ pub fn x8(
 ) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    unsafe {
-        if is_x86_feature_detected!("avx2") {
+        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
+        {
            #[rustfmt::skip]
            x8_impl_avx2(t0, t1, x0, x1, y0, y1, z0, z1, n_x, n_y, n_z,


Same here for avx +- fma

gnzlbg · 2018-09-04T08:50:36Z

Note that ISPC inline math functions are not as precise as sleef, libm, SVML, etc.

Let me know when you finish the nitpicks and this is green so that we can merge this. I think it looks really good. I'll run the benchmarks on some systems and update the readme's afterwards (you can also add the results for your own systems in a follow up PR if you want).

gnzlbg · 2018-09-04T08:50:46Z

Also: THANKS for doing this!

Also enables FMA for the tiled implementation

GabrielMajeri · 2018-09-04T09:35:50Z

Alright, just fixed a little big issue with the RNG: those functions were not marked as #[inline(always)], so the target_feature attribute would end up not applying to them, so we would end up with AVX2 code calling into a function built with SSE2 instructions.

I think I've addressed most of the issues I could, some final bechmark results here (with libm, in order from slowest to fastest):

ispc: 4542 ms
vector: 4355 ms
tiled: 4117 ms

vector_par:  1233 ms
ispc tasks: 1227 ms
tiled_par: 1170 ms

Side note: would it be fair to modify build.rs to force ISPC to always use the system's math library? After all, this repository is not about benchmarking's ISPC fast math library

gnzlbg · 2018-09-04T10:23:37Z

Alright, just fixed a little big issue with the RNG:

Damn, this is a know problem, and we have no lints against this yet :(

Side note: would it be fair to modify build.rs to force ISPC to always use the system's math library?

I'd prefer the default to be ispc's default, but we can add cargo features (e.g. ispc_svml, ispc_libm, ispc_builtin_math...) that select a different math library in the build.rs.

I'd prefer if you do this in a different PR, because it applies to many benchmarks and I'd like to merge this ;)

GabrielMajeri · 2018-09-04T11:10:11Z

@gnzlbg Clippy failed, but I don't think it's related to this PR.

I'd prefer if you do this in a different PR, because it applies to many benchmarks and I'd like to merge this ;)

Alright, I'll see if I can't get cross-language inlining to work, maybe we can use sleef to beat ISPC's math library. Are you on IRC ?

gnzlbg · 2018-09-04T11:31:15Z

Clippy failed, but I don't think it's related to this PR.

I'll fix that :)

Alright, I'll see if I can't get cross-language inlining to work, maybe we can use sleef to beat ISPC's math library.

I have a branch where I started doing this the other day. I can upload it if you want to pick it up from there.

gnzlbg · 2018-09-04T12:51:12Z

@GabrielMajeri i'm gnzlbg on IRC

GabrielMajeri force-pushed the perf branch 3 times, most recently from 77d2688 to 590a9e4 Compare September 3, 2018 14:27

GabrielMajeri force-pushed the perf branch 2 times, most recently from e822d87 to 77bfb73 Compare September 4, 2018 08:22

GabrielMajeri added 6 commits September 4, 2018 11:22

Cache odd as a local variable

bfc8381

LLVM would otherwise generate a lot of memory accesses.

Improve speed of Data::reinit

1954c88

While it's not part of the benchmark per se, it was still slow and showing up too much in the profiler.

Use FMA explicitly in stencil example

3499dff

Use approximate comparisons for floats

c9c90e9

Use FMA and rsqrt in more places

a1c178a

Inline the rng functions

bfd4f2a

GabrielMajeri force-pushed the perf branch from 77bfb73 to bfd4f2a Compare September 4, 2018 08:24

GabrielMajeri changed the title ~~[WIP] Performance improvements for benchmarks~~ Performance improvements for benchmarks Sep 4, 2018

gnzlbg reviewed Sep 4, 2018

View reviewed changes

examples/aobench/src/vector.rs Show resolved Hide resolved

gnzlbg reviewed Sep 4, 2018

View reviewed changes

examples/stencil/src/lib.rs Show resolved Hide resolved

gnzlbg reviewed Sep 4, 2018

View reviewed changes

GabrielMajeri added 3 commits September 4, 2018 12:10

Add a AVX1 + FMA specialization

99e8283

Also enables FMA for the tiled implementation

Use fuzzy float comparisons in aobench tests

57c1426

Ensure random generator functions get inlined

82a1174

gnzlbg merged commit a1406b1 into rust-lang:master Sep 4, 2018

GabrielMajeri deleted the perf branch September 4, 2018 11:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements for benchmarks #115

Performance improvements for benchmarks #115

GabrielMajeri commented Sep 1, 2018 •

edited

Loading

gnzlbg commented Sep 3, 2018

GabrielMajeri commented Sep 3, 2018 •

edited

Loading

GabrielMajeri commented Sep 4, 2018 •

edited

Loading

gnzlbg Sep 4, 2018

gnzlbg commented Sep 4, 2018

gnzlbg commented Sep 4, 2018

GabrielMajeri commented Sep 4, 2018 •

edited

Loading

gnzlbg commented Sep 4, 2018

GabrielMajeri commented Sep 4, 2018 •

edited by gnzlbg

Loading

gnzlbg commented Sep 4, 2018

gnzlbg commented Sep 4, 2018

Performance improvements for benchmarks #115

Performance improvements for benchmarks #115

Conversation

GabrielMajeri commented Sep 1, 2018 • edited Loading

fannkuch_redux

stencil

Failing tests

aobench

gnzlbg commented Sep 3, 2018

GabrielMajeri commented Sep 3, 2018 • edited Loading

GabrielMajeri commented Sep 4, 2018 • edited Loading

gnzlbg Sep 4, 2018

Choose a reason for hiding this comment

gnzlbg commented Sep 4, 2018

gnzlbg commented Sep 4, 2018

GabrielMajeri commented Sep 4, 2018 • edited Loading

gnzlbg commented Sep 4, 2018

GabrielMajeri commented Sep 4, 2018 • edited by gnzlbg Loading

gnzlbg commented Sep 4, 2018

gnzlbg commented Sep 4, 2018

GabrielMajeri commented Sep 1, 2018 •

edited

Loading

GabrielMajeri commented Sep 3, 2018 •

edited

Loading

GabrielMajeri commented Sep 4, 2018 •

edited

Loading

GabrielMajeri commented Sep 4, 2018 •

edited

Loading

GabrielMajeri commented Sep 4, 2018 •

edited by gnzlbg

Loading