Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements for benchmarks #115

Merged
merged 9 commits into from
Sep 4, 2018
Merged

Conversation

GabrielMajeri
Copy link
Contributor

@GabrielMajeri GabrielMajeri commented Sep 1, 2018

This pull request contains improvements to the performance of the example crates in this repository.

fannkuch_redux

  • I noticed a small performance reduction because LLVM was emitting memory accesses to the self.odd field, insted of storing it in a register.
    Not sure why this happens, LLVM should know we borrow self mutably & have exclusive access to self and be able to cache its value locally.

  • Overall, about 1.5% better performance.

stencil

See issue #95 for the full discussion.

  • Now we're using mul_adde (FMA) where possible.
  • Improved performance of Data::reinit by using unsafe memory accesses. Not part of the benchmark, but it was slow and showing up a lot in the profiler.
  • The parallel code is slower than the SIMD code now:
simd: 175 ms
simd+par: 279 ms
ispc: 176 ms
ispc+tasks: 373 ms

I guess that's... good?

Failing tests

  • I've updated the testing code to use approximative float comparisons, with an error no larger than 10^-4.

aobench

  • Added mul_adde to most places, matching ISPC's assembly.
  • Inlined some functions which weren't getting inlined before.
  • Overall, when using the system math library (instead of comparing against ISPC's built-in one) Rust performance is better than with ISPC.

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 3, 2018

I think this can be updated to use the new fmae function right?

@GabrielMajeri GabrielMajeri force-pushed the perf branch 3 times, most recently from 77d2688 to 590a9e4 Compare September 3, 2018 14:27
@GabrielMajeri
Copy link
Contributor Author

GabrielMajeri commented Sep 3, 2018

@gnzlbg I was going to wait for #123 to land so I can use mul_adde .

I'm not planning to get this merged however, not until we manage to beat ISPC in aobench at least.

@GabrielMajeri GabrielMajeri force-pushed the perf branch 2 times, most recently from e822d87 to 77bfb73 Compare September 4, 2018 08:22
LLVM would otherwise generate a lot of memory accesses.
While it's not part of the benchmark per se, it was still slow
and showing up too much in the profiler.
@GabrielMajeri GabrielMajeri changed the title [WIP] Performance improvements for benchmarks Performance improvements for benchmarks Sep 4, 2018
@GabrielMajeri
Copy link
Contributor Author

GabrielMajeri commented Sep 4, 2018

I'd decided to compare oranges to oranges and use libm for both ISPC and Rust, to avoid comparing using libsleef vs. ISPC's inlined math functions.

The results are in:

Rust: 4170 ms
ISPC: 4532 ms

So Rust is faster than ISPC. I don't know if it's due to a lack of inlining (#126), or some issue with sleef, but it's clear there's not much more we can do to optimize the Rust code itself. We're only limited by the speed of math functions.

@@ -158,7 +170,8 @@ pub fn x8(
) {
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
unsafe {
if is_x86_feature_detected!("avx2") {
if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
{
#[rustfmt::skip]
x8_impl_avx2(t0, t1, x0, x1, y0, y1, z0, z1, n_x, n_y, n_z,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here for avx +- fma

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 4, 2018

Note that ISPC inline math functions are not as precise as sleef, libm, SVML, etc.

Let me know when you finish the nitpicks and this is green so that we can merge this. I think it looks really good. I'll run the benchmarks on some systems and update the readme's afterwards (you can also add the results for your own systems in a follow up PR if you want).

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 4, 2018

Also: THANKS for doing this!

@GabrielMajeri
Copy link
Contributor Author

GabrielMajeri commented Sep 4, 2018

Alright, just fixed a little big issue with the RNG: those functions were not marked as #[inline(always)], so the target_feature attribute would end up not applying to them, so we would end up with AVX2 code calling into a function built with SSE2 instructions.


I think I've addressed most of the issues I could, some final bechmark results here (with libm, in order from slowest to fastest):

ispc: 4542 ms
vector: 4355 ms
tiled: 4117 ms

vector_par:  1233 ms
ispc tasks: 1227 ms
tiled_par: 1170 ms

Side note: would it be fair to modify build.rs to force ISPC to always use the system's math library? After all, this repository is not about benchmarking's ISPC fast math library

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 4, 2018

Alright, just fixed a little big issue with the RNG:

Damn, this is a know problem, and we have no lints against this yet :(

Side note: would it be fair to modify build.rs to force ISPC to always use the system's math library?

I'd prefer the default to be ispc's default, but we can add cargo features (e.g. ispc_svml, ispc_libm, ispc_builtin_math...) that select a different math library in the build.rs.

I'd prefer if you do this in a different PR, because it applies to many benchmarks and I'd like to merge this ;)

@GabrielMajeri
Copy link
Contributor Author

GabrielMajeri commented Sep 4, 2018

@gnzlbg Clippy failed, but I don't think it's related to this PR.

I'd prefer if you do this in a different PR, because it applies to many benchmarks and I'd like to merge this ;)

Alright, I'll see if I can't get cross-language inlining to work, maybe we can use sleef to beat ISPC's math library. Are you on IRC ?

@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 4, 2018

Clippy failed, but I don't think it's related to this PR.

I'll fix that :)

Alright, I'll see if I can't get cross-language inlining to work, maybe we can use sleef to beat ISPC's math library.

I have a branch where I started doing this the other day. I can upload it if you want to pick it up from there.

@gnzlbg gnzlbg merged commit a1406b1 into rust-lang:master Sep 4, 2018
@GabrielMajeri GabrielMajeri deleted the perf branch September 4, 2018 11:34
@gnzlbg
Copy link
Contributor

gnzlbg commented Sep 4, 2018

@GabrielMajeri i'm gnzlbg on IRC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants