Performance improvements for benchmarks #115
I think this can be updated to use the new |
LLVM would otherwise generate a lot of memory accesses.
While it's not part of the benchmark per se, it was still slow and showing up too much in the profiler.
I decided to compare oranges to oranges and use
The results are in:
Rust: 4170 ms
So Rust is faster than ISPC. I don't know if it's due to a lack of inlining (#126), or some issue with sleef, but it's clear there's not much more we can do to optimize the Rust code itself. We're only limited by the speed of math functions.
@@ -158,7 +170,8 @@ pub fn x8(
 ) {
     #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
     unsafe {
-        if is_x86_feature_detected!("avx2") {
+        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
+        {
             #[rustfmt::skip]
             x8_impl_avx2(t0, t1, x0, x1, y0, y1, z0, z1, n_x, n_y, n_z,
Same here for avx +- fma
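For readers following the review, here is a minimal sketch of the pattern under discussion: checking both `avx2` and `fma` at run time before calling a function compiled with those features enabled. The function names (`mul_add_slices` and its two helpers) are hypothetical and are not taken from the PR; only `is_x86_feature_detected!` and `#[target_feature]` are real language/std facilities.

```rust
/// Portable fallback: c[i] = a[i] * b[i] + c[i].
/// Iteration stops at the shortest of the three slices.
fn mul_add_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((x, y), z) in a.iter().zip(b).zip(c.iter_mut()) {
        *z = x.mul_add(*y, *z);
    }
}

/// Same loop, but compiled with AVX2 and FMA enabled so the compiler
/// is allowed to emit those instructions here.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2,fma")]
unsafe fn mul_add_avx2_fma(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((x, y), z) in a.iter().zip(b).zip(c.iter_mut()) {
        *z = x.mul_add(*y, *z);
    }
}

/// Hypothetical entry point: dispatches to the AVX2+FMA version only
/// when the CPU reports both features.
pub fn mul_add_slices(a: &[f32], b: &[f32], c: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both required features were just detected at run time.
            unsafe { mul_add_avx2_fma(a, b, c) };
            return;
        }
    }
    mul_add_scalar(a, b, c);
}
```

Checking only `avx2` would not be enough if the callee relies on FMA instructions, since the two are reported as separate CPU features; that appears to be the point of the change in the diff.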
Note that ISPC inline math functions are not as precise as sleef, libm, SVML, etc.
Let me know when you finish the nitpicks and this is green so that we can merge this. I think it looks really good. I'll run the benchmarks on some systems and update the READMEs afterwards (you can also add the results for your own systems in a follow-up PR if you want).
Also: THANKS for doing this!
Also enables FMA for the tiled implementation
Alright, just fixed a little big issue with the RNG: those functions were not marked as
I think I've addressed most of the issues I could, some final benchmark results here (with
Side note: would it be fair to modify
Damn, this is a known problem, and we have no lints against this yet :(
I'd prefer the default to be ISPC's default, but we can add cargo features (e.g.
I'd prefer if you do this in a different PR, because it applies to many benchmarks and I'd like to merge this ;)
@gnzlbg Clippy failed, but I don't think it's related to this PR.
Alright, I'll see if I can't get cross-language inlining to work, maybe we can use sleef to beat ISPC's math library. Are you on IRC?
I'll fix that :)
I have a branch where I started doing this the other day. I can upload it if you want to pick it up from there.
@GabrielMajeri I'm gnzlbg on IRC
This pull request contains improvements to the performance of the example crates in this repository.
fannkuch_redux
I noticed a small performance reduction because LLVM was emitting memory accesses to the self.odd field, instead of storing it in a register. Not sure why this happens: LLVM should know we borrow self mutably and have exclusive access to self, and be able to cache its value locally.
Overall, about 1.5% better performance.
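A minimal sketch of the kind of workaround described here, assuming the fix is to copy the field into a local for the duration of the hot loop and write it back afterwards. The `State` type, its `perm` field and the loop body are hypothetical; only the field name `odd` comes from the description.

```rust
/// Hypothetical stand-in for the benchmark state.
pub struct State {
    pub odd: bool,
    pub perm: Vec<u32>,
}

impl State {
    pub fn new(n: u32) -> Self {
        State { odd: false, perm: (0..n).collect() }
    }

    /// Runs `steps` iterations, toggling the parity flag each time.
    /// The flag lives in a local variable inside the loop, so the
    /// optimizer can keep it in a register instead of reloading
    /// `self.odd` from memory on every iteration.
    pub fn advance(&mut self, steps: usize) -> u32 {
        if self.perm.is_empty() {
            return 0;
        }
        let mut odd = self.odd; // cache the field locally
        let mut acc = 0u32;
        for i in 0..steps {
            let idx = i % self.perm.len();
            if odd {
                self.perm[idx] = self.perm[idx].wrapping_add(1);
            }
            acc = acc.wrapping_add(self.perm[idx]);
            odd = !odd;
        }
        self.odd = odd; // write the cached value back once
        acc
    }
}
```

Whether this makes a measurable difference depends on the compiler version and the surrounding code, so it is worth confirming in a profiler, as was done for this PR.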
stencil
See issue #95 for the full discussion.
Use mul_adde (FMA) where possible.
Optimized Data::reinit by using unsafe memory accesses. Not part of the benchmark, but it was slow and showing up a lot in the profiler.
I guess that's... good?
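As an illustration of what "unsafe memory accesses" can look like in an initialization routine, here is a sketch that skips per-element bounds checks with `get_unchecked_mut` after validating the length once up front. This is an assumption about the general technique, not the PR's actual `Data::reinit`; the function signature and the fill pattern are hypothetical.

```rust
/// Hypothetical re-initialization of a row-major `n_x * n_y` grid.
/// The length is checked once, so the inner loop can index without
/// bounds checks.
pub fn reinit(data: &mut [f32], n_x: usize, n_y: usize) {
    // Validate the shape up front; `checked_mul` guards against overflow.
    assert_eq!(Some(data.len()), n_x.checked_mul(n_y));
    for y in 0..n_y {
        for x in 0..n_x {
            let i = y * n_x + x;
            // SAFETY: x < n_x and y < n_y, so i < n_x * n_y == data.len().
            unsafe {
                *data.get_unchecked_mut(i) = (x + y) as f32;
            }
        }
    }
}
```

The single assertion keeps the function sound for any caller while leaving the hot loop free of checks.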
Failing tests
aobench
Added mul_adde to most places, matching ISPC's assembly.
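To show what a fused multiply-add changes numerically, here is a small sketch using the standard library's scalar `f32::mul_add` as a stand-in. It is not the vector `mul_adde` call used in the benchmarks, whose precision guarantees may differ, and the helper names are hypothetical.

```rust
/// Dot product accumulated with fused multiply-adds:
/// each step computes x * y + acc with a single rounding.
pub fn dot_fused(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0, |acc, (x, y)| x.mul_add(*y, acc))
}

/// Separate multiply and add: the product is rounded before the add.
pub fn mul_then_add(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Fused form: a * b + c is rounded only once.
pub fn fused(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```

Because the fused form rounds once instead of twice, the two variants can disagree in the last bits. Matching ISPC's use of FMA therefore affects results slightly as well as speed.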