Make ptr_rotate faster #61784
Comments
One thing I suggest including that I didn't see in your benchmark list: Anything that couldn't do bigger chunks ended up with bad performance rotating those last time I was experimenting, presumably because it needed 6 byte-level memory operations per move.
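A minimal sketch of the kind of case this comment seems to describe: rotating elements whose size (3 bytes here) cannot be moved as one machine word. The `Rgb` alias and the `rotate_rgb` helper are hypothetical names chosen for illustration and do not come from the benchmark repository; only `slice::rotate_left` is the real standard-library call.

```rust
/// Hypothetical 3-byte element, standing in for an RGB pixel.
type Rgb = [u8; 3];

/// Builds `len` distinct pixels and rotates them left by `k` positions.
/// Panics if `k > len`, as `slice::rotate_left` does.
fn rotate_rgb(len: usize, k: usize) -> Vec<Rgb> {
    let mut pixels: Vec<Rgb> = (0..len)
        .map(|i| [i as u8, (i >> 8) as u8, (i >> 16) as u8])
        .collect();
    pixels.rotate_left(k);
    pixels
}

fn main() {
    let rotated = rotate_rgb(1031, 149);
    // After a left rotation by 149, the pixel that started at index 149 is first.
    assert_eq!(rotated[0], [149, 0, 0]);
    println!("size_of::<Rgb>() = {} bytes", std::mem::size_of::<Rgb>());
}
```

Wrapping a call like this in a benchmark harness would exercise the odd-sized-element path; the length and shift used in `main` are arbitrary.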
I renamed some benchmarks and added the RGB test. The new algorithm performs better on Intel at least. I just remembered that the AMD CPU is extremely inconsistent with microbenchmarks, with the measurements swinging wildly. I think I have done all I can do, unless there is some SIMD trickery I can apply. My algorithm 1 is still better than using algorithm 2 for I copied the
I see you are using Windows, which doesn't handle high core counts well. It got better with Windows 10 1903, but there is still room to improve. I can run benchmarks on Linux with a 2nd-gen Ryzen if you want.
Yes, thank you. Edit: don't benchmark it yet; I need to fix some problems I found.
OK, I pushed a few benchmarking fixes to the repository. The benches no longer clone, and the performance difference is much more apparent.
What I plan on doing is replacing the old
Does this sound good?
Here are the results for nightly 2019-06-13 on a box with a 2700X running Ubuntu 19.04: https://gist.github.com/mati865/c8954a2c03ca7a10a71ebb54a229ae34
The size of the slice appears to have a significant effect on performance, even when the algorithm takes the same branches. For example, running
gives
These values change when I repeat the benchmarks, but the simple change of 1031 -> 1032 consistently makes my algorithm faster and consistently makes the old algorithm slower. I'm not sure what is causing this, even when inserting
Would this have something to do with cache lines?
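One way to probe the cache-line question is to time the same rotation at neighbouring lengths and print how many cache lines each buffer spans. The sketch below is not the benchmark from the repository: the `time_rotations` helper, the `usize` element type, the shift of 149, the iteration count, and the 64-byte line size are all assumptions made for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Assumed cache-line size in bytes; the real value depends on the CPU.
const ASSUMED_LINE_BYTES: usize = 64;

/// Rotates a `len`-element buffer left by `k`, `iters` times, and returns
/// the elapsed time together with the final buffer contents.
fn time_rotations(len: usize, k: usize, iters: u32) -> (Duration, Vec<usize>) {
    let mut buf: Vec<usize> = (0..len).collect();
    let start = Instant::now();
    for _ in 0..iters {
        buf.rotate_left(k);
        // Keep the optimizer from discarding the rotation.
        let _ = black_box(&mut buf);
    }
    (start.elapsed(), buf)
}

fn main() {
    for &len in &[1031usize, 1032] {
        let bytes = len * std::mem::size_of::<usize>();
        // Lines spanned if the buffer happened to start on a line boundary.
        let lines = (bytes + ASSUMED_LINE_BYTES - 1) / ASSUMED_LINE_BYTES;
        let (elapsed, _) = time_rotations(len, 149, 10_000);
        println!("len {}: {} bytes, ~{} lines, {:?}", len, bytes, lines, elapsed);
    }
}
```

A wall-clock loop like this is much noisier than a proper benchmark harness, so it can only hint at whether the buffer size relative to the line size matters.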
Improve `ptr_rotate` performance, tests, and benches The corresponding issue is rust-lang#61784. I am not actually sure if miri can handle the test, but I can change the commit if necessary.
It was finally merged. I fixed some alignment problems and added a modification to the algorithm 1 branch.
I was going to directly PR the performance improvement, but since this is my first time writing a bunch of unsafe code, and because `x.py` will not work for me (#61611), I decided to open an issue and point to this repository filled with all the benchmarks I wrote.

I tried over 20 different variations on algorithms before settling on this one.

The only thing I am concerned about is that I am directly using `T` values in algorithm 1 instead of indirect copies and swaps, which the microbenchmarks did not like. I am pretty sure that it is panic safe, since there is no `T: Clone` anywhere.

I benchmarked on both Intel and AMD CPUs and found that my algorithm performed significantly worse only on the `x1031s149_new` benchmark (probably because 1031 and 149 are both primes) and, on my AMD CPU, on the `u8_new` benchmark. I do not know what is causing `u8_new` to be much worse with the new algorithm on my AMD CPU, or whether the problem persists on newer AMD hardware. Every other benchmark indicates that my algorithm is as fast as or faster than the old one (note that because there are so many benchmarks, there may be outliers which you should benchmark again).

Of course, when I do the PR I am not going to include all of those benchmarks, but I am not sure what subset to include.
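To illustrate why a prime length and shift could be a hard case, here is a minimal safe-Rust sketch of a cycle-following rotation. It is not the code from the repository or the PR: `rotate_left_cycles` and `gcd` are hypothetical helpers, and the `T: Copy` bound is a simplification to avoid the raw-pointer reads the real implementation would need. In this scheme the number of index cycles equals gcd(len, k), so 1031 and 149 give one cycle that strides across the whole slice.

```rust
/// Greatest common divisor by Euclid's algorithm.
fn gcd(mut a: usize, mut b: usize) -> usize {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Rotates `v` left by `k` by following index cycles.
/// Returns the number of cycles walked, which is gcd(len, k).
fn rotate_left_cycles<T: Copy>(v: &mut [T], k: usize) -> usize {
    let n = v.len();
    if n == 0 {
        return 0;
    }
    let k = k % n;
    if k == 0 {
        return 0; // nothing to move
    }
    let cycles = gcd(n, k);
    for start in 0..cycles {
        let saved = v[start];
        let mut i = start;
        loop {
            let next = (i + k) % n;
            if next == start {
                break;
            }
            v[i] = v[next]; // pull the element that belongs at index i
            i = next;
        }
        v[i] = saved; // close the cycle
    }
    cycles
}

fn main() {
    let mut v: Vec<u32> = (0..1031).collect();
    let mut expected = v.clone();
    expected.rotate_left(149);
    let cycles = rotate_left_cycles(&mut v, 149);
    assert_eq!(v, expected);
    println!("cycles for len 1031, shift 149: {}", cycles);
}
```

With a single long cycle, consecutive accesses are 149 elements apart, which gives little spatial locality; whether that explains the `x1031s149_new` result is only a guess.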