Make `ptr_rotate` faster #61784

AaronKutch · 2019-06-12T20:33:14Z

I was going to directly PR the performance improvement, but since this is my first time writing a bunch of unsafe code, and because x.py will not work for me (#61611), I decided to open an issue and point to this repository filled with all the benchmarks I wrote.

I tried over 20 different variations on algorithms before settling on this one.
The only thing I am concerned with is that I am directly using T values in algorithm 1 instead of indirect copies and swaps, which the microbenchmarks did not like. I am pretty sure that it is panic safe, since there is no T: clone anywhere.

I benchmarked on both Intel and AMD CPUs, and found that only on the x1031s149_new benchmark did my algorithm perform significantly worse (probably due to the fact that 1031 and 149 are both primes), and the u8_new benchmark on my AMD CPU. I do not know what is causing the u8_new to be much worse on the new algorithm on my AMD CPU, and if the problem persists on newer AMD hardware. Every other benchmark indicates that my algorithm is as fast or faster than the old one (note that because there are so many benchmarks, there may be outliers which you should benchmark again).

Of course, when I do the PR I am not going to include all of those benchmarks, but I am not sure what subset to include.

The text was updated successfully, but these errors were encountered:

scottmcm · 2019-06-13T01:55:08Z

One thing I suggest including that I didn't see in your benchmark list: struct RGB { r: u8, g: u8, b: u8 }.

Anything that couldn't do bigger chunks ended up with bad performance rotating those last time I was experimenting, presumably because it needed 6 byte-level memory operations per move.

AaronKutch · 2019-06-13T02:37:32Z

I renamed some benchmarks and added the RGB test. The new algorithm performs better on Intel at least. I just remembered that the AMD CPU is extremely inconsistent with microbenchmarks, with the measurements swinging wildly. I think I have done all I can do, unless there is some SIMD trickery I can apply. My algorithm 1 is still better than using algorithm 2 for u8.

I copied the [usize; 32] buffer concept from the old algorithm, and the old algorithm claims it is a stack buffer. ~~Is it actually a stack buffer, because it looks to me like it is a heap buffer with the fact that it has MaybeUninit and pointers obscuring that?~~ edit: I was just confused about there being a pointer to something on the stack. Is there something I can do to speed it up?

mati865 · 2019-06-13T09:38:48Z

I do not know what is causing the u8_new to be much worse on the new algorithm on my AMD CPU, and if the problem persists on newer AMD hardware.

I see you are using Windows which doesn't handle high core counts well. It got better with Windows 10 1903 but there is still room to improve.

I can run benchmarks on Linux with Ryzen 2nd gen if you want.

AaronKutch · 2019-06-13T16:45:30Z

Yes thank you. Edit: don't benchmark it yet, I need to fix some problems I found

AaronKutch · 2019-06-13T22:28:15Z

Ok, I pushed a few benchmarking fixes to the repository. The benches no longer clone and the performance difference is much more apparent.

AaronKutch · 2019-06-13T22:34:27Z

What I plan on doing is replacing the old ptr_rotate implementation with the new one, adding the test commented in the repository to libcore/tests/slice.rs, and adding the following to libcore/benches/slice.rs:

macro_rules! rotate {
    ($fn:ident, $n:expr, $mapper:expr) => {
        #[bench]
        fn $fn(b: &mut Bencher) {
            let mut x = (0usize..$n).map(&$mapper).collect::<Vec<_>>();
            b.iter(|| {
                for s in 0..x.len() {
                    x[..].rotate_right(s);
                }
                black_box(x[0].clone())
            })
        }
    }
}

#[derive(Clone)]
struct Rgb(u8,u8,u8);

rotate!(rotate_u8, 32, |i| i as u8);
rotate!(rotate_rgb, 32, |i| Rgb(i as u8, (i as u8).wrapping_add(7), (i as u8).wrapping_add(42)));
rotate!(rotate_usize, 32, |i| i);
rotate!(rotate_usize_4, 32, |i| [i; 4]);
rotate!(rotate_usize_32, 32, |i| [i; 32]);
rotate!(rotate_usize_33, 32, |i| [i; 33]);

Does this sound good?

mati865 · 2019-06-14T09:01:12Z

Here are results of nightly 2019-06-13 on box with 2700X running Ubuntu 19.04: https://gist.github.com/mati865/c8954a2c03ca7a10a71ebb54a229ae34
Glibc is well optimized so the perf looks good but within margin of error, on less optimized musl new functions are significant win.

AaronKutch · 2019-06-14T15:56:17Z

The size of the slice appears to have a significant effect on performance, even when the algorithm takes the same branches. For example, running

    rotate_bench!(x1031s149_old, x1031s149_new, 1031, 149); //rotate a slice of size 1031 by 149 to the right
    rotate_bench!(x1032s149_old, x1032s149_new, 1032, 149); //rotate a slice of size 1032 by 149 to the right

gives

test benches::x1031s149_new ... bench:         487 ns/iter (+/- 19)
test benches::x1031s149_old ... bench:         409 ns/iter (+/- 31)
test benches::x1032s149_new ... bench:         450 ns/iter (+/- 15)
test benches::x1032s149_old ... bench:         526 ns/iter (+/- 119)

These values change when I repeat the benchmarks, but the simple change of 1031 -> 1032 consistently improves for my algorithm, and consistently is worse for the old algorithm. I'm not sure what is causing this, even when inserting dbg! statements for my algorithm, it looks like this:

1031
[src\main.rs:152] "3.0" = "3.0" //algorithm 3
[src\main.rs:152] left = 882
[src\main.rs:152] right = 149
[src\main.rs:166] "3.1" = "3.1" //algorithm 3 but left < right
[src\main.rs:166] left = 137
[src\main.rs:166] right = 149
[src\main.rs:136] "2" = "2" //algorithm 2
[src\main.rs:136] left = 137
[src\main.rs:136] right = 12
1032
[src\main.rs:152] "3.0" = "3.0"
[src\main.rs:152] left = 883
[src\main.rs:152] right = 149
[src\main.rs:166] "3.1" = "3.1"
[src\main.rs:166] left = 138
[src\main.rs:166] right = 149
[src\main.rs:136] "2" = "2"
[src\main.rs:136] left = 138
[src\main.rs:136] right = 11

AaronKutch · 2019-06-14T20:59:37Z

Would this have something to do with cache lines?

Improve `ptr_rotate` performance, tests, and benches The corresponding issue is rust-lang#61784. I am not actually sure if miri can handle the test, but I can change the commit if necessary.

Improve `ptr_rotate` performance, tests, and benches The corresponding issue is #61784. I am not actually sure if miri can handle the test, but I can change the commit if necessary.

AaronKutch · 2019-08-09T16:47:51Z

It was finally merged. I fixed some alignment problems and added a modification to the algorithm 1 branch.

Centril added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jun 12, 2019

AaronKutch mentioned this issue Jun 18, 2019

Improve ptr_rotate performance, tests, and benches #61937

Merged

AaronKutch closed this as completed Aug 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `ptr_rotate` faster #61784

Make `ptr_rotate` faster #61784

AaronKutch commented Jun 12, 2019 •

edited

Loading

scottmcm commented Jun 13, 2019

AaronKutch commented Jun 13, 2019 •

edited

Loading

mati865 commented Jun 13, 2019

AaronKutch commented Jun 13, 2019 •

edited

Loading

AaronKutch commented Jun 13, 2019

AaronKutch commented Jun 13, 2019

mati865 commented Jun 14, 2019

AaronKutch commented Jun 14, 2019 •

edited

Loading

AaronKutch commented Jun 14, 2019

AaronKutch commented Aug 9, 2019

Make ptr_rotate faster #61784

Make ptr_rotate faster #61784

Comments

AaronKutch commented Jun 12, 2019 • edited Loading

scottmcm commented Jun 13, 2019

AaronKutch commented Jun 13, 2019 • edited Loading

mati865 commented Jun 13, 2019

AaronKutch commented Jun 13, 2019 • edited Loading

AaronKutch commented Jun 13, 2019

AaronKutch commented Jun 13, 2019

mati865 commented Jun 14, 2019

AaronKutch commented Jun 14, 2019 • edited Loading

AaronKutch commented Jun 14, 2019

AaronKutch commented Aug 9, 2019

Make `ptr_rotate` faster #61784

Make `ptr_rotate` faster #61784

AaronKutch commented Jun 12, 2019 •

edited

Loading

AaronKutch commented Jun 13, 2019 •

edited

Loading

AaronKutch commented Jun 13, 2019 •

edited

Loading

AaronKutch commented Jun 14, 2019 •

edited

Loading