Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a faster stable sort algorithm #90545

Closed
wants to merge 3 commits into from

Conversation

wpwoodjr
Copy link

@wpwoodjr wpwoodjr commented Nov 3, 2021

Hi everyone, this is my first PR.

This PR replaces slice's TimSort with a stable two stage merge sort having pre-sorted prefix optimization. The total running time is O(n * log(n)) worst-case.

Highlights:

  • Improved performance, especially on random and pre-sorted inputs.
  • Improved performance on other patterns such as shuffle and stagger.
  • Performs fewer comparisons on random and most other patterns.

Benchmark:
The benchmark was run on six OS's / CPUs:

  • Ubuntu 20.04.3 LTS / Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  • WIN 10 KVM on Ubuntu / Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  • macOS Big Sur Version 11.6 / MacBook Pro (16-inch, 2019) 2.4 GHz 8-Core Intel Core i9
  • Debian Crostini Linux / Google PixelBook Intel(R) 7th Gen Core(TM) i7-7Y75
  • Debian Crostini Linux / Asus Chromebook Flip CX5 Intel(R) 11th Gen Core(TM) i5-1130G7
  • Debian Crostini Linux / HP Chromebook X2 11 Qualcomm(R) Snapdragon(TM) 7c

For each OS/CPU, the benchmark was run on vecs of i16, i32, i64, i128, and String. The vecs were of random lengths from 0 to 10,000,000.

The proposed two stage merge sort indicates a speedup of 13.1% for random unsorted data, with 95% of sort runs being faster. For random data, including unsorted and forward / reverse sorted variants, the results indicate a speedup of 20.2%, with 92% of sort runs being faster. Other data patterns are faster too, except for the sawtooth pattern, which is about the same. The number of comparisons performed is -2.92% less over all tests.

Note: The above benchmark figures have been corrected since my original post. The original post did not include sort results for vecs of length 0-100.

Full results and benchmark code are at wpwoodjr/rust-merge-sort

The new sort
While not really new, the proposed sort benefits from:

  • Top-down recursive depth-first merge, which helps data locality
  • Small slices sort using a fast insertion sort, then merge
  • Pre-sorted prefix optimization for forward and reverse sorted sub-slices

The new sort builds on existing code wherever possible.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 3, 2021
@wpwoodjr wpwoodjr changed the title Implement a faster sort stable algorithm Implement a faster stable sort algorithm Nov 3, 2021
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@kennytm kennytm added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Nov 4, 2021
@joshtriplett
Copy link
Member

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 4, 2021
@bors
Copy link
Contributor

bors commented Nov 4, 2021

⌛ Trying commit 858462c3317ec419fe2503d4aa3812ddb90b7079 with merge eef64e8caf17ac76641c91a2409cc93ba7e381b4...

library/alloc/src/slice.rs Show resolved Hide resolved
library/alloc/src/slice.rs Outdated Show resolved Hide resolved
library/alloc/src/slice.rs Show resolved Hide resolved
library/alloc/src/slice.rs Outdated Show resolved Hide resolved
library/alloc/src/slice.rs Show resolved Hide resolved
@bors
Copy link
Contributor

bors commented Nov 4, 2021

☀️ Try build successful - checks-actions
Build commit: eef64e8caf17ac76641c91a2409cc93ba7e381b4 (eef64e8caf17ac76641c91a2409cc93ba7e381b4)

@rust-timer
Copy link
Collaborator

Queued eef64e8caf17ac76641c91a2409cc93ba7e381b4 with parent 4061c04, future comparison URL.

library/alloc/src/slice.rs Outdated Show resolved Hide resolved
@sanmai-NL
Copy link

sanmai-NL commented Nov 4, 2021

Before even considering performance improvements, which I personally find very important, I do hope the Rust maintainers get convincing fuzz test results, and possibly apply AddressSanitizer and MemorySanitizer.

Furthermore, can the insertion sort be factored out or reused? Maybe precomputed sorting networks would be even better.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (eef64e8caf17ac76641c91a2409cc93ba7e381b4): comparison url.

Summary: This change led to small relevant mixed results 🤷 in compiler performance.

  • Small improvement in instruction counts (up to -0.9% on incr-full builds of html5ever)
  • Small regression in instruction counts (up to 1.4% on incr-patched: println builds of cargo)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Nov 4, 2021
@the8472
Copy link
Member

the8472 commented Nov 4, 2021

The git history contains merge commits, you'll have to rebase that.

https://rustc-dev-guide.rust-lang.org/contributing.html#opening-a-pr

@wpwoodjr
Copy link
Author

wpwoodjr commented Nov 4, 2021

Before even considering performance improvements, which I personally find very important, I do hope the Rust maintaincers get convincing fuzz test results, and possibly applying AddressSanitizer and MemorySanitizer.

The benchmarking code includes a fuzz test of sorts. Pass eq and the number of runs:

$ target/release/newsort eq --nruns 40
Std: Std, Newsort: Newsort
Range 10 to 10000
String array size = 1095696
Running sort equality test with 40 runs...
20 equality test runs completed
40 equality test runs completed; new to standard comparisons ratio: 0.9848

Each run sorts over 900 randomly sized arrays between 0 and 100,000 in length.

Furthermore, can the insertion sort be factored out or reused? Maybe precomputed sorting networks would be even better.

I started out with that idea too, but insertion sort was the fastest.

@wpwoodjr
Copy link
Author

wpwoodjr commented Nov 4, 2021

The git history contains merge commits, you'll have to rebase that.

https://rustc-dev-guide.rust-lang.org/contributing.html#opening-a-pr

Thanks. I'm not sure how to do that. I have my fork locally. I think I made a mistake by fetching upstream changes between my commits. How do I fix it?

@the8472
Copy link
Member

the8472 commented Nov 4, 2021

Rebasing automatically removes merge commits.

git remote add upstream git@github.com:rust-lang/rust.git # only if you haven't added it already
git fetch upstream
git rebase upstream/master # this will prompt you to resolve conflicts if needed
git push -f

Add -i to reorder, reword, edit or squash individual commits

git rebase -i upstream/master

@wpwoodjr wpwoodjr force-pushed the master branch 3 times, most recently from 59ee0dd to cf6d57d Compare November 11, 2021 15:18
@ghost
Copy link

ghost commented May 30, 2022

@wpwoodjr Thank you for the detailed answer! I just saw it now because I didn't get notifications. That sounds quite good so far. There is one thing that came to my mind recently:
Currently, the TimSort-based implementation doesn't implement Galloping in the merging operation. There is a new paper that shows that implementing Galloping won't have a too negative effect in general and will improve the performance in many cases.

At the end of the paper it says:

We also hope that this simpler framework will foster the interest for the galloping merging sub-routine of
TimSort, and possibly lead to amending Swift and Rust implementations of TimSort to include that sub-routine,
which we believe is too efficient in relevant cases to be omitted.

The PR author from the current stable sort also mentioned Galloping:

TimSort is another option. It allocates a buffer of size n/2, which is not great, but acceptable. Performs extremelly well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in Rust, C++, and D. The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones.

I have not implemented it so far, but compared the non-galloping current stable sort to stable in place symmerge (which uses binary searching and probably could use galloping, too). In many cases symmerge will outperform the non-galloping current stable sort by factor 2 to 3. I assume, that after the implementation of galloping for stable sort, symmerge will not be faster anymore and stable sort will be more than 3 times faster than the current implementation for inputs, where runs are merged that have different lengths (the bigger distance between the lengths, the better). Since you are using the same current merge function, it would be interesting to know, if the galloping search could benefit in your approach, too. And if not, if your approach is still better than the current TimSort with Galloping. I'm not too deep into the topic, so I can't say it for sure :(

I've also developed an hybrid approach to make current stable sort use constant memory instead of N/2 and still be fast. But this will also work with your code, so it's not important here.

@thomcc
Copy link
Member

thomcc commented May 30, 2022

CC @orlp, who also had some improvements to the stable sort code that he was interested in PRing.

@orlp
Copy link
Contributor

orlp commented May 30, 2022

who also had some improvements to the stable sort code that he was interested in PRing

@thomcc @tstj

Well, yesn't. I am implementing a complete new stable algorithm from scratch (glidesort), which goes way beyond "some improvements" (and thus would also be a significantly larger PR to review). I have a working version without panic safety, and am working on a version with full panic safety with less unsafe code (which is getting delayed due to personal circumstances). I recently held a talk about glidesort here: https://www.youtube.com/watch?v=2y3IK1l6PI4

The results are thus also much more impactful. On my machine (2021 Apple M1 Pro) the speedups are such that glidesort is faster than sort_unstable for random data, and 3.5x faster than sort:

image

When dealing with a low cardinality dataset (under projection of the comparison operator), e.g. sorting by age stored in integer years, which will likely only have ~120 unique values, glidesort becomes faster still (benchmark is 256 unique values by masking integers with 0xFF before comparing):

image

@ghost
Copy link

ghost commented May 30, 2022

@orlp Congratulations to this algorithm, this sounds very promising!
Maybe I missed some points in your presentation, but I think you have not said which N*Log(N)*Log(N) fallback you are using. Is it Symmerge? If so, you could probably still improve it, by helping out the rotation algorithm by the (at this point probably unused) extra buffer. I have implemented a similar (in the sense of the fallback only) hybrid method (but based on current stable sort and Symmerge) which could benefit from this when sizeof(T) is appropriate (>128, otherwise default rotate was faster). In my version, the "append 1%" is much faster (~2-3 times) than the Rust version, because I let Symmerge solve the problem, once the length difference is large (that probably only works better because of missing galloping in current Rust version).

In your presentation you say, that GlideSort will assume cheap comparisons. Will it become slower than another algorithm, when the comprisons are not cheap (for example image comparisons)?
You also say that cheap moves (I assume small sizeof(T) is meant) are assumed.
In the cases where sizeof(T) > ~384 Symmerge with Buffer will slow down and will become slower than GrailSort (only on random data) because of it's N*Log(N)*Log(N) worst case on moves. It would be interesting if your algorithm experiences the same problems and will be N*Log(N)*Log(N) if sizeof(T) is larger and a constant buffer (e.g. 1024) is used. That was the problem with my code...

It would be interesting to see tests when sizeof(T) >= 256 or even sizeof(T) >= 384, in my opinion, a sizeof(T) < sizeof(usize) is not so common and thus not so important for a sorting algorithm but of course a very good feature.

Also, during my research I found your shift_head and shift_tail, from pdqsort, I think they do linear searches. Could they do galloping, too?

@orlp
Copy link
Contributor

orlp commented May 30, 2022

Maybe I missed some points in your presentation, but I think you have not said which N*Log(N)*Log(N) fallback you are using. Is it Symmerge?

I did briefly talk about it on the "creating more parallelism slide": https://youtu.be/2y3IK1l6PI4?t=727

I tried symmerge but it was slower and more complicated. I use a simplified scheme based on symmerge that uses block swaps rather than rotations. This is slightly less optimal in terms of comparisons but the smaller, simpler code and better codegen for blockswaps compared to rotations more than made up for it. It also made implied block swaps simpler.

The logic I use is very similar to symmerge. I find the "crossover point" using binary search:

// Given sorted left, right of equal size n it finds smallest i such that
// for all l in left[i..] and r in right[..n-i] we have l > r. By vacuous truth
// n is always a solution (but not necessarily the smallest).
//
// Panics if left, right aren't equal in size or if is_less does.
pub fn crossover_point<T, F: Cmp<T>>(left: &[T], right: &[T], is_less: &mut F) -> usize { ... }

Which then partitions an array as such:

    let minlen = left.len().min(right.len());
    let left_skip = left.len() - minlen;
    let i = crossover_point(
        &left.as_mut_slice()[left_skip..],
        &right.as_mut_slice()[..minlen],
        is_less,
    );
    let (left0, left1) = left.split_at(left_skip + i).unwrap_abort();
    let (right0, right1) = right.split_at(minlen - i).unwrap_abort();
    (left0, left1, right0, right1)

Then merging left0, right0 followed by left1, right1 is equivalent to stably merging left, right. Note that we don't always have to actually swap left1 and right0 - if we have the scratch space available all we do is reassign destination/source buffers. I call this an implied swap.

In your presentation you say, that GlideSort will assume cheap comparisons. Will it become slower than another algorithm, when the comprisons are not cheap (for example image comparisons)?

If comparisons dominate runtime entirely glidesort pretty much uses the same number of comparisons as sort currently does for random data. For patterns it's going to depend on exactly the pattern, overall I'd say better.

Note that if your comparison operator is exceptionally expensive like image comparison, you are much better suited with an algorithm that gets much closer to optimality than mergesort, like merge-insertion sort.

In the cases where sizeof(T) > ~384 Symmerge with Buffer will slow down and will become slower than GrailSort (only on random data) because of it's N*Log(N)*Log(N) worst case on moves.

I honestly don't really care about benchmarking the performance for sorting such large structs/arrays, because if anyone actually cared for the performance of sorting those they would be using indirection. Constantly shuffling such large structures around during sorting will never be performant. But yes, with a fixed-size buffer glidesort is O(n log(n)²).

You also say that cheap moves (I assume small sizeof(T) is meant) are assumed.

I am generally focusing on objects that fit in a couple machine words. E.g. strings, integers, an (x, y, z) f64 triple, a struct of two integers and a string, etc. Larger objects reduce to this case through indirection. Your cache performance will likely suffer with indirection, but then again, you're really not getting that much cache benefits with large objects anyway.

It would be interesting to see tests when sizeof(T) >= 256 or even sizeof(T) >= 384, in my opinion, a sizeof(T) < sizeof(usize) is not so common and thus not so important for a sorting algorithm but of course a very good feature.

I hold the exact opposite opinion, that sizeof(T) <= sizeof(usize) * 8 is by far the most important domain to benchmark for a generic system sort.

Also, during my research I found your shift_head and shift_tail, from pdqsort, I think they do linear searches. Could they do galloping, too?

Neither of these functions are found in pdqsort, so I don't know. And traditional 'galloping' is mostly a waste of time for a system sort in my opinion. It is relatively complex, mostly improves very academic notions of pre-sortedness that have little to no real-world justification (note how the new paper you cited does not contain any real-world benchmarks, or even synthetic benchmarks at all), and slows down the hottest core loops. There are some tricks similar to galloping (which I'd rather call 'optimistic binary search') I still might fit in at some points out of the core loops, but I really doubt they would benefit much.

Most of the points you bring up very much come from the old school of thinking about sorting that glidesort (and pdqsort) precisely tries to puncture. Real-world system sorting isn't about minimizing comparisons at all. It's not about moves either, for the most part. It's about minimizing time. And most of the time is gained or lost in (instruction-level) parallelism, branch mispredictions and cache misses. That's why glidesort is 3.5x faster than the current slice::sort for random u32s (same for u64s) on my machine, instead of just a modest couple %, or only on a specific distribution.

@ghost
Copy link

ghost commented May 30, 2022

Thank you for the detailed answer, sounds really good to me!

I hold the exact opposite opinion, that sizeof(T) <= sizeof(usize) * 8 is by far the most important domain to benchmark for a generic system sort.

I would agree to this, on my machine sizeof(usize) * 8 is 512 (I meant bits not bytes), that's why I mentioned the sizes 256 and 384 ;)

@orlp
Copy link
Contributor

orlp commented May 30, 2022

I would agree to this, on my machine sizeof(usize) * 8 is 512 (I meant bits not bytes), that's why I mentioned the sizes 256 and 384 ;)

Ah alright, I thought you meant bytes.

@yjhn
Copy link
Contributor

yjhn commented Jun 8, 2022

If we are discussing possible sort improvements, maybe vectorizing would be of interest? This might be possible to implement using portable SIMD (probably not, though, as in the linked blog post they use SIMD runtime dispatch, which would probably not work with Rust's portable SIMD).

@orlp
Copy link
Contributor

orlp commented Jun 8, 2022

If we are discussing possible sort improvements, maybe vectorizing would be of interest? This might be possible to implement using portable SIMD (probably not, though, as in the linked blog post they use SIMD runtime dispatch, which would probably not work with Rust's portable SIMD).

Vectorized quicksort / sorting networks have been around for quite a while, but they're pretty much limited to sorting small integers without custom comparison functions. Not very applicable to a generic sort for arbitrary types with arbitrary comparison functions. Maybe in the future we can specialize for it, but I wouldn't consider it high priority.

@scottmcm
Copy link
Member

scottmcm commented Jun 8, 2022

  • It does not track and merge "runs". However, on each call to slice_merge_sort, the beginning of the slice is checked for the number of already sorted elements. These already sorted elements are never re-checked or redundantly sorted, saving considerable time.

Does it also do this for the end of the slice? One of the things that's nice about the current sort is that you can extend a sorted collection with another once, call sort, and know it's going to do nearly as well as calling a dedicated merge function -- which std doesn't yet have. (Or, in general, chain a bunch of them and have the sort adaptively take advantage of the work that was already done.)

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 3, 2022
@Logarithmus
Copy link
Contributor

Logarithmus commented Jul 20, 2022

@wpwoodjr https://spin.atomicobject.com/2020/05/05/git-configurations-default/
TLDR; git config --global pull.rebase true will make your git history much cleaner

@Voultapher
Copy link
Contributor

Looking at the total comparisons done:

Percent comparisons done more by wpwoodjr_stable than std_stable
[1k-all_equal-20-plus]:               median: 0%    min: 0%    max: 0%
[1k-all_equal-20-sub]:                median: 0%    min: 0%    max: 0%
[1k-ascending-20-plus]:               median: 0%    min: 0%    max: 0%
[1k-ascending-20-sub]:                median: 0%    min: 0%    max: 0%
[1k-ascending_saw_20-20-plus]:        median: 13%   min: 0%    max: 29%
[1k-ascending_saw_20-20-sub]:         median: 0%    min: 0%    max: 0%
[1k-ascending_saw_5-20-plus]:         median: 4%    min: -9%   max: 13%
[1k-ascending_saw_5-20-sub]:          median: 0%    min: 0%    max: 4%
[1k-descending-20-plus]:              median: 0%    min: 0%    max: 0%
[1k-descending-20-sub]:               median: 0%    min: 0%    max: 0%
[1k-descending_saw_20-20-plus]:       median: 5%    min: 0%    max: 19%
[1k-descending_saw_20-20-sub]:        median: 0%    min: 0%    max: 0%
[1k-descending_saw_5-20-plus]:        median: -3%   min: -10%  max: 15%
[1k-descending_saw_5-20-sub]:         median: 0%    min: -5%   max: 0%
[1k-pipe_organ-20-plus]:              median: 1%    min: 0%    max: 9%
[1k-pipe_organ-20-sub]:               median: 14%   min: 8%    max: 50%
[1k-random-20-plus]:                  median: 0%    min: -2%   max: 4%
[1k-random-20-sub]:                   median: 0%    min: 0%    max: 2%
[1k-random_random_size-20-plus]:      median: -3%   min: -11%  max: 6%
[1k-random_random_size-20-sub]:       median: 0%    min: -17%  max: 17%
[1k-random_uniform-20-plus]:          median: 0%    min: -2%   max: 4%
[1k-random_uniform-20-sub]:           median: 0%    min: -4%   max: 3%
[f128-all_equal-20-plus]:             median: 0%    min: 0%    max: 0%
[f128-all_equal-20-sub]:              median: 0%    min: 0%    max: 0%
[f128-ascending-20-plus]:             median: 0%    min: 0%    max: 0%
[f128-ascending-20-sub]:              median: 0%    min: 0%    max: 0%
[f128-ascending_saw_20-20-plus]:      median: 13%   min: 0%    max: 29%
[f128-ascending_saw_20-20-sub]:       median: 0%    min: 0%    max: 0%
[f128-ascending_saw_5-20-plus]:       median: 4%    min: -9%   max: 13%
[f128-ascending_saw_5-20-sub]:        median: 0%    min: 0%    max: 3%
[f128-descending-20-plus]:            median: 0%    min: 0%    max: 0%
[f128-descending-20-sub]:             median: 0%    min: 0%    max: 0%
[f128-descending_saw_20-20-plus]:     median: 5%    min: 0%    max: 19%
[f128-descending_saw_20-20-sub]:      median: 0%    min: 0%    max: 0%
[f128-descending_saw_5-20-plus]:      median: -3%   min: -10%  max: 17%
[f128-descending_saw_5-20-sub]:       median: 0%    min: -5%   max: 0%
[f128-pipe_organ-20-plus]:            median: 1%    min: -2%   max: 9%
[f128-pipe_organ-20-sub]:             median: 18%   min: 9%    max: 50%
[f128-random-20-plus]:                median: 0%    min: -2%   max: 5%
[f128-random-20-sub]:                 median: 0%    min: -4%   max: 10%
[f128-random_random_size-20-plus]:    median: 1%    min: -7%   max: 11%
[f128-random_random_size-20-sub]:     median: 0%    min: -8%   max: 17%
[f128-random_uniform-20-plus]:        median: 0%    min: -2%   max: 2%
[f128-random_uniform-20-sub]:         median: 0%    min: -2%   max: 3%
[i32-all_equal-20-plus]:              median: 0%    min: 0%    max: 0%
[i32-all_equal-20-sub]:               median: 0%    min: 0%    max: 0%
[i32-ascending-20-plus]:              median: 0%    min: 0%    max: 0%
[i32-ascending-20-sub]:               median: 0%    min: 0%    max: 0%
[i32-ascending_saw_20-20-plus]:       median: 10%   min: 0%    max: 29%
[i32-ascending_saw_20-20-sub]:        median: 0%    min: 0%    max: 0%
[i32-ascending_saw_5-20-plus]:        median: 3%    min: -10%  max: 13%
[i32-ascending_saw_5-20-sub]:         median: 0%    min: 0%    max: 3%
[i32-descending-20-plus]:             median: 0%    min: 0%    max: 0%
[i32-descending-20-sub]:              median: 0%    min: 0%    max: 0%
[i32-descending_saw_20-20-plus]:      median: 5%    min: 0%    max: 19%
[i32-descending_saw_20-20-sub]:       median: 0%    min: 0%    max: 0%
[i32-descending_saw_5-20-plus]:       median: -2%   min: -9%   max: 17%
[i32-descending_saw_5-20-sub]:        median: 0%    min: -5%   max: 0%
[i32-pipe_organ-20-plus]:             median: 0%    min: -37%  max: 9%
[i32-pipe_organ-20-sub]:              median: 18%   min: 8%    max: 50%
[i32-random-20-plus]:                 median: 0%    min: -2%   max: 7%
[i32-random-20-sub]:                  median: 0%    min: -1%   max: 2%
[i32-random_random_size-20-plus]:     median: 1%    min: -4%   max: 17%
[i32-random_random_size-20-sub]:      median: 0%    min: -6%   max: 25%
[i32-random_uniform-20-plus]:         median: 0%    min: -2%   max: 2%
[i32-random_uniform-20-sub]:          median: 0%    min: -4%   max: 3%
[string-all_equal-20-plus]:           median: 0%    min: 0%    max: 0%
[string-all_equal-20-sub]:            median: 0%    min: 0%    max: 0%
[string-ascending-20-plus]:           median: 0%    min: 0%    max: 0%
[string-ascending-20-sub]:            median: 0%    min: 0%    max: 0%
[string-ascending_saw_20-20-plus]:    median: -1%   min: -6%   max: 20%
[string-ascending_saw_20-20-sub]:     median: -12%  min: -20%  max: 0%
[string-ascending_saw_5-20-plus]:     median: 0%    min: -4%   max: 6%
[string-ascending_saw_5-20-sub]:      median: -3%   min: -20%  max: 0%
[string-descending-20-plus]:          median: 0%    min: 0%    max: 0%
[string-descending-20-sub]:           median: 0%    min: 0%    max: 0%
[string-descending_saw_20-20-plus]:   median: -1%   min: -7%   max: 20%
[string-descending_saw_20-20-sub]:    median: -11%  min: -20%  max: 0%
[string-descending_saw_5-20-plus]:    median: -1%   min: -4%   max: 5%
[string-descending_saw_5-20-sub]:     median: -7%   min: -25%  max: 0%
[string-pipe_organ-20-plus]:          median: 4%    min: -10%  max: 5%
[string-pipe_organ-20-sub]:           median: -3%   min: -9%   max: 0%
[string-random-20-plus]:              median: 0%    min: -2%   max: 5%
[string-random-20-sub]:               median: 0%    min: -2%   max: 1%
[string-random_random_size-20-plus]:  median: -2%   min: -8%   max: 10%
[string-random_random_size-20-sub]:   median: 0%    min: -11%  max: 13%
[string-random_uniform-20-plus]:      median: 0%    min: -2%   max: 2%
[string-random_uniform-20-sub]:       median: 0%    min: -3%   max: 2%
[u64-all_equal-20-plus]:              median: 0%    min: 0%    max: 0%
[u64-all_equal-20-sub]:               median: 0%    min: 0%    max: 0%
[u64-ascending-20-plus]:              median: 0%    min: 0%    max: 0%
[u64-ascending-20-sub]:               median: 0%    min: 0%    max: 0%
[u64-ascending_saw_20-20-plus]:       median: 10%   min: 0%    max: 29%
[u64-ascending_saw_20-20-sub]:        median: 0%    min: 0%    max: 0%
[u64-ascending_saw_5-20-plus]:        median: 3%    min: -10%  max: 13%
[u64-ascending_saw_5-20-sub]:         median: 0%    min: 0%    max: 4%
[u64-descending-20-plus]:             median: 0%    min: 0%    max: 0%
[u64-descending-20-sub]:              median: 0%    min: 0%    max: 0%
[u64-descending_saw_20-20-plus]:      median: 5%    min: 0%    max: 19%
[u64-descending_saw_20-20-sub]:       median: 0%    min: 0%    max: 0%
[u64-descending_saw_5-20-plus]:       median: -3%   min: -9%   max: 17%
[u64-descending_saw_5-20-sub]:        median: 0%    min: -5%   max: 0%
[u64-pipe_organ-20-plus]:             median: 0%    min: -36%  max: 8%
[u64-pipe_organ-20-sub]:              median: 12%   min: 7%    max: 50%
[u64-random-20-plus]:                 median: 0%    min: -2%   max: 5%
[u64-random-20-sub]:                  median: 0%    min: -9%   max: 7%
[u64-random_random_size-20-plus]:     median: -3%   min: -16%  max: 8%
[u64-random_random_size-20-sub]:      median: 0%    min: 0%    max: 17%
[u64-random_uniform-20-plus]:         median: 0%    min: -2%   max: 2%
[u64-random_uniform-20-sub]:          median: 0%    min: -2%   max: 2%

The result suggests a rather minimal overall change. Which is good.

@Voultapher
Copy link
Contributor

Running my test suite miri complains about UB:

cargo miri test random

Seed: 17511317657358896758
error: Undefined Behavior: attempting a read access using <389996> at alloc165698[0x0], but that tag does not exist in the borrow stack for this location
   --> /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:144:13
    |
144 |             ptr::copy_nonoverlapping(hole.dest, v.get_unchecked_mut(end), 1);
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |             |
    |             attempting a read access using <389996> at alloc165698[0x0], but that tag does not exist in the borrow stack for this location
    |             this error occurs as part of an access at alloc165698[0x0..0x4]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <389996> was created by a retag at offsets [0x0..0x4]
   --> /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:142:23
    |
142 |                 dest: v.get_unchecked_mut(end - 1),
    |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: <389996> was later invalidated at offsets [0x0..0x8]
   --> /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:144:49
    |
144 |             ptr::copy_nonoverlapping(hole.dest, v.get_unchecked_mut(end), 1);
    |                                                 ^^^^^^^^^^^^^^^^^^^^^^^^
    = note: backtrace:
    = note: inside `sort_comp::wpwoodjr_stable_sort::insert_end::<i32, [closure@sort_comp::wpwoodjr_stable_sort::sort_by<i32, [closure@tests/main.rs:56:52: 56:58]>::{closure#0}]>` at /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:144:13
note: inside `sort_comp::wpwoodjr_stable_sort::merge_sort::<i32, [closure@sort_comp::wpwoodjr_stable_sort::sort_by<i32, [closure@tests/main.rs:56:52: 56:58]>::{closure#0}]>` at /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:339:13
   --> /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:339:13
    |
339 |             insert_end(&mut v[..=i], &mut is_less);
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `sort_comp::wpwoodjr_stable_sort::sort_by::<i32, [closure@tests/main.rs:56:52: 56:58]>` at /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:95:5
   --> /projects/rust/sort-research-rs/src/wpwoodjr_stable_sort/mod.rs:95:5
    |
95  |     merge_sort(arr, |a, b| compare(a, b) == Ordering::Less);

@Voultapher
Copy link
Contributor

As noted in the other PR I ran my benchmark suite, but only zen3, with this implementation and got:

Screenshot 2022-08-24 at 18 22 18

A statistically significant speedup in most cases, with mostly ascending_saw and descending_saw seeing regressions. Levelling out at around 6% for random u64 inputs hot.

Screenshot 2022-08-24 at 18 22 44

Strings see smaller improvements, but overall a slight improvement. You can find the full results here.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 8, 2022
@bors
Copy link
Contributor

bors commented Jan 21, 2023

☔ The latest upstream changes (presumably #107143) made this pull request unmergeable. Please resolve the merge conflicts.

@KittyBorgX
Copy link
Member

@rustbot author
@wpwoodjr Hey! It looks like there are some merge conflicts, can you sort them out?
Also, any update on the status of the PR?

Once you're done doing everything, type in @rustbot review here to change the tag to S-waiting-on-review

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 29, 2023
@wpwoodjr
Copy link
Author

@rustbot author @wpwoodjr Hey! It looks like there are some merge conflicts, can you sort them out? Also, any update on the status of the PR?

Once you're done doing everything, type in @rustbot review here to change the tag to S-waiting-on-review

OK, will try and get to it in the next week or so.

@wpwoodjr
Copy link
Author

  • It does not track and merge "runs". However, on each call to slice_merge_sort, the beginning of the slice is checked for the number of already sorted elements. These already sorted elements are never re-checked or redundantly sorted, saving considerable time.

Does it also do this for the end of the slice? One of the things that's nice about the current sort is that you can extend a sorted collection with another once, call sort, and know it's going to do nearly as well as calling a dedicated merge function -- which std doesn't yet have. (Or, in general, chain a bunch of them and have the sort adaptively take advantage of the work that was already done.)

Yes, it will detect the sorted prefix collection, then merge in the second collection.

@wpwoodjr
Copy link
Author

@rustbot author @wpwoodjr Hey! It looks like there are some merge conflicts, can you sort them out? Also, any update on the status of the PR?
Once you're done doing everything, type in @rustbot review here to change the tag to S-waiting-on-review

OK, will try and get to it in the next week or so.

This will take a bit longer than I anticipated, as the sort routines have been moved to the core library, and I am traveling next week. I will get back to it when I return.

@JohnCSimon
Copy link
Member

@wpwoodjr

ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks!

FYI: when a PR is ready for review, send a message containing
@rustbot ready to switch to S-waiting-on-review so the PR is in the reviewer's backlog.

@JohnCSimon
Copy link
Member

@wpwoodjr
Ping from triage: I'm closing this due to inactivity, Please reopen when you are ready to continue with this.
Note: if you are going to continue please open the PR BEFORE you push to it, else you won't be able to reopen - this is a quirk of github.
Thanks for your contribution.

@rustbot label: +S-inactive

@JohnCSimon JohnCSimon closed this Nov 12, 2023
@rustbot rustbot added the S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. label Nov 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf-regression Performance regression. S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.