Improve Array.Sort(T[]) performance #35297

Merged: 2 commits into dotnet:master, Apr 24, 2020

@stephentoub
Member

stephentoub commented Apr 22, 2020

#35175 was opened to report a roughly 10% regression in sorting throughput (looking specifically at Int32[]) from .NET Core 3.1 to .NET 5.

On my machine:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.207 (2004/?/20H1)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20211.5

our dotnet/performance tests don't show a regression anywhere near that, e.g.

| Type | Method | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `Sort<BigStruct>` | Array_Comparison | `\master\corerun.exe` | 128 | 4.028 us | 0.1816 us | 0.1783 us | 4.014 us | 3.805 us | 4.529 us | 0.93 | 0.07 |
| `Sort<BigStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 128 | 4.333 us | 0.2711 us | 0.2663 us | 4.267 us | 4.089 us | 4.945 us | 1.00 | 0.00 |
| `Sort<Int32>` | Array_Comparison | `\master\corerun.exe` | 128 | 3.277 us | 0.7013 us | 0.8076 us | 2.822 us | 2.649 us | 4.951 us | 0.99 | 0.11 |
| `Sort<Int32>` | Array_Comparison | `\netcore31\corerun.exe` | 128 | 3.327 us | 0.6240 us | 0.7186 us | 2.848 us | 2.668 us | 4.293 us | 1.00 | 0.00 |
| `Sort<IntClass>` | Array_Comparison | `\master\corerun.exe` | 128 | 6.481 us | 0.1237 us | 0.1033 us | 6.454 us | 6.336 us | 6.645 us | 1.05 | 0.02 |
| `Sort<IntClass>` | Array_Comparison | `\netcore31\corerun.exe` | 128 | 6.167 us | 0.0816 us | 0.0637 us | 6.156 us | 6.078 us | 6.272 us | 1.00 | 0.00 |
| `Sort<IntStruct>` | Array_Comparison | `\master\corerun.exe` | 128 | 3.967 us | 1.4690 us | 1.6917 us | 3.002 us | 2.671 us | 6.633 us | 0.98 | 0.09 |
| `Sort<IntStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 128 | 4.039 us | 1.3931 us | 1.6043 us | 3.026 us | 2.692 us | 6.397 us | 1.00 | 0.00 |
| `Sort<String>` | Array_Comparison | `\master\corerun.exe` | 128 | 41.962 us | 0.4130 us | 0.3449 us | 41.896 us | 41.473 us | 42.598 us | 0.78 | 0.01 |
| `Sort<String>` | Array_Comparison | `\netcore31\corerun.exe` | 128 | 53.607 us | 0.2796 us | 0.2335 us | 53.682 us | 53.180 us | 53.989 us | 1.00 | 0.00 |
| `Sort<BigStruct>` | Array_Comparison | `\master\corerun.exe` | 256 | 10.402 us | 0.2142 us | 0.2200 us | 10.431 us | 10.121 us | 10.930 us | 0.91 | 0.04 |
| `Sort<BigStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 256 | 11.442 us | 0.2745 us | 0.2819 us | 11.352 us | 11.102 us | 11.946 us | 1.00 | 0.00 |
| `Sort<Int32>` | Array_Comparison | `\master\corerun.exe` | 256 | 7.440 us | 0.0629 us | 0.0525 us | 7.430 us | 7.367 us | 7.532 us | 1.00 | 0.01 |
| `Sort<Int32>` | Array_Comparison | `\netcore31\corerun.exe` | 256 | 7.449 us | 0.0787 us | 0.0658 us | 7.448 us | 7.349 us | 7.573 us | 1.00 | 0.00 |
| `Sort<IntClass>` | Array_Comparison | `\master\corerun.exe` | 256 | 16.090 us | 0.0999 us | 0.0835 us | 16.108 us | 15.961 us | 16.187 us | 1.04 | 0.01 |
| `Sort<IntClass>` | Array_Comparison | `\netcore31\corerun.exe` | 256 | 15.492 us | 0.1317 us | 0.1100 us | 15.461 us | 15.367 us | 15.746 us | 1.00 | 0.00 |
| `Sort<IntStruct>` | Array_Comparison | `\master\corerun.exe` | 256 | 7.741 us | 0.0896 us | 0.0699 us | 7.722 us | 7.659 us | 7.892 us | 1.03 | 0.01 |
| `Sort<IntStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 256 | 7.528 us | 0.1079 us | 0.0843 us | 7.524 us | 7.405 us | 7.679 us | 1.00 | 0.00 |
| `Sort<String>` | Array_Comparison | `\master\corerun.exe` | 256 | 95.722 us | 0.7744 us | 0.6467 us | 95.536 us | 94.978 us | 96.726 us | 0.79 | 0.01 |
| `Sort<String>` | Array_Comparison | `\netcore31\corerun.exe` | 256 | 121.650 us | 0.5643 us | 0.5002 us | 121.748 us | 120.844 us | 122.655 us | 1.00 | 0.00 |
| `Sort<BigStruct>` | Array_Comparison | `\master\corerun.exe` | 1024 | 69.300 us | 0.7271 us | 0.6802 us | 69.153 us | 68.452 us | 70.507 us | 0.97 | 0.01 |
| `Sort<BigStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 1024 | 71.688 us | 0.3637 us | 0.3224 us | 71.633 us | 71.268 us | 72.251 us | 1.00 | 0.00 |
| `Sort<Int32>` | Array_Comparison | `\master\corerun.exe` | 1024 | 48.195 us | 0.3197 us | 0.2991 us | 48.131 us | 47.767 us | 48.726 us | 1.00 | 0.01 |
| `Sort<Int32>` | Array_Comparison | `\netcore31\corerun.exe` | 1024 | 48.023 us | 0.4813 us | 0.4266 us | 47.822 us | 47.539 us | 48.785 us | 1.00 | 0.00 |
| `Sort<IntClass>` | Array_Comparison | `\master\corerun.exe` | 1024 | 86.818 us | 0.4345 us | 0.4064 us | 86.950 us | 86.163 us | 87.711 us | 1.03 | 0.01 |
| `Sort<IntClass>` | Array_Comparison | `\netcore31\corerun.exe` | 1024 | 84.631 us | 0.3439 us | 0.3048 us | 84.565 us | 84.260 us | 85.095 us | 1.00 | 0.00 |
| `Sort<IntStruct>` | Array_Comparison | `\master\corerun.exe` | 1024 | 49.627 us | 0.8809 us | 0.7356 us | 49.472 us | 48.783 us | 51.387 us | 0.96 | 0.02 |
| `Sort<IntStruct>` | Array_Comparison | `\netcore31\corerun.exe` | 1024 | 51.501 us | 0.6825 us | 0.6384 us | 51.372 us | 50.403 us | 52.402 us | 1.00 | 0.00 |
| `Sort<String>` | Array_Comparison | `\master\corerun.exe` | 1024 | 551.786 us | 4.9619 us | 4.3986 us | 550.369 us | 546.377 us | 561.409 us | 0.82 | 0.01 |
| `Sort<String>` | Array_Comparison | `\netcore31\corerun.exe` | 1024 | 670.227 us | 5.4861 us | 4.5811 us | 669.726 us | 662.498 us | 680.227 us | 1.00 | 0.00 |

However, the benchmarks shared in that issue do show a regression in some cases for Int32 on my machine, albeit not quite what was cited:

| Method | Job | Toolchain | N | Mean[us] | Error[us] | StdDev[us] | Time / N[ns] | Ratio | RatioSD | SpeedupMedian | Code Size[B] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySort | Job-RWQZQA | Default | 100 | 2.123 us | 0.0490 us | 0.0733 us | 21.2333 ns | 1.13 | 0.07 | 0.89 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 100 | 2.050 us | 0.0421 us | 0.0887 us | 20.5000 ns | 1.10 | 0.07 | 0.91 | 191 B |
| ArraySort | Job-RWQZQA | Default | 1000 | 30.437 us | 0.5058 us | 0.7091 us | 30.4370 ns | 1.04 | 0.02 | 0.96 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 1000 | 31.331 us | 0.2103 us | 0.1642 us | 31.3306 ns | 1.08 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-RWQZQA | Default | 10000 | 404.177 us | 0.6468 us | 0.9277 us | 40.4177 ns | 1.02 | 0.00 | 0.98 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 10000 | 425.490 us | 3.3336 us | 2.9551 us | 42.5490 ns | 1.08 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-RWQZQA | Default | 100000 | 5,079.667 us | 4.6692 us | 6.6964 us | 50.7967 ns | 1.02 | 0.01 | 0.98 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 100000 | 5,338.014 us | 20.5143 us | 16.0162 us | 53.3801 ns | 1.07 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-RWQZQA | Default | 1000000 | 60,104.120 us | 32.8821 us | 47.1586 us | 60.1041 ns | 1.01 | 0.00 | 0.99 | 306 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 1000000 | 63,220.262 us | 93.8403 us | 87.7783 us | 63.2203 ns | 1.06 | 0.00 | 0.94 | 97 B |
| ArraySort | Job-RWQZQA | Default | 10000000 | 699,312.886 us | 755.2464 us | 1,107.0303 us | 69.9313 ns | 1.01 | 0.00 | 0.99 | 306 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 10000000 | 736,870.178 us | 1,700.3041 us | 1,590.4655 us | 73.6870 ns | 1.07 | 0.00 | 0.94 | 150 B |

Even so, a simple console app does sufficiently demonstrate a meaningful throughput regression:

using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        const int Size = 1_000;
        int[][] arrays = Enumerable.Range(0, 100_000).Select(_ => new int[Size]).ToArray();
        var sw = new Stopwatch();

        var r = new Random(42);
        var unsorted = new int[Size];
        for (int i = 0; i < unsorted.Length; i++) unsorted[i] = r.Next();

        while (true)
        {
            foreach (int[] array in arrays) Array.Copy(unsorted, array, unsorted.Length);

            sw.Restart();
            foreach (int[] array in arrays) Array.Sort(array);
            sw.Stop();

            Console.WriteLine(sw.Elapsed.TotalSeconds);
        }
    }
}

On .NET Core 3.1 I get results like:

1.6084611
1.5947359
1.6061765
1.5940196
1.6140805

and with master I get results like:

1.9340624
1.9376759
1.9425619
1.936641
1.9411098

which is closer to a 20% regression.
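
As a quick check of that figure, the following sketch averages the two sets of console timings quoted above and computes the relative slowdown. The class and method names (`RegressionMath`, `Slowdown`) are made up for this illustration and are not part of the PR.

```csharp
using System;
using System.Linq;

static class RegressionMath
{
    // Timings in seconds, copied from the two console runs quoted above.
    public static readonly double[] NetCore31 = { 1.6084611, 1.5947359, 1.6061765, 1.5940196, 1.6140805 };
    public static readonly double[] Master = { 1.9340624, 1.9376759, 1.9425619, 1.936641, 1.9411098 };

    // Relative slowdown of 'current' versus 'baseline' as a fraction (0.20 means 20% slower).
    public static double Slowdown(double[] baseline, double[] current)
        => current.Average() / baseline.Average() - 1.0;
}
```

`RegressionMath.Slowdown(RegressionMath.NetCore31, RegressionMath.Master)` evaluates to roughly 0.21, in line with the "closer to 20%" estimate.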

So even though our actual benchmarks aren’t showing anything close to that (and in some cases show .NET 5 being meaningfully faster, especially with strings), this PR addresses the gap. It includes a variety of tweaks to improve Array.Sort<T>(T[]) performance; the two most impactful are using Unsafe.* in PickPivotAndPartition to avoid bounds checks and aggressive inlining on SwapIfGreater. A few other small improvements to codegen round it out. I only made the unsafe changes in the Sort<T>(T[]) implementation, and not in the more complicated implementations, such as for Sort<T>(T[], Comparer<T>) and Sort<TKey, TValue>(TKey[], TValue[]), but I did make some of the smaller changes for consistency across the file.
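
To make the inlining tweak concrete, here is a minimal sketch of what an aggressively inlined swap helper could look like. It is modeled on the `SwapIfGreater` name mentioned above, but the signature and body are assumptions for illustration, not the code from this PR.

```csharp
using System;
using System.Runtime.CompilerServices;

static class SortSketch
{
    // Hypothetical helper: swaps the two referenced values when the first compares greater.
    // AggressiveInlining asks the JIT to inline the call where possible, which removes
    // call overhead from the hot partitioning loop; it is a request, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void SwapIfGreater<T>(ref T i, ref T j) where T : IComparable<T>
    {
        if (i != null && i.CompareTo(j) > 0)
        {
            T tmp = i;
            i = j;
            j = tmp;
        }
    }
}
```

For example, with `int a = 5, b = 3;`, calling `SortSketch.SwapIfGreater(ref a, ref b)` leaves `a == 3` and `b == 5`.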

Fixes #35175
@jkotas, any visceral reaction to the Unsafe.* code here? 😄
cc: @damageboy, @adamsitnik

In terms of impact, here are my results from running the benchmarks shared in #35175:

| Method | Job | Toolchain | N | Mean[us] | Error[us] | StdDev[us] | Time / N[ns] | Ratio | RatioSD | SpeedupMedian | Code Size[B] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySort | Job-RWQZQA | Default | 100 | 2.123 us | 0.0490 us | 0.0733 us | 21.2333 ns | 1.13 | 0.07 | 0.89 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 100 | 2.050 us | 0.0421 us | 0.0887 us | 20.5000 ns | 1.10 | 0.07 | 0.91 | 191 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 100 | 1.868 us | 0.0386 us | 0.0933 us | 18.6812 ns | 1.00 | 0.00 | 1.00 | 191 B |
| ArraySort | Job-RWQZQA | Default | 1000 | 30.437 us | 0.5058 us | 0.7091 us | 30.4370 ns | 1.04 | 0.02 | 0.96 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 1000 | 31.331 us | 0.2103 us | 0.1642 us | 31.3306 ns | 1.08 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 1000 | 28.990 us | 0.1958 us | 0.1635 us | 28.9897 ns | 1.00 | 0.00 | 1.00 | 191 B |
| ArraySort | Job-RWQZQA | Default | 10000 | 404.177 us | 0.6468 us | 0.9277 us | 40.4177 ns | 1.02 | 0.00 | 0.98 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 10000 | 425.490 us | 3.3336 us | 2.9551 us | 42.5490 ns | 1.08 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 10000 | 394.917 us | 2.1902 us | 1.9415 us | 39.4917 ns | 1.00 | 0.00 | 1.00 | 191 B |
| ArraySort | Job-RWQZQA | Default | 100000 | 5,079.667 us | 4.6692 us | 6.6964 us | 50.7967 ns | 1.02 | 0.01 | 0.98 | 336 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 100000 | 5,338.014 us | 20.5143 us | 16.0162 us | 53.3801 ns | 1.07 | 0.01 | 0.93 | 191 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 100000 | 4,975.340 us | 33.2037 us | 31.0588 us | 49.7534 ns | 1.00 | 0.00 | 1.00 | 191 B |
| ArraySort | Job-RWQZQA | Default | 1000000 | 60,104.120 us | 32.8821 us | 47.1586 us | 60.1041 ns | 1.01 | 0.00 | 0.99 | 306 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 1000000 | 63,220.262 us | 93.8403 us | 87.7783 us | 63.2203 ns | 1.06 | 0.00 | 0.94 | 97 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 1000000 | 59,504.806 us | 102.7170 us | 80.1947 us | 59.5048 ns | 1.00 | 0.00 | 1.00 | 97 B |
| ArraySort | Job-RWQZQA | Default | 10000000 | 699,312.886 us | 755.2464 us | 1,107.0303 us | 69.9313 ns | 1.01 | 0.00 | 0.99 | 306 B |
| ArraySort | Job-WOYWHX | `\master\corerun.exe` | 10000000 | 736,870.178 us | 1,700.3041 us | 1,590.4655 us | 73.6870 ns | 1.07 | 0.00 | 0.94 | 150 B |
| ArraySort | Job-KOZKWL | `\pr\corerun.exe` | 10000000 | 691,165.742 us | 442.2233 us | 413.6559 us | 69.1166 ns | 1.00 | 0.00 | 1.00 | 150 B |

For the simple command-line app, we now get results like:

1.6690784
1.659153
1.6788734
1.6582936
1.6734444

I also tweaked the above app to remove the code that copies the unsorted data over each array, such that we're then sorting an already sorted array each time. With that, on .NET Core 3.1 I get:

0.4768374
0.4794121
0.4774386
0.4762099
0.477092

and on master I get:

0.641116
0.6407797
0.6416041
0.6469679
0.6616153

and with this PR I get:

0.4768012
0.4689296
0.4675659
0.470049
0.4719857
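
For readers who want to reproduce the already-sorted measurement, one possible shape of the tweaked app is sketched below. Exactly how the copy was removed is not shown in this PR, so filling the arrays once before the timing loop is an assumption, as are the `SortedInputTiming.Run` name, its parameters, and the bounded iteration count (the original app loops forever).

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class SortedInputTiming
{
    // Times 'iterations' passes of Array.Sort over 'count' arrays of 'size' ints.
    // The random data is copied in once, before the loop (an assumption), so every
    // pass after the first sorts input that is already sorted.
    public static double[] Run(int size, int count, int iterations)
    {
        var r = new Random(42);
        var unsorted = new int[size];
        for (int i = 0; i < unsorted.Length; i++) unsorted[i] = r.Next();

        int[][] arrays = Enumerable.Range(0, count).Select(_ => (int[])unsorted.Clone()).ToArray();
        var results = new double[iterations];
        var sw = new Stopwatch();

        for (int iter = 0; iter < iterations; iter++)
        {
            sw.Restart();
            foreach (int[] array in arrays) Array.Sort(array);
            sw.Stop();
            results[iter] = sw.Elapsed.TotalSeconds;
        }

        return results;
    }
}
```

Calling `SortedInputTiming.Run(1_000, 100_000, 6)` matches the sizes in the original app. The first returned value still reflects sorting unsorted data, so it should be discarded when comparing against the numbers above.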

Finally, I checked GC pause times similar to what @jkotas did in dotnet/coreclr#27642 (comment). With .NET Core 3.1, we get average GC pause times around 500ms, and with .NET 5 (both master and this PR), we get average GC pause times around 15ms. This highlights one of the benefits of moving the sorting into managed code, separate from all the other benefits.

@stephentoub stephentoub changed the title from "Improve Array,Sort(array) performance" to "Improve Array.Sort(T[]) performance" Apr 22, 2020
A variety of tweaks to improve `Array.Sort<T>(T[])` performance and address a regression left over from moving the array sorting implementation from native to managed.  The two most impactful are using `Unsafe.*` in `PickPivotAndPartition` to avoid bounds checks and aggressive inlining on `SwapIfGreater`.  A few other small improvements to codegen round it out.

I only made the unsafe changes in the `Sort<T>(T[])` implementation, and not in the more complicated implementations, such as for `Sort<T>(T[], Comparer<T>)` and `Sort<TKey, TValue>(TKey[], TValue[])`, but I did make some of the smaller changes for consistency across the file.
@damageboy
Contributor

FWIW, I've double-checked on my side and it is exactly 10% (before these changes) on both my Kaby Lake and AMD Ryzen machines. I will mention that my Intel machine does not have the Intel JCC microcode update, so that may be a factor in this, depending on @stephentoub's machine configuration.

@stephentoub
Member Author

Thanks. Are you able to test with this PR?

@damageboy
Contributor

Issue seems resolved.
Compared 3.1.201, master without the PR, and the PR branch.
Perf now seems better than 3.1.201 at the larger sort problem sizes:

| Method | Toolchain | N | Mean [us] | Error [us] | StdDev [us] | Time / N [ns] |
|---|---|---|---|---|---|---|
| ArraySort | 3.1.201 | 100 | 0.9818 | 0.0152 | 0.0213 | 9.8177 |
| ArraySort | master/08285b1 | 100 | 1.1140 | 0.0230 | 0.0384 | 11.1398 |
| ArraySort | pr/35297/2eef7dd | 100 | 1.0903 | 0.0215 | 0.0287 | 10.9032 |
| ArraySort | 3.1.201 | 1000 | 16.9130 | 1.0379 | 1.5213 | 16.9130 |
| ArraySort | master/08285b1 | 1000 | 21.9628 | 0.4283 | 0.6143 | 21.9628 |
| ArraySort | pr/35297/2eef7dd | 1000 | 17.2827 | 0.3448 | 0.5665 | 17.2827 |
| ArraySort | 3.1.201 | 10000 | 440.7663 | 1.1282 | 1.5816 | 44.0766 |
| ArraySort | master/08285b1 | 10000 | 507.5048 | 1.4836 | 1.2388 | 50.7505 |
| ArraySort | pr/35297/2eef7dd | 10000 | 457.2838 | 9.0489 | 8.0216 | 45.7284 |
| ArraySort | 3.1.201 | 100000 | 5,734.5160 | 5.6595 | 8.2957 | 57.3452 |
| ArraySort | master/08285b1 | 100000 | 6,474.4154 | 57.1762 | 53.4827 | 64.7442 |
| ArraySort | pr/35297/2eef7dd | 100000 | 5,648.4342 | 64.5477 | 60.3780 | 56.4843 |
| ArraySort | 3.1.201 | 1000000 | 67,353.2352 | 52.7210 | 72.1651 | 67.3532 |
| ArraySort | master/08285b1 | 1000000 | 77,408.7551 | 580.1940 | 542.7138 | 77.4088 |
| ArraySort | pr/35297/2eef7dd | 1000000 | 65,691.5563 | 173.6212 | 162.4054 | 65.6916 |
| ArraySort | 3.1.201 | 10000000 | 775,940.1604 | 693.8563 | 949.7583 | 77.5940 |
| ArraySort | master/08285b1 | 10000000 | 898,762.7196 | 2,351.3387 | 2,199.4437 | 89.8763 |
| ArraySort | pr/35297/2eef7dd | 10000000 | 749,980.9434 | 2,472.2225 | 2,064.4178 | 74.9981 |

@stephentoub
Member Author

Great. Thanks for confirming!

@stephentoub stephentoub merged commit f73ceee into dotnet:master Apr 24, 2020
@stephentoub stephentoub deleted the arraysortperf branch April 24, 2020 18:12
while (pivot.CompareTo(keys[++left]) > 0) ;
while (pivot.CompareTo(keys[--right]) < 0) ;
while (Unsafe.IsAddressLessThan(ref leftRef, ref nextToLastRef) && pivot.CompareTo(leftRef = ref Unsafe.Add(ref leftRef, 1)) > 0) ;
while (Unsafe.IsAddressGreaterThan(ref rightRef, ref zeroRef) && pivot.CompareTo(rightRef = ref Unsafe.Add(ref rightRef, -1)) < 0) ;
Contributor

@stephentoub this will no longer throw an index-out-of-range exception for a bogus comparable; instead, it now silently swallows that case.
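
The behavior difference raised here can be sketched in isolation. The example below is an illustration only: `AlwaysGreater`, `ScanIndexed`, and `ScanGuarded` are hypothetical names, the guarded scan uses plain indices rather than the `Unsafe` refs in the diff, and it models just the left-to-right scan, not what `Array.Sort` as a whole reports to callers.

```csharp
using System;

// Hypothetical broken comparable: claims to be greater than everything, including itself.
readonly struct AlwaysGreater : IComparable<AlwaysGreater>
{
    public int CompareTo(AlwaysGreater other) => 1;
}

static class PartitionScanSketch
{
    // Shape of the old scan: array indexing is bounds-checked, so a scan that never
    // finds a stopping element walks off the end and throws IndexOutOfRangeException.
    public static int ScanIndexed<T>(T[] keys, T pivot) where T : IComparable<T>
    {
        int left = -1;
        while (pivot.CompareTo(keys[++left]) > 0) { }
        return left;
    }

    // Shape of the new scan, written with indices instead of refs: the explicit range
    // test ends the scan at the last element, so no exception is raised.
    public static int ScanGuarded<T>(T[] keys, T pivot) where T : IComparable<T>
    {
        int left = -1;
        while (left < keys.Length - 1 && pivot.CompareTo(keys[++left]) > 0) { }
        return left;
    }
}
```

With `var keys = new AlwaysGreater[4];`, `ScanIndexed(keys, default)` throws `IndexOutOfRangeException`, while `ScanGuarded(keys, default)` returns 3 without signaling that the comparable is inconsistent.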

Member Author

Yes. Incorrect comparables wouldn't have always done so, only for some forms of incorrect.

Contributor

Not sure I understand, does that mean you think this is not a problem?

Member Author

Yes. You disagree?

Contributor

I would have thought that it would be a breaking change, you will no longer get an exception. Behavior is different from earlier versions. Is it not?

I have no hard feelings about this, clearly my bar for acceptable changes, perf regressions etc. was set too high in some regards 😅 When I worked on this I was aiming for 100% fidelity with existing output.

Member Author

My thinking when I chose to make this change was that the comparer is busted, and there are a multitude of possible behaviors that could result from a busted comparer... it could throw arbitrary exceptions, it could sporadically yield an incorrect comparison in a way no one would know, it could crash the process, etc. We could even be changing behavior in that regard just by changing how many times we invoked the comparer, or by changing the order in which we invoked it on the input data, or any number of other things. So I'm not concerned with taking a failure we may have only sometimes been able to detect (and for which a goal was never true detection) and making it into something which sometimes fails differently, and does so by not throwing instead of throwing.

@jkotas, any concerns?

Member

I do not have concerns about the subtle behavior change with broken comparers. My primary concerns around the unsafe code are buffer overruns. It is easy to see that the bounds are checked in this case.

> aim for 100% fidelity with existing output.

That was a good place to start and the initial PR that got merged maintained this fidelity.
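
The point about bounds being checked explicitly can be illustrated with a simplified ref walk. This is a sketch under assumptions, not the PR's `PickPivotAndPartition`: `RefScanSketch.ScanLeft` is a hypothetical name, it tests the current element before advancing (the diff pre-increments), and it converts the final ref back to an index purely so the result can be inspected.

```csharp
using System;
using System.Runtime.CompilerServices;

static class RefScanSketch
{
    // Walks a ref through 'keys' until an element that is not less than 'pivot' is
    // found or the last element is reached, then returns that element's index.
    public static int ScanLeft<T>(Span<T> keys, T pivot) where T : IComparable<T>
    {
        if (keys.Length == 0) throw new ArgumentException("keys must not be empty", nameof(keys));

        ref T zeroRef = ref keys[0];
        ref T lastRef = ref Unsafe.Add(ref zeroRef, keys.Length - 1);
        ref T leftRef = ref zeroRef;

        // Unsafe.Add performs no bounds check, so the address comparison is the only
        // guard: it runs before every advance, and leftRef never moves past lastRef.
        while (Unsafe.IsAddressLessThan(ref leftRef, ref lastRef) && pivot.CompareTo(leftRef) > 0)
        {
            leftRef = ref Unsafe.Add(ref leftRef, 1);
        }

        // Convert the ref back to an index: byte offset divided by element size.
        return (int)((long)Unsafe.ByteOffset(ref zeroRef, ref leftRef) / Unsafe.SizeOf<T>());
    }
}
```

For example, `RefScanSketch.ScanLeft<int>(new[] { 1, 2, 5, 7 }, 4)` returns 2, and with a pivot of 100 the walk stops at the last element and returns 3 rather than reading past the end.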

Contributor

Alright, good to know. Perf/code size will be better without the check 👍 I have tests that will fail due to this change, since I test against built-in Sort, but that is my concern then.

@nietras
Contributor

nietras commented Apr 27, 2020

@stephentoub @jkotas guess this means dotnet/corefx#26859 could have been merged anyhow :)

@stephentoub
Member Author

> guess this means dotnet/corefx#26859 could have been merged anyhow :)

There's a ton more unsafe code in that PR than in this one. The "unsafety" in this PR is scoped to just two functions, with extra scrutiny as called out in dotnet/corefx#26859 (comment).

@nietras
Contributor

nietras commented Apr 27, 2020

> a ton more unsafe code in that PR than in this one.

True, but mainly because the code has to be repeated so many times. The code is the same across versions; if one is safe, is the other not? :) Guess source generators might "solve" this.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Successfully merging this pull request may close these issues.

Array.Sort perf regression with .NET Core 5.0 preview