Add fast path for linq count with predicate #102884
Conversation
@EgorBot -arm64 -intel

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 2)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 1000)]
    public int Length;

    private int[]? _array;
    private IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);
}
```

Benchmark results on Intel

Benchmark results on Arm64
@EgorBot -arm64 -intel

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 4)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 10_000)]
    public int Length;

    private int[]? _array;
    private IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);
}
```
This split adds code bloat. We do not have an artificial split like this in other places that use […].
Benchmark results on Intel

Benchmark results on Arm64
The goal behind this is to allow the JIT to inline the […]. This also allows the JIT to optimize away the precondition checks when possible (almost noise given the enumerator allocation, but hey). The second variant, which moves the span loop to a local function, shaves off ~100 B of codegen size impact on the enumerable path. For the span path, this PR is a codegen size win (for a specific instance, not total size).
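The local-function shape being discussed can be sketched roughly like this (hypothetical names, `int`-only for brevity, and the `int[]` type test stands in for the internal span probe; this is not the actual dotnet/runtime code):

```csharp
using System;
using System.Collections.Generic;

public static class CountSketch
{
    // The top-level method contains only the span fast path, with no
    // try/finally, so the JIT can inline it together with the predicate
    // at each callsite.
    public static int CountWithPredicate(IEnumerable<int> source, Func<int, bool> predicate)
    {
        if (source is int[] array) // stand-in for the internal span check
        {
            ReadOnlySpan<int> span = array;
            int count = 0;
            foreach (int item in span)
            {
                if (predicate(item))
                    count++;
            }
            return count;
        }

        return EnumerableCount(source, predicate);

        // The enumerator loop lives in a local function: foreach over an
        // IEnumerable expands to try/finally (for disposal), and that EH
        // would otherwise block inlining of the whole method.
        static int EnumerableCount(IEnumerable<int> source, Func<int, bool> predicate)
        {
            int count = 0;
            foreach (int item in source)
            {
                if (predicate(item))
                    count++;
            }
            return count;
        }
    }
}
```

With this shape, an array callsite can end up with the predicate inlined directly into the span loop, while non-array sources pay one extra call into the local function.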
@EgorBot -arm64 -intel

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Perf_Count>(args: args);

[MemoryDiagnoser]
[DisassemblyDiagnoser(maxDepth: 4)]
public class Perf_Count
{
    [Params(1, 5, 10, 100, 10_000)]
    public int Length;

    private int[]? _array;
    private IEnumerable<int>? _select;

    [GlobalSetup]
    public void Setup()
    {
        _array = Enumerable.Range(0, Length).ToArray();
        _select = _array.Select(i => i);
    }

    [Benchmark]
    public int Array() => _array!.Count(i => i % 2 == 0);

    [Benchmark]
    public int ArrayTwoDelegates()
    {
        var array = _array!;
        return array.Count(i => i % 2 == 0) +
               array.Count(i => i % 8 == 0);
    }

    [Benchmark]
    public int Select() => _select!.Count(i => i % 2 == 0);

    [Benchmark]
    public int SelectTwoDelegates()
    {
        var select = _select!;
        return select.Count(i => i % 2 == 0) +
               select.Count(i => i % 8 == 0);
    }
}
```
❌ Benchmark failed on Intel

Benchmark results on Arm64
Sadly, it seems to be some general, unrelated BDN issue: it just dies sometimes (I think Stephen hit it once too). Looks like it's […].
@jkotas After reviewing the ARM64 results and their disassembly, it appears that my assumption about per-callsite delegate inlining was wrong. I think the JIT gathers the profile in the instrumented Count/CountSpan compilations before they get inlined, causing only a single instance of the delegate to be inlined in both loop bodies (winning non-deterministically?):

```asm
M00_L02: ;; first loop start
        ldr w1, [x23, w22, uxtw #2]
        ldr x0, [x21, #0x18]
        add x2, x26, #0x18
        cmp x0, x2
        b.ne M00_L03
        tst w1, #7 ;; i % 8 == 0 instead of i % 2 == 0
        b.ne M00_L05
        b M00_L04
M00_L03:
        ldr x0, [x21, #8]
        ldr x2, [x21, #0x18]
        blr x2
        cbz w0, M00_L05
M00_L04:
        add w25, w25, #1
M00_L05:
        add w22, w22, #1
        cmp w22, w24
        b.lt M00_L02
M00_L06:
        ldr x22, [x20, #8]
        cbz x22, M00_L18
M00_L07:
        cbz x22, M00_L24
        movz w23, #0x1
        ldr x1, [x19]
        movz x0, #0x61d0
        movk x0, #0x8d05, lsl #16
        movk x0, #0xe8f5, lsl #32
        cmp x1, x0
        b.ne M00_L19
        add x20, x19, #0x10
        ldr w21, [x19, #8]
M00_L08:
        cbz w23, M00_L23
        mov w24, wzr
        mov w19, wzr
        cmp w21, #0
        b.le M00_L13
        movz x26, #0xb360
        movk x26, #0x8d61, lsl #16
        movk x26, #0xe8f5, lsl #32
M00_L09:
        ldr w1, [x20, w19, uxtw #2]
        ldr x0, [x22, #0x18]
        add x2, x26, #0x18
        cmp x0, x2
        b.ne M00_L10
        tst w1, #7 ;; same i % 8 == 0, as expected
        b.ne M00_L12
        b M00_L11
M00_L10:
        ldr x0, [x22, #8]
        ldr x2, [x22, #0x18]
        blr x2
        cbz w0, M00_L12
M00_L11:
        add w24, w24, #1
M00_L12:
        add w19, w19, #1
        cmp w19, w21
        b.lt M00_L09
```

Given the above, please let me know what you think is the appropriate course of action:
My preference is to keep all LINQ code consistent.
Personally, I think all top-level LINQ methods that have TryGetSpan plus EH on the IEnumerable path in them are making a suboptimal implementation choice, but that's not for me to decide. Either way, thanks!
@jkotas do you mind if I update this PR to include a similar optimization for […]?
As noted in #102696 (comment), main...stephentoub:runtime:fasterlinqenumeration highlights that it's not just Count/Any/All but many more APIs that would need to be updated to make things consistent. I've not submitted a PR for that yet, though, because of the regressions it brings in addition to the improvements. Most of those regressions came from special-casing IList, but even the array/list type checks that the span probe incurs add some overhead that will show up if these APIs are used frequently on non-arrays/lists.
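The up-front type checks being weighed here look roughly like this (a sketch modeled on the internal `TryGetSpan` helper; the name and shape are assumptions, not the actual dotnet/runtime code):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class SpanProbe
{
    // Two type checks up front: on arrays and List<T> they unlock a flat
    // span loop; on every other IEnumerable<T> they are pure overhead paid
    // before falling back to the enumerator path.
    public static bool TryGetSpan<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
    {
        if (source is T[] array)
        {
            span = array;
            return true;
        }
        if (source is List<T> list)
        {
            // CollectionsMarshal.AsSpan views the list's backing array;
            // the list must not be mutated while the span is in use.
            span = CollectionsMarshal.AsSpan(list);
            return true;
        }
        span = default;
        return false;
    }
}
```

This is the trade-off in the comment above: callers that mostly pass arrays/lists win from the span loop, while callers that mostly pass lazy sequences pay the two failed type tests on every call.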
Fair enough, but on the other hand this change (particularly the variant that moves the EH that blocks inlining into a local function) replaces a pattern that is difficult for the JIT to optimize with one that is easier to optimize and more likely to improve (I find it surprising it still does not optimize away arr.GetType() != typeof(int[]) given it's a direct scalar comparison). It pays with about two setup branches to iterate directly on the span. It's cheaper than […].
Please let me know what the next steps are for this change, if any. Thanks!
I added use of TryGetSpan to the other relevant sinks.
Thanks.
Fixes #102696
A simple change to take the fast path when the source is span-able. It also moves the non-span loop into a local function, making it easier for the JIT to inline the method alongside the lambda (when available, per callsite) and avoiding the situation where only a single lambda gets devirtualized because the loop is not inlined.