-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iterating with ForEach over ImmutableArray is slower than over Array #780
Comments
/cc @adamsitnik who opened the original issue about all Immutable collections. |
Inlining I am suprised that Where most of the time is spent for the Also, I wonder what perf we would get if |
Thanks for the pointer @adamsitnik. I didn't know I could use PerfView for the Inline Events and was having trouble with the InliningDiagnoser because it emits the output to the screen it was too much info. Looking now, it seems that the JIT is successfully inlining the
I've tried to do some research on what that reason is, but could not find much data. Maybe the JIT is considering that MoveNext and GetCurrent are more important to inline than the single call to GetEnumerator? About the String caseI've tried looking at the ETL traces when I started this investigation, but for some reason I'm not able to see anything inside the benchmark method itself, because it's grayed out in PerfView and can't expand further. This is what I get in the call stacks. Even if I load all symbols for the modules that are with The JIT Events for that case show the same pattern than for Int32: the sub-methods seem to be successfully selected for inlining, but at Just use this.array.GetEnumeratorFinally, at some point I did test just forwarding the call to this.array.GetEnumerator, but that didn't seem to be enough to optimize the flow. It also changes the exceptions that the methods can return (because the are some checks in MoveNext and GetCurrent that would not apply anymore) so it would require adjusting some unit tests and maybe a review if that's a breaking change or not. But I still might give that another try at least to use as a base to compare the JIT behavior. |
Got some interesting information after the suggestions. Why inlining GetEnumerator helpsI was curious why a method that is called only once and it's not that heavy would make such a difference in the results. In the VS Profiler, it shows that most of the time is spent on the assignment inside the loop, and the assembly shows the difference in terms of the instructions executed. The assembly snips I've posted earlier show that the loop for The loop in the baseline shows that the address of the array and the loop variable are loaded from the stack and stored back on every iteration. It seems that the overhead of these extra memory access compounds during the loop, which is reasonable even if if using the L1 cache (it also might explain why that case has more cache misses). I'm not sure if this is expected or not, but it seems that the non-inlined function blocks the JIT from seeing that the array can be accessed directly from the registers from earlier in the code and optimize the calls. Once we remove that "barrier" the code gets optimized. Code sizeIt also seems that inlining also helps with the code size, and the reported number of bites for the benchmark is smaller, because the extra memory operations mentioned above get optimized away. Benchmark for StringAfter testing a few variants of the benchmark to understand the differences in the code, I've noticed that the JIT is able to inline In PerfView, the JIT Inlining events report a different reason for not optimizing the String case when my fix is on: before it was due to an "Unprofitable Inlining" and now it reports a "Runtime Dictionary Lookup". Below is the result a similar benchmark that doesn't use a generic class and just the target type directly. The times drop to the same range as the Int32 case. So it seems that the proposed fix would benefit both cases. @adamsitnik is this something that you've encountered before when working with the benchmarks? BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
[Host] : .NET Framework 4.8 (4.8.4075.0), X64 RyuJIT
Job-RUPWLA : .NET Core 5.0.0 (CoreCLR 5.0.19.61901, CoreFX 5.0.19.61901), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Toolchain=CoreRun IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
; System.Collections.IterateForEach_String.ImmutableArray()
sub rsp,28h
xor eax,eax
mov rdx,qword ptr [rcx+0C0h]
mov ecx,dword ptr [rdx+8]
mov r8d,0FFFFFFFFh
jmp M00_L01
M00_L00:
cmp r8d,ecx
jae M00_L02
movsxd rax,r8d
mov rax,qword ptr [rdx+rax*8+10h]
M00_L01:
inc r8d
cmp ecx,r8d
jg M00_L00
add rsp,28h
ret
M00_L02:
call CoreCLR!JIT_RngChkFail
int 3
sbb dword ptr [rcx+rax],eax
add byte ptr [rdx+rax*2],al
add byte ptr [rax],al
add byte ptr [rax],al
add byte ptr [rax],al
add byte ptr [rax],al
add byte ptr [rax-0Ah],bl
and dh,byte ptr [rdi]
???
jg M00_L03
M00_L03:
add byte ptr [rbp+48h],dl
; Total bytes of code 82 |
cc @AndyAyersMS for observations about inlining above |
@hnrqbaggio: thanks for the observations! For the
We might consider boosting the "promotable struct" multiplier somewhat to give this sort of inline an extra nudge in the jit, since we get a lot of benefit out of promotion. I'll put this on my todo list. For the We can sometimes work around this if the method being inlined doesn't actually use the results of the runtime lookup. Let me dig in and see if that's the case here. I am also looking into the inlined int case, seems like we ought to be able to match the array version codegen but can't remove the bounds check. |
In the inlined static int F(int[] a)
{
int r = 0;
for (int i = -1; ++i < a.Length; )
{
r = a[i];
}
return r;
} And without this transformation the jit won't optimize out the bounds check. This is limitation is going to hold for any sort of inlined enumerator because |
As for the "inlined" It turns out that |
So, areas for codegen follow-up:
|
Is the upshot that until those changes, it is worth force inlining it? |
Yes, I think adding forceinline here is reasonable. Seems like whoever wrote the code was expecting inlining to happen... Lines 15 to 20 in bc47321
|
@hnrqbaggi do you want to offer a PR? |
Yes, I should have one ready soon. |
As a follow-up: the promotable struct benefit in the inliner needs to be at least 5.5 (currently 3) for this case to be handled by default. Not surprisingly this has fairly widespread impact. It's hard to be surgical when changing the inlining heuristics. Lots of good diffs, lots of bad diffs.
|
Motivated in part by dotnet#780.
Second follow-up: Have a prototype master...AndyAyersMS:FgOptWhileLoop that This plus the change from #1183 gives identical codegen for the Overall diffs look plausible too; still chasing through the regressions to see what's up, but so far:
|
Also |
This is a spin off issue from https://github.com/dotnet/corefx/issues/36416, to scope down to just the ImmutableArray case.
As mentioned in other issues, the immutability sometimes comes with trade-offs in some operations, but in this case it seems that the extra overhead can be optimized, at least for when it's a collection of value type.
Comparing with Array
Looking at the ASM code generated, the Array version is high inlined, while for the ImmutableArray the JIT is able to inline the loop itself, but not the call to
ImmutableArray<T>.GetEnumerator()
. The method itself is quite simple, but calls to an internal method calledThrowNullRefIfNotInitialized()
to validate that the underlying array is not null.It seems that the extra method causes the collection to have more branch mis-predictions and cache misses than the array case (the cache misses show up in the results when the collection is larger, say 2048 instead of the default 512 elements).
Possible fix
If we change the implementation of GetEnumerator to use MethodImplOptions.AggressiveInlining, the JIT is able to inline the call and the stats of both benchmarks match and the Median for the Int32 case improves by 4x.
Unfortunately, this seems to not be enough for the Reference Type case. When the benchmark runs using String instead of Int32 the results are still the same as before, so it seems that marking the methods as inline is still not sufficient for the JIT to optimize them.
Correctness
The code in
ThrowNullRefIfNotInitialized
is just accessing the underlyingArray.Length
property and relying on that to throw if the object is null.In the optimized version, I can't see the exact same instructions, so would still need to confirm that the optimizer is not discarding that check. However, the tests that validate that GetEnumerator will throw NRE in that condition did pass.
I have the changes above in my fork of the repo, but this is my first contribution so would like to check the feedback on the findings and not send a PR right away. 😁
Baseline
Array
ImmutableArray
With Aggressive Inlining
Array
ImmutableArray
The text was updated successfully, but these errors were encountered: