-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove mono specific SpanHelpers #78015
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue Details#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917. This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.
|
can this backport to release/7.0? |
There is no need to backport, net7 has a workaround for the perf problems. |
goto Found6; | ||
if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + 7)) | ||
if (uValue == Unsafe.AddByteOffset(ref current, 7)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these changes actually help? On coreclr at least it's not clear to me just from the diff that it's a net win.
https://www.diffchecker.com/du4Cevvb
Can you share throughput numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There shouldn't be any improvement on coreclr. I wouldn't be surprised if the generated code is more or less identical, since the JIT might make use of common subexpression elimination optimization. Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.
The BCL change is intended for less optimized compilers to be able to generate good quality code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There shouldn't be any improvement on coreclr.
I'm more concerned it might regress. At least according to the diff I linked above, the new asm is a bit longer and appears to use a different addressing mode. I don't have a good sense for what that might mean in terms of throughput.
Also the code paths tweaked in this PR are for non vectorized cases, which are mostly not hit at all with coreclr.
Aren't they used for shorter inputs that don't have enough data to fill a vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The local perf numbers on some benchmarks are unclear, I'm waiting for a full performance run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SpanHelper
changes look OK, but I would like to see benchmark numbers that prove that we really need these goto
s.
Thank you for fixing the main codegen issue @BrzVlad ! (Please get another reviewer for this part).
|
||
offset += 1; | ||
} | ||
return -1; | ||
Found7: | ||
return (int)(offset + 7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need these gotos? One of the primary goals of #73768 was to simplify the code and get rid of the gotos (I am writing guidelines for how to use Vector12/256
and my job was to verify that users can write simple but performant code using the new APIs). What difference do the benchmarks report when using goto? If the difference is worth it, could you please log a codegen issue that would allow us to get rid of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik Consider for example this very simple benchmark: https://gist.github.com/BrzVlad/59ab5dea7b8dbbd632877f66882c1c6e
On both mono jit and mono interp the goto version is significantly faster (about 35%). Here is the diff on mono interp https://www.diffchecker.com/hsvgFTPt. As described in the commit, the generated code for the hot loop with the goto is more compact and with less branches so it is faster.
On coreclr there is no diference in performance, which I expected. Looking at the generated code is a little bit creepy since the the codegen for both methods is absolutely identical https://www.diffchecker.com/GhauFh1f.
goto Found1; | ||
if (value.Equals(Unsafe.Add(ref searchSpace, length))) | ||
if (value.Equals(current)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume there is a difference between CLR and MonoVM which required you to apply this pattern to all the helper methods here. Do we have an issue that tracks it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reported issue for some of these changes. I applied the same pattern simply for consistency with the rest of the code. And this would also result in some perf gains on mono in some other ares as well.
The interpreter changes look ok to me. |
d254eac
to
9e17559
Compare
…ssions (dotnet#75917)" This reverts commit 254844a.
Before, for every single element, the address `address + offset + ct` had to be computed. In theory, a smart compiler would be able to reuse `address + offset` value and offset it only by a constant to obtain every single element. Do this explicitly to avoid reliance on advanced optimizations.
This would replace code like ``` load b.neq add ret load b.neq add ret load .... ``` with ``` load b.eq load b.eq load ... ``` This makes the code more compact in the hot loop, reduces overall code size and thus improves performance. This pattern is widely used and it was also used before with Span lookups.
Before we were marking bblocks as dead if they had their in_count 0. This is not enough however, since it doesn't account for loops. We now do a full traversal of the bblock graph to detect unreachable bblocks.
Consider for example the following pattern used commonly with conditional branches: ``` br.s [nil <- nil], BB0 ... ceq0.i4 [32 <- 40], br.s [nil <- nil], BB1 BB0: ldc.i4.0 [32 <- nil], BB1: brfalse.i4.s [nil <- 32], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` This commit reorders this code to look like: ``` br.s [nil <- nil], BB0 ... ceq0.i4 [32 <- 40], brfalse.i4.s [nil <- 32], BB_EXIT br.s [nil <- nil], BB2 BB0 ldc.i4.0 [32 <- nil], BB1: brfalse.i4.s [nil <- 32], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` This means we will have duplicated brfalse instructions, but every basic block reaching the conditional branch will have information about the condition. For example ceq0.i4 + brfalse is equivalent to brtrue, ldc.i4.0 + brfalse is equivalent to unconditional branch. After other future optimizations applied on the bblocks graph, like removal, merging and propagation of target, the resulting code in this example would look like: ``` br.s [nil <- nil], BB_EXIT ... brtrue.i4.s [nil <- 40], BB_EXIT BB2: ldstr [56 <- nil], 2 ``` Which is a great simplification over the original code.
… targets Even though they can be become unreachable in the current method, they can still be called when the unoptimized method gets tiered at this point. Add assert to prevent such issues in the future.
If we are unlikely to gain anything from propagating the condition (if we don't have information about any of the condition operand vars), simply avoid the optimization.
If we store in a var and this var is not used and redefined by the end of the basic block, then we can clear the original store.
We detect if a var's value never escapes the definition of a bblock. We mark such vars and clear unused definitions of that var from other bblocks.
If a bblock contains only an unconditional br, then all bblocks branching into it can just call the target directly instead.
9e17559
to
954483a
Compare
The change to the indexing approach in span helpers hot loops has pretty random perf consequences on CoreCLR. In order to reduce side effects of this PR, which has the main purpose of fixing performance of these helpers with mono interpreter, I removed the indexing change and added a few more powerful instructions to the interpreter to compensate for this. Closing this in favor of #79215 |
#73768 did various changes that hurt span performance on mono. The change was reverted afterwards on mono by restoring old code in #75917.
This PR removes mono specific code and solves performance problems via small tweaks within non vectorized code inside SpanHelpers, as well as by adding a couple of optimizations to mono interpreter.