-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Outdated] Implementation of CSE for GT_CNS_INT benefits ARM64. #36831
Conversation
|
src/coreclr/src/jit/gentree.cpp
Outdated
} | ||
else | ||
{ | ||
// Arm64 allows any arbitrary 16-bit constant to be loaded into a register halfword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HWIntrinsics recently got some support in lowering where we will generate better code if all inputs to a NI_Vector128_Create
call are constant.
Is there any handling that should be added here to help account for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because the Lower phase runs after the CSE phase
e192f27
to
61527dd
Compare
case GT_CNS_STR: | ||
// Uses movw/movt | ||
costSz = 7; | ||
costEx = 3; | ||
goto COMMON_CNS; | ||
|
||
case GT_CNS_LNG: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering why this is needed since the PR is for Arm64? Isn't GT_CNS_LNG
never created for 64-bit JIT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I worded this poorly. I meant to say the PR title is implying this is just Arm64 work and should likely be updated if its also touching arm32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It current;y enables CSE's of constants across all platforms.
On ARM64 we get the largest benefits..
37bc9cc
to
e070a73
Compare
@@ -2705,6 +2705,8 @@ void Compiler::fgInitArgInfo(GenTreeCall* call) | |||
size_t addrValue = (size_t)call->gtEntryPoint.addr; | |||
GenTree* indirectCellAddress = gtNewIconHandleNode(addrValue, GTF_ICON_FTN_ADDR); | |||
indirectCellAddress->SetRegNum(REG_R2R_INDIRECT_PARAM); | |||
// Don't attempt to CSE this constant | |||
indirectCellAddress->SetDoNotCSE(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting that we are not CSEing indirect cell addresses.
We can perhaps CSE the following code pattern:
...
9000000B adrp x11, [RELOC #0x1de756537c0]
9100016B add x11, x11, #0
F9400160 ldr x0, [x11]
D63F0000 blr x0
...
9000000B adrp x11, [RELOC #0x1de756537c0]
9100016B add x11, x11, #0
F9400160 ldr x0, [x11]
D63F0000 blr x0
AA0003EF mov x15, x0
...
...
9000000B adrp x11, [RELOC #0x1de756537c0]
9100016B add x11, x11, #0
F9400160 ldr x0, [x11]
mov xR, x0 ; store x0 in some register xR
D63F0000 blr x0
...
mov x0, xR ; retrieve xR into x0
D63F0000 blr x0
AA0003EF mov x15, x0
...
With this, we can get an improvement of 8 bytes + 1 elimination of memory access.
I wrote an analyzer asm to find out how many addresses are CSE candidates and the number is huge. From what I noticed, it would by little over 2MB of size reduction.
Processed 191816 methods. Found 29246 methods containing 259123 groups.
Details: cse-candidates.txt
Update: Certainly not for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AnalyzeAsm code that I used to find out CSE candidates: dotnet/jitutils#273
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this limitation because it was causing an assert failure in LSRA.
I will re-check this to see if it still happens and add a comment if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indirection of r11 is never materialized in our current IR:
so we can't create a CSE for these indirections
Generating: N011 ( 2, 8) [000039] -------N---- t39 = CNS_INT(h) long 0xd1ffab1e ftn REG x11 $140
IN0002: adrp x11, [RELOC #0xd1ffab1e]
IN0003: add x11, x11, #0
/--* t39 long
Generating: N013 (???,???) [000144] ------------ t144 = * PUTARG_REG long REG x11
/--* t144 long arg0 in x11
Generating: N015 ( 16, 11) [000005] --C--------- t5 = * CALL help r2r_ind ref HELPER.CORINFO_HELP_READYTORUN_NEW REG x0 $1c0
IN0004: ldr x0, [x11]
Call: GCvars=0000000000000000 {}, gcrefRegs=180000 {x19 x20}, byrefRegs=0000 {}
IN0005: blr x0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If CSE of Constants are enable for arm32 we will hit an assert when we CSE this address:
(So I will add an #ifdef to prevent this on Arm32)
Assertion failed 'candidates != candidateBit' in 'System.Array:ConvertAll(System.__Canon[],System.Converter
2[__Canon,__Canon]):System.__Canon[]' during 'LSRA build intervals' (IL size 64)`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Is there a follow-up issue to investigate this ARM failure?
should we mark this also as Refers to: src/coreclr/src/jit/morph.cpp:8397 in 99b1da8. [](commit_id = 99b1da839ecc7903e531417c8cc910580054101e, deletion_comment = False) |
Sample improvement in BenchmarksGame.RegexRedux_5:
Here is the new Jitted code generated for this method:
|
// Default 0, enable the CSE of Constants, including nearby offsets. (only for ARM64) | ||
// If 1, disable all the CSE of Constants | ||
// If 2, enable the CSE of Constants but don't combine with nearby offsets. (only for ARM64) | ||
// If 3, enable the CSE of Constants including nearby offsets. (all platforms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(all platforms) [](start = 62, length = 16)
just for completion, add (all platforms)
. I didn't check at other places yet, but we will not have "CSE + nearby offset" optimization for other platforms by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have disabled CSE of constants by default for non-ARM64 platforms.
see optcse.cpp lines 726-742
bool disableConstCSE = false;
int configValue = JitConfig.JitDisableConstCSE();
// all platforms - disable CSE of constant values when config is 1
if (configValue == 1)
{
disableConstCSE = true;
}
#if !defined(TARGET_ARM64)
// non-ARM64 platforms - also disable CSE of constant values when config is 0 or 2
if ((configValue == 0) || (configValue == 2))
{
disableConstCSE = true;
}
#endif
That looks awesome @briansull , great improvement!. Unrelated question to this, is |
#endif // TARGET_ARM64 | ||
|
||
// All Platforms - also allow to combine with nearby offsets, when config is 3 | ||
if (configValue == 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
configValue [](start = 8, length = 11)
are you considering adding a JITStress pipeline or something equivalent to test this config value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't planning on adding that. I want to be able to experiment on enabling this for the other platforms.
@TamarChristinaArm Yes CORINFO_HELP_ARRADDR_ST has no return value, but it follows the normal calling convention, which kills the incoming argument values in x0, w1 and x2. So they have to be reloaded at each call site. |
Ugh.. sorry I should stop responding near bed-time :) it's a argument register, no guarantee it would ever be live after a call. sorry for the noise :) |
cf42eab
to
5092e9c
Compare
@dotnet/jit-contrib PTAL |
Test including JitStress=1 are all passing, |
Will rebase to resolve conflicts, |
…GT_CNS_STR ARM64 -11528 : System.Private.CoreLib.dasm (-0.10% of base) 252 total methods with Code Size differences (222 improved, 30 regressed), 20696 unchanged. X64 886 : System.Private.CoreLib.dasm (0.03% of base) 229 total methods with Code Size differences (16 improved, 213 regressed), 20918 unchanged.
Fixes 4 x86 test failures
…ested loop We also will record that the block has a call if the duplicated code contains a call.
… specific register, so clear the RegNum if it was set in the original expression.
Add support for COMPLUS_JitDisableConstCSE
…of every method in the output file
…values that differ only in the low 12 bits.
Const CSE may create an assignment node for the target of a calli: indirect call
…al GT_MUL when fgGlobalMorph is true
More fixes for GT_CNS_INT costs
…he definition, sets csdConstDefValue and csdConstDefVN Implemented shared consts CSE's with a 16-bit offset for x64, can be enabled using COMPLUS_JitDisableConstCSE=3
Update JitDisableConstCSE to support various options 1-5
…tions Coverage for Floating point CSEs, conditional branch AssertionProp and NaN handling
f149ef2
to
435bdd5
Compare
@dotnet/jit-contrib PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid using magic because it is very hard to read and understand.
Nit: I think the review could be faster if you create several PRs put of this: with magic division, correctness fixes and CSE optimizations.
// Also set the value number on the relop. | ||
if (curAssertion->assertionKind == OAK_EQUAL) | ||
// set foldResult to either 0 or 1 | ||
bool foldResult = assertionKindIsEqual; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a correctness fix? Were we getting wrong VN pair for !=
in the past?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is incorrectly assigning the wrong constant value number. I found similar code
It was occasionally happening in the framework dlls and the the beq_r8 test cases.
Our code has two chances to fold the expression, the first one will fold compares with two constant arguments,
which is the common case. If that doesn't fold then it tries folding using the value numbers and that path was wrong and caused the beq_r8 tests to fail with my CSE changes.
@@ -4947,6 +4955,12 @@ GenTree* Compiler::optVNConstantPropOnJTrue(BasicBlock* block, GenTree* test) | |||
// | |||
Compiler::fgWalkResult Compiler::optVNConstantPropCurStmt(BasicBlock* block, Statement* stmt, GenTree* tree) | |||
{ | |||
// Don't perform const prop on expressions marked with GTF_DONT_CSE | |||
if (!tree->CanCSE()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was it unprofitable or incorrect to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a constant prop here would replace the CSE LclVar with the original constant.
Essentially undoing the CSE of the constant.
@@ -6353,6 +6356,19 @@ class Compiler | |||
// number, this will reflect it; otherwise, NoVN. | |||
}; | |||
|
|||
#if defined(TARGET_XARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this block live in target.h?
@@ -2853,6 +2861,19 @@ struct GenTreeOp : public GenTreeUnOp | |||
assert(oper == GT_NOP || oper == GT_RETURN || oper == GT_RETFILT || OperIsBlk(oper)); | |||
} | |||
|
|||
// returns true if we will use MagicNumber multiplication for this node. | |||
bool usesMagicNumberDivision(Compiler* comp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we prefer to start function with an upper case unless we create them with a prefix or in an old class that does not follow the current cc, so UsesMagicNumberDivision/CheckMagicNumberDivision
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as part of #39021
@@ -922,6 +928,8 @@ struct GenTree | |||
#define GTF_OVERFLOW 0x10000000 // Supported for: GT_ADD, GT_SUB, GT_MUL and GT_CAST. | |||
// Requires an overflow check. Use gtOverflow(Ex)() to check this flag. | |||
|
|||
#define GTF_DIV_USE_MAGIC 0x80000000 // GT_DIV -- Uses MagicNumber multiplication to compute this division by a constant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am strongly against introducing any magic to our source base, Use_Reciprocal_Mul_Or_Shift
or something similar to old DIV_BY_CONST\DIV_USE_NO_DIV
looks better in the long term for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are commits about this magic division related to implementation of CSE and should be in this PR or could be extracted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this code to prevent regressions. It is a big win to perform a division by a constant using the reciprocal multiplication. There were cases where we would CSE the constant and replace it with a CSE local variable, thus preventing the optimization. I didn't invent the term Magic number division as it was already is use in our code base. I will try to come up with a new term to use as the GTF flag.
I could extract the part of the change and check it in separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I factored out all of the prerequisite code changes for this into a new PR
This PR has been replaced by #39096 |
Can this be closed then? |
No description provided.