JIT: Remove unneeded unconditional jumps #69041
Conversation
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details: fixes #4326, attempt 2
@dotnet/jit-contrib be gentle 😁.
@kunalspathak seems like this intersects with your recent work, so I suggest you take the lead here. @Wraith2 I am curious if you pursued leaving the jump instr descs in place but magically shrinking them to be zero bytes long; that seems like it might be less complicated overall and less surgery would be needed?
Thanks for doing this. A few questions:
As the warning in the summary states, the native toolset was updated between the baseline and diff, so we should get new numbers before we make any conclusions about this. @Wraith2 new baselines have already been built after yesterday's merged jit changes; can you merge from main and push a new commit?
I didn't think of trying that. I don't know what would happen if it tried to write out a zero-length instruction. An instruction has more properties than simply its size, so I think this is cleaner even though it's harder. At one point while working on it I was getting assertions about unreferenced labels, so if I leave them in place it could spring leaks elsewhere. It would remove the requirement for live range fixup.
That's deliberate. The changes are xarch-only because I have no ability to run, and thus debug, on arm hardware. In theory there shouldn't be much to change to make it work on arm apart from the uses of instruction size accessors. From what I read, the call-with-following-instruction requirement also applies to arm64.
My first attempt at this change did exactly that. It didn't work because it caused problems with the exception handling clauses. I considered trying to fix up the exception handling (like I've had to do with live ranges) but that is invasive and likely to introduce more bugs than simply moving this into its own function earlier.

The code in
There is a way to linearize the performance of
Since you're looking at this @kunalspathak, can you keep in mind that it might interact with nop padding for loop alignment? It's all done before padding is added, but I don't know if reducing the size of the instruction group might cause the requires-padding flag to need to be added/removed. There is also the future possibility of changing
Rebased and pushed.
The new TP numbers look much more reasonable. It is mildly concerning that we saw a 1.5% increase before simply by updating the compiler (which one would hope should decrease the number of instructions executed). Of course the number of instructions does not necessarily translate to cycles, so hopefully that is the case here.
Those are better numbers. As I said, there is also a way to make at least one part of it faster, and I'm intending to refine this if it's a useful set of changes. My worry was that I didn't want to spend more time on it and end up with a huge change set that simply wasn't worth using.
Just curious: what is an example of the BB that doesn't generate instructions? I've always wondered if we should have a simple flow graph cleanup phase just before codegen. |
I'm worried about how complex this is.
I agree about considering this. Possibly bash them from jumps to NOPs, if necessary?
This is the original example; as noted in the first post, this particular one no longer occurs because block ordering has recently changed, which eliminates it. (jitdump.txt attached.) To quote @AndyAyersMS from the Discord discussion about it:
One thing that bothers me is that in the
This looks great. Just to summarize my individual comments:
- Use `emitIsLastInsCall()` to determine which jumps are worth removing. That would eliminate the logic of checking whether the last instruction was a call/align inside the `emitRemoveJumpToNextInst()` method.
- Have a dedicated `emitUncondJumpList` that will reduce the number of jumps you have to iterate in `emitRemoveJumpToNextInst()`.
- Process all the jumps in `emitUncondJumpList` in a single pass inside `emitRemoveJumpToNextInst()`.
- During `genUpdateLiveRangesForTruncatedIGs()`, just iterate `emitUncondJumpList` and get the corresponding IGs that might have a modified ICount. (I am not sure of the overall concept of the live range update and would get it reviewed by @BruceForstall.)
- During outputting, make sure that we don't see a jump that branches to the next instruction but was not visited in earlier phases (using some DEBUG-only variable or equivalent).
Ah, right. Why is it not
My limited capability and resources. I had enough trouble getting it working for x64 that I didn't want to try to include other targets like arm or x86. In theory it should work on x86 if I change that to

How about we let this CI run to completion and see if the regression you mentioned earlier is still present, and if it is, address that problem. Once everything looks ready on the x64 side I'll flip the define and we'll see what happens on x86?
Sounds good.
I still see the regression in benchmarks. You should easily be able to get a jitdump of it using superpmi.
I just had a look at it and I'm not sure why it's counted as a regression; it seems better overall, with the only odd thing being the lack of it printing alignment.

Debugging through the disasm code, I think the new behaviour is correct and the old was not. Specifically, looking at IG05 from the dumps attached, the current jit produces this disasm:
Note that on the top line, where we see the IG flags, there is no

The code responsible for this is: runtime/src/coreclr/jit/emitxarch.cpp, lines 13581 to 13598 in c80a49d
Where

Why is the current jit displaying alignment when it shouldn't? I don't know, and I don't think I changed/fixed this.
src/coreclr/jit/emitxarch.cpp (outdated)

```diff
 #ifdef DEBUG
-    if (emitComp->opts.disAsm || emitComp->verbose)
+    if ((emitComp->opts.disAsm || emitComp->verbose) && !emitInstHasNoCode(id))
```
After some debugging of your branch, I found out that the addition of `!emitInstHasNoCode(id)` is the reason why we stopped displaying the `align` instruction in the disassembly, and that is the reason we were seeing the 3-byte difference.

You will have to separate out the conditions for `jmp` and `align`, and use the combination only in common places but use the `jmp` condition in jump-specific code paths.

Here is something that you might want to consider: kunalspathak@aaa8f70
Ok. pushed.
If the group isn't marked as align and there are zero alignment bytes to be allocated, what is the purpose of printing the instruction?
Retaining or removing `align` instructions is heuristic-sensitive, vs. `jmp` instruction removal, which is correctness-sensitive. During perf investigation, it is often useful to see a sensitive loop that was marked for alignment but was later decided not to be aligned, by spotting those `align [0 bytes]` in the disassembly. It turns out useful to spot such cases and adjust the heuristics to make them align. We also add `NOP` compensation (if needed) until the last IG that needs alignment. By printing the zero-size `align`, it helps us reason out the purpose of the compensation instructions added.
CI with the suggested change has run. The regression is still present, but it now includes the alignment. 13400.zip. The sizes of some groups show up in the diffs, so I extracted the IG sizes and summed them.

So the new file is making some sense, but if you use the old total value then the new one looks like a regression. The 32-bit diff was also run; I haven't looked into the regressions it contains yet.
nit: Some minor comment changes.
@BruceForstall, I think this is ready to be merged. Do you want to take another look? If not, I will merge this once the comments are addressed and CI is green.
/azp run runtime-coreclr jitstress
Azure Pipelines successfully started running 1 pipeline(s).
LGTM. Thanks for your efforts.
Woo! Thanks. It's been a learning experience. 😁
fixes #4326
When compiling some simple methods, a jmp instruction is generated that ends up jumping to the next instruction, which is not needed. The example provided is:
N.B. this specific example no longer replicates on the current main branch due to recent changes in basic block reordering, but it's a nice clean example of what can occur.
This produced the assembly:
G_M22359_IG01:              ;; offset=0000H
                            ;; size=0 bbWeight=1 PerfScore 0.00
G_M22359_IG02:              ;; offset=0000H
       8B4108          mov      eax, dword ptr [rcx+8]
       83F82A          cmp      eax, 42
       7D07            jge      SHORT G_M22359_IG04
                            ;; size=8 bbWeight=1 PerfScore 3.25
G_M22359_IG03:              ;; offset=0008H
       B82A000000      mov      eax, 42
+      EB00            jmp      SHORT G_M22359_IG04
                            ;; size=7 bbWeight=0.18 PerfScore 0.41
G_M22359_IG04:              ;; offset=000FH
       C3              ret
where you can see that the jmp to `SHORT G_M22359_IG04` is not needed because that is the next instruction; we save 2 bytes and possibly some cpu jump-tracking resources by not including the jump.

The jump is added in the emitter phase and cannot be seen before that, so the changes I have made are all very late in the compilation process. The jump exists because a basic block between the blocks that create instruction groups 3 and 4 generates no instructions. Emission of instructions is linear, so at the time when IG03 is generated there is a jump present from BB03 to BB05, and this causes an unconditional jump to be generated at the end of IG03. When BB04 is processed it generates no instructions, because nothing is needed, and the process moves on to BB05 (which will generate IG04), but we're left with a weird-looking jump.
I have added a flag to the `instrDesc` which is set only when the jmp has been added as a trailing unconditional jump in `CodeGen::genCodeForBBlist()`. These marked jumps can be identified once instruction groups have been generated, but this must be done before exception handling code is generated, because the instruction group lengths need to be correct for the clause length calculations to be correct. I have added a new function, `emitter::emitRemoveJumpToNextInst`, which looks through the jump list for any instructions which are marked and attempts to identify if they are: `call` instruction

If these conditions are met, the jump is removed from the emitter jump list and the instruction group, then the instruction group is flagged as needing a size recalculation and as having an updated instruction count. Once this is done, the remaining jumps are sized and updated as normal later by `emitter::emitJumpDistBind`.

The second change needs to be done at the codegen level, and it involves updating any live ranges which include the instruction which has been removed. I have added a new function, `genUpdateLiveRangesForTruncatedIGs`, which is run after `emitRemoveJumpToNextInst` if that function returns true, indicating that jumps have been removed. At that point we know that all instructions that are going to be removed have been, and we can update the live ranges. Live range checks need a new count accessor function because they haven't been finalized yet at the point when the count is needed.

Tests run locally only fail where they do on master. The spmi replays are clean. The spmi diffs are positive. Local spmi asmdiff results show that this generally lowers code sizes by a tiny amount everywhere; whether this is a throughput improvement in all cases I don't know, but logically it is better not to have the jmp present.
All regressions are caused by alignment changes.