JIT: Remove unneeded unconditional jumps #69041

Wraith2 · 2022-05-09T00:38:43Z

When compilng some simple methods a jmp instruction is generated that ends up jumping to the next instruction which is not needed. The example provided is:
N.B. this specific example no longer replicates on the current main branch due to recent changes in basic block reordering, it's a nice clean example of what can occur though.

public static int Main(string[] args)
{
    return Math.Max(args.Length, 42);
}

This produced the assembly:

G_M22359_IG01:              ;; offset=0000H
						;; size=0 bbWeight=1    PerfScore 0.00
G_M22359_IG02:              ;; offset=0000H
       8B4108               mov      eax, dword ptr [rcx+8]
       83F82A               cmp      eax, 42
       7D07                 jge      SHORT G_M22359_IG04
						;; size=8 bbWeight=1    PerfScore 3.25
G_M22359_IG03:              ;; offset=0008H
       B82A000000           mov      eax, 42
+      EB00                 jmp      SHORT G_M22359_IG04
						;; size=7 bbWeight=0.18 PerfScore 0.41
G_M22359_IG04:              ;; offset=000FH
       C3                   ret

where you can see that the jmp to SHORT G_M22359_IG04 is not needed because that is the next instruction, we will save 2 bytes and possibly some cpu jump tracking resources by not including the jump.

The jump is added in the emitter phase and cannot be seen before that so the changes i have made are all very late in the compilation process. The jump exists because a basic block between the blocks that create instruction groups 3 and 4 generates no instructions. Emission of instructions is linear so at the time when IG03 is generated there is a jump present from BB03 to BB05 and this causes an unconditional jump to be generated at the end of IG03. When BB04 is processed it generates no instructions because there is no need anything and the process moves on to BB05 (which will generate IG04) but we're left with a weird looking jump.

I have added a flag to the instrDesc which is set only when the jmp has been added as a trailing unconditional jump in CodeGen::genCodeForBBlist(). These marked jumps can be identified once instruction groups have been generated but this must be done before exception handling code is generated because the instruction group lengths need to be correct for the clause length calculations to be correct. I have added a new function emitter:emitRemoveJumpToNextInst which looks through the jump list for any instructions which are marked and attempts to identify if they are:

the last instruction in the instruction group
has a target which is the next instruction group
not the instruction immediately following a call instruction

if these conditions are met the jump is removed from the emitter jump list and the instruction group then the instruction group is flagged as needing a size recalculation and as having an update instruction count. Once this is done the remaining jumps are sized and updated as normal later by emitter::emitJumpDistBind

The second change needs to be done at the codegen level and it involves updating any live ranges which include the instruction which has been removed. I have added a new function genUpdateLiveRangesForTruncatedIGs which is run after emitRemoveJumpToNextInst if that function returns true indicating that jumps have been removed. At that point we know that all instructions that are going to be removed have been and we can update the live ranges. Live range checks need a new count accessor function because they haven't been finalized yet at the point when the count is needed.

Test run locally only fail where they do on master. The spmi replays are clean. The spmi diffs are positive. Local spmi asmdif result show that this generally lower code sizes by a tiny amount everywhere, whether this is a throughput improvement in all cases i don't know but logically it is better to not have the jmp present.

3313 total methods with Code Size differences (3313 improved, 0 regressed), 4 unchanged.
220 total files with Code Size differences (219 improved, 1 regressed), 4 unchanged.
47773 total files with Code Size differences (47773 improved, 0 regressed), 0 unchanged.
472 total files with Code Size differences (472 improved, 0 regressed), 0 unchanged.
811 total files with Code Size differences (810 improved, 1 regressed), 1 unchanged.
1326 total files with Code Size differences (1325 improved, 1 regressed), 1 unchanged.

all regressions are caused by alignment changes.

ghost · 2022-05-09T00:38:50Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #4326

attempt 2

Author:	Wraith2
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

Wraith2 · 2022-05-09T12:42:00Z

@dotnet/jit-contrib be gentle 😁 .

AndyAyersMS · 2022-05-10T02:41:48Z

@kunalspathak seems like this intersects with your recent work, so suggest you take the lead here.

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long; that seems like it might be less complicated overall and less surgery would be needed?

kunalspathak · 2022-05-10T06:14:57Z

Thanks for doing this. Few questions:

I don't see any asmdiff changes in arm64 which is little surprising. Can you double check why that is the case?
The TP has increased 1.5%, have we considered any other cheaper way to do this as Andy points out?
It might be tricky to pull this off, but just a thought - Is there a way to implement some portion of the logic in emitJumpDistBind() along with what Andy has suggested of zeroing the jumps and then recognizing them in emitJumpDistBind() without having to add a flag IGF_UPD_ICOUNT?

jakobbotsch · 2022-05-10T07:59:58Z

The TP has increased 1.5%, have we considered any other cheaper way to do this as Andy points out?

As the warning in the summary states the native toolset was updated between the baseline and diff, so we should get new numbers before we make any conclusions about this. @Wraith2 new baselines have already built after merged jit changes yesterday, can you merge from main and push a new commit?

Wraith2 · 2022-05-10T08:22:30Z

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long

I didn't think of trying that. I don't know what would happen if it tired to write out a zero length instruction. Instruction has more properties than simply it's size so I think this is cleaner even though it's harder. At one point while working on it I was getting assertions about unreferenced labels so it I leave them in place it could spring leaks elsewhere. It would remove the requirement for live range fixup.

I don't see any asmdiff changes in arm64 which is little surprising. Can you double check why that is the case?

That's deliberate. The changes are xarch only because I have no ability to run and thus debug on arm hardware. In theory there shouldn't be much to change to make it work on arm apart from uses of instruction size accessors. From what I read the call with following instruction requirement also applies to arm64.

It might be tricky to pull this off, but just a thought - Is there a way to implement some portion of the logic in emitJumpDistBind()

My first attempt to do this change did exactly that. It didn't work because it caused problems with the exception handling clauses. I considered trying to fixup the exception handling (like I've had to do with live ranges) but that is invasive and likely to introduce more bugs than simply moving this into it's own function earlier. The code in emitJumpDistBind is also very complicated and integrating the extra code was difficult and disruptive.

There is a way to linearize the performance of emitRemoveJumpToNextInst but I decided against doing it at the point because I've already sunk many weeks into this and didn't want to spend more time without getting some form of review.

Since you're looking at this @kunalspathak can you keep in mind that it might interact with nop padding for loop alignment? It's all done before padding is added but I don't know if reducing the size of the instruction group might cause the requires-padding flag to need to be added/removed. There is also the future possibility of changing call, jmp pairs into call, nop pairs which could interact weirdly with padding.

Wraith2 · 2022-05-10T08:28:25Z

@Wraith2 new baselines have already built after merged jit changes yesterday, can you merge from main and push a new commit?

Rebased and pushed.

jakobbotsch · 2022-05-10T11:05:34Z

The new TP numbers look much more reasonable.

It is mildy concerning that we saw a 1.5% increase before simply by updating the compiler (which one would hope should decrease number of instructions executed). Of course number of instructions does not necessarily translate to cycles, so hopefully that is the case here.

Wraith2 · 2022-05-10T11:21:48Z

Those are better numbers. As I said there is also a way to make at least one part of it faster and I'm intending to refine this if it's a useful set of changes. My worry was that I didn't want to spend more time on it and end up with a huge change set that simply wasn't worth using.

BruceForstall · 2022-05-10T17:16:41Z

The jump exists because a basic block between the blocks that create instruction groups 3 and 4 generates no instructions.

Just curious: what is an example of the BB that doesn't generate instructions?

I've always wondered if we should have a simple flow graph cleanup phase just before codegen.

BruceForstall

I'm worried about how complex this is.

src/coreclr/jit/emit.h

src/coreclr/jit/emit.cpp

src/coreclr/jit/codegencommon.cpp

src/coreclr/jit/emit.cpp

src/coreclr/jit/codegencommon.cpp

src/coreclr/jit/emit.cpp

BruceForstall · 2022-05-10T17:46:47Z

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long

I agree about considering this. Possibly bash them from jumps to NOPs, if necessary?

Wraith2 · 2022-05-10T19:46:36Z

Just curious: what is an example of the BB that doesn't generate instructions?

This is the original example, as noted in the first post this particular one no longer occurs because block ordering has recently changed which eliminates it. jitdump.txt

To quote @AndyAyersMS from the discord discussion about it:

I see what you're saying, BB03 is not technically empty; it contains a mov rax, rax which we decide we don't need to emit. Hence BB02 can fall through to BB04 instead of jumping there.

src/coreclr/jit/emit.cpp

src/coreclr/jit/codegencommon.cpp

Wraith2 · 2022-05-10T21:29:37Z

One thing that bothers me is that in the void emitter::emitIns_J(instruction ins, BasicBlock* dst, int instrCount /* = 0 */, bool isJmpAlways /* = false */) sig the name isJmpAlways feels not descriptive enough but I can't think of a better name. Any suggestions?

src/coreclr/jit/emit.cpp

kunalspathak

This looks great. Just to summarize my individual comments:

Use emitIsLastInsCall() to determine which jumps are worth removing. That would eliminate the logic of checking if last instruction was a call/align inside emitRemoveJumpToNextInst() method.
Have a dedicated list of emitUncondJumpList that will reduce the number of jumps you will have to iterate in emitRemoveJumpToNextInst().
Process all the jumps in emitUncondJumpList in a single pass inside emitRemoveJumpToNextInst().
During genUpdateLiveRangesForTruncatedIGs(), just iterate the emitUncondJumpList() and get the corresponding IGs that might have modified ICount. (I am not sure of overall concept of live range update and would get review done by @BruceForstall ).
During outputting, make sure that we don't see a jump that branches to the next instruction but was not visited in earlier phases (using some DEBUG only variable or equivalent).

src/coreclr/jit/emit.cpp

kunalspathak · 2022-06-03T21:52:00Z

The code is only active on amd64

Ah, right. Why is it not TARGET_XARCH?

Wraith2 · 2022-06-03T22:15:24Z

Ah, right. Why is it not TARGET_XARCH?

My limited capability and resources. I had enough trouble getting it working for x64 that I didn't want to try and include other targets like arm or x86. In theory it should work on x86 if I change that to TARGET_XARCH.

How about we let this CI run to completion and see if the regression you mentioned earlier is still present and if it is address that problem. Once everything looks ready on the x64 side then I'll flip the define and we'll see what happens on x86?

kunalspathak · 2022-06-03T23:20:07Z

I'll flip the define and we'll see what happens on x86?

Sounds good.

kunalspathak · 2022-06-03T23:21:24Z

I still see the regression in benchmarks. You should be easily able to get jitdump of it using superpmi

Top method regressions (bytes):
          12 (0.15 % of base) : 13400.dasm - Jil.Deserialize.Methods:_ReadISO8601DateWithOffset(System.IO.TextReader,System.Char[]):System.DateTimeOffset

Wraith2 · 2022-06-04T00:01:31Z

I just had a look at it and I'm not sure why it's counted as a regression, it seems better overall with the only odd thing being the lack of it printing alignment
13400.zip

Debugging through the disasm code I think the new behaviour is correct and the old was not. Specifically looking at IG05 from the dumps attached the current jit produces this disasm:

12422_IG05:        ; gcrefRegs=00000040 {rsi}, byrefRegs=00000080 {rdi}, byref, isz
       test     r14b, r14b
       jne      G_M12422_IG143
       lea      r15d, [rbx-1]
       mov      r14d, 1
       jmp      SHORT G_M12422_IG07
       align    [0 bytes for IG23]
						;; size=21 bbWeight=0.50 PerfScore 2.00

note that the top line where we see the ig flags there is no align flag so the group is not aligned. I think it was originally going to be aligned so an instruction was added but the ig flag was removed. So in this case it is correct to not print the alignment.

The code responsible for this is:

runtime/src/coreclr/jit/emitxarch.cpp

Lines 13581 to 13598 in c80a49d

    
                       if (ins == INS_align) 
        
                       { 
        
                           sz = sizeof(instrDescAlign); 
        
                           // IG can be marked as not needing alignment after emitting align instruction 
        
                           // In such case, skip outputting alignment. 
        
                           if (ig->endsWithAlignInstr()) 
        
                           { 
        
                               dst = emitOutputAlign(ig, id, dst); 
        
                           } 
        
           #ifdef DEBUG 
        
                           else 
        
                           { 
        
                               // If the IG is not marked as need alignment, then the code size 
        
                               // should be zero i.e. no padding needed. 
        
                               assert(id->idCodeSize() == 0); 
        
                           } 
        
           #endif 
        
                           break;

Where ig->endsWithAlignInstr() returns true if the align flag is set on the group. For IG05 it isn't, so we don't get the alignment printed and the assertion that the codesize ==0 is true and everything is ok.

Why is the current jit displaying alignment when it shouldn't? I don't know and I don't think I changed/fixed this.

kunalspathak · 2022-06-04T06:03:54Z

src/coreclr/jit/emitxarch.cpp


 #ifdef DEBUG
-    if (emitComp->opts.disAsm || emitComp->verbose)
+    if ((emitComp->opts.disAsm || emitComp->verbose) && !emitInstHasNoCode(id))


After some debugging of your branch, found out that the addition of !emitInstHasNoCode(id) is the reason why we stopped displaying align instruction in the disassembly and that is the reason we were seeing the 3 bytes difference.

You will have to separate out the conditions for jmp and align and use the combination only at common places but use the jmp condition in jump specific code-paths.

Here is something that you might want to consider: kunalspathak@aaa8f70

Ok. pushed.

If the group isn't marked as align and there zero alignment bytes to be allocated what is the purpose of printing the instruction?

Retaining or removing of align instructions are heuristic sensitive vs. jmp instructions removal which is correctness sensitive. During perf investigation, it is often useful to see a sensitive loop that was marked for alignment but was later decided to not do it by spotting those align [0 bytes] in disassembly. It turns out useful to spot such cases and adjust the heuristics to make them align. We also add NOP compensation (if needed) until the last IG that needs alignment. By printing the zero size align, it help us to reason out the purpose of compensation instructions added.

Wraith2 · 2022-06-04T20:35:46Z

CI with the suggested change has run. The regression is still present but it is now including the alignment. 13400.zip . The sizes of some groups show up in the diffs so I extracted the ig sizes and summed them.

file	reported size	calculated size	reported prolog size
old	7827	7868	65
new	7839	7839	67

So the new file is making some sense but if you use the old total value then the new one looks like a regression.
The prolog group in the old file was the same as the new file, in both it reports the size as 67 bytes.
I'm tempted to say the regression isn't real because the numbers in the base are wrong.

The 32bit diff was also run. I haven't looked into the regressions it contains yet.

kunalspathak · 2022-06-06T14:14:30Z

Some of the regressions are expected because removal of jump might force change the alignment decisions and we would add more alignment where previously, we didn't add one.

kunalspathak

nit: Some minor comment changes.

@BruceForstall, I think this is ready to be merged. Do you want to take another look? If not, I will merge this once the comments are addressed and CI is green.

src/coreclr/jit/codegenarmarch.cpp

src/coreclr/jit/codegenlinear.cpp

src/coreclr/jit/emitxarch.cpp

src/coreclr/jit/emit.cpp

kunalspathak · 2022-06-06T18:42:16Z

Nice diffs on windows/x64:

kunalspathak · 2022-06-06T22:03:32Z

/azp run runtime-coreclr jitstress

azure-pipelines · 2022-06-06T22:03:46Z

Azure Pipelines successfully started running 1 pipeline(s).

kunalspathak

LGTM. Thanks for your efforts.

Wraith2 · 2022-06-07T21:23:21Z

Woo! thanks. It's been a learning experience. 😁

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 9, 2022

ghost added the community-contribution Indicates that the PR has been added by a community member label May 9, 2022

jakobbotsch mentioned this pull request May 9, 2022

Formatting jobs failing on Windows after native toolset upgrade #69053

Closed

Wraith2 marked this pull request as ready for review May 9, 2022 12:41

Wraith2 force-pushed the remove-jmp2 branch from 0ff7b07 to 97921fb Compare May 10, 2022 08:27

BruceForstall requested changes May 10, 2022

View reviewed changes

ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label May 10, 2022

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label May 10, 2022