Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Remove unneeded unconditional jumps #69041

Merged
merged 22 commits into from
Jun 7, 2022
Merged

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented May 9, 2022

fixes #4326

When compilng some simple methods a jmp instruction is generated that ends up jumping to the next instruction which is not needed. The example provided is:
N.B. this specific example no longer replicates on the current main branch due to recent changes in basic block reordering, it's a nice clean example of what can occur though.

public static int Main(string[] args)
{
    return Math.Max(args.Length, 42);
}

This produced the assembly:

G_M22359_IG01:              ;; offset=0000H
						;; size=0 bbWeight=1    PerfScore 0.00
G_M22359_IG02:              ;; offset=0000H
       8B4108               mov      eax, dword ptr [rcx+8]
       83F82A               cmp      eax, 42
       7D07                 jge      SHORT G_M22359_IG04
						;; size=8 bbWeight=1    PerfScore 3.25
G_M22359_IG03:              ;; offset=0008H
       B82A000000           mov      eax, 42
+      EB00                 jmp      SHORT G_M22359_IG04
						;; size=7 bbWeight=0.18 PerfScore 0.41
G_M22359_IG04:              ;; offset=000FH
       C3                   ret      

where you can see that the jmp to SHORT G_M22359_IG04 is not needed because that is the next instruction, we will save 2 bytes and possibly some cpu jump tracking resources by not including the jump.

The jump is added in the emitter phase and cannot be seen before that so the changes i have made are all very late in the compilation process. The jump exists because a basic block between the blocks that create instruction groups 3 and 4 generates no instructions. Emission of instructions is linear so at the time when IG03 is generated there is a jump present from BB03 to BB05 and this causes an unconditional jump to be generated at the end of IG03. When BB04 is processed it generates no instructions because there is no need anything and the process moves on to BB05 (which will generate IG04) but we're left with a weird looking jump.

I have added a flag to the instrDesc which is set only when the jmp has been added as a trailing unconditional jump in CodeGen::genCodeForBBlist(). These marked jumps can be identified once instruction groups have been generated but this must be done before exception handling code is generated because the instruction group lengths need to be correct for the clause length calculations to be correct. I have added a new function emitter:emitRemoveJumpToNextInst which looks through the jump list for any instructions which are marked and attempts to identify if they are:

  1. the last instruction in the instruction group
  2. has a target which is the next instruction group
  3. not the instruction immediately following a call instruction

if these conditions are met the jump is removed from the emitter jump list and the instruction group then the instruction group is flagged as needing a size recalculation and as having an update instruction count. Once this is done the remaining jumps are sized and updated as normal later by emitter::emitJumpDistBind

The second change needs to be done at the codegen level and it involves updating any live ranges which include the instruction which has been removed. I have added a new function genUpdateLiveRangesForTruncatedIGs which is run after emitRemoveJumpToNextInst if that function returns true indicating that jumps have been removed. At that point we know that all instructions that are going to be removed have been and we can update the live ranges. Live range checks need a new count accessor function because they haven't been finalized yet at the point when the count is needed.

Test run locally only fail where they do on master. The spmi replays are clean. The spmi diffs are positive. Local spmi asmdif result show that this generally lower code sizes by a tiny amount everywhere, whether this is a throughput improvement in all cases i don't know but logically it is better to not have the jmp present.

3313 total methods with Code Size differences (3313 improved, 0 regressed), 4 unchanged.
220 total files with Code Size differences (219 improved, 1 regressed), 4 unchanged.
47773 total files with Code Size differences (47773 improved, 0 regressed), 0 unchanged.
472 total files with Code Size differences (472 improved, 0 regressed), 0 unchanged.
811 total files with Code Size differences (810 improved, 1 regressed), 1 unchanged.
1326 total files with Code Size differences (1325 improved, 1 regressed), 1 unchanged.

all regressions are caused by alignment changes.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 9, 2022
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label May 9, 2022
@ghost
Copy link

ghost commented May 9, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #4326

attempt 2

Author: Wraith2
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@Wraith2
Copy link
Contributor Author

Wraith2 commented May 9, 2022

@dotnet/jit-contrib be gentle 😁 .

@AndyAyersMS
Copy link
Member

@kunalspathak seems like this intersects with your recent work, so suggest you take the lead here.

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long; that seems like it might be less complicated overall and less surgery would be needed?

@kunalspathak
Copy link
Member

Thanks for doing this. Few questions:

  • I don't see any asmdiff changes in arm64 which is little surprising. Can you double check why that is the case?
  • The TP has increased 1.5%, have we considered any other cheaper way to do this as Andy points out?
  • It might be tricky to pull this off, but just a thought - Is there a way to implement some portion of the logic in emitJumpDistBind() along with what Andy has suggested of zeroing the jumps and then recognizing them in emitJumpDistBind() without having to add a flag IGF_UPD_ICOUNT?

@jakobbotsch
Copy link
Member

  • The TP has increased 1.5%, have we considered any other cheaper way to do this as Andy points out?

As the warning in the summary states the native toolset was updated between the baseline and diff, so we should get new numbers before we make any conclusions about this. @Wraith2 new baselines have already built after merged jit changes yesterday, can you merge from main and push a new commit?

@Wraith2
Copy link
Contributor Author

Wraith2 commented May 10, 2022

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long

I didn't think of trying that. I don't know what would happen if it tired to write out a zero length instruction. Instruction has more properties than simply it's size so I think this is cleaner even though it's harder. At one point while working on it I was getting assertions about unreferenced labels so it I leave them in place it could spring leaks elsewhere. It would remove the requirement for live range fixup.

I don't see any asmdiff changes in arm64 which is little surprising. Can you double check why that is the case?

That's deliberate. The changes are xarch only because I have no ability to run and thus debug on arm hardware. In theory there shouldn't be much to change to make it work on arm apart from uses of instruction size accessors. From what I read the call with following instruction requirement also applies to arm64.

It might be tricky to pull this off, but just a thought - Is there a way to implement some portion of the logic in emitJumpDistBind()

My first attempt to do this change did exactly that. It didn't work because it caused problems with the exception handling clauses. I considered trying to fixup the exception handling (like I've had to do with live ranges) but that is invasive and likely to introduce more bugs than simply moving this into it's own function earlier. The code in emitJumpDistBind is also very complicated and integrating the extra code was difficult and disruptive.

There is a way to linearize the performance of emitRemoveJumpToNextInst but I decided against doing it at the point because I've already sunk many weeks into this and didn't want to spend more time without getting some form of review.

Since you're looking at this @kunalspathak can you keep in mind that it might interact with nop padding for loop alignment? It's all done before padding is added but I don't know if reducing the size of the instruction group might cause the requires-padding flag to need to be added/removed. There is also the future possibility of changing call, jmp pairs into call, nop pairs which could interact weirdly with padding.

@Wraith2
Copy link
Contributor Author

Wraith2 commented May 10, 2022

@Wraith2 new baselines have already built after merged jit changes yesterday, can you merge from main and push a new commit?

Rebased and pushed.

@jakobbotsch
Copy link
Member

The new TP numbers look much more reasonable.

It is mildy concerning that we saw a 1.5% increase before simply by updating the compiler (which one would hope should decrease number of instructions executed). Of course number of instructions does not necessarily translate to cycles, so hopefully that is the case here.

@Wraith2
Copy link
Contributor Author

Wraith2 commented May 10, 2022

Those are better numbers. As I said there is also a way to make at least one part of it faster and I'm intending to refine this if it's a useful set of changes. My worry was that I didn't want to spend more time on it and end up with a huge change set that simply wasn't worth using.

@BruceForstall
Copy link
Member

The jump exists because a basic block between the blocks that create instruction groups 3 and 4 generates no instructions.

Just curious: what is an example of the BB that doesn't generate instructions?

I've always wondered if we should have a simple flow graph cleanup phase just before codegen.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about how complex this is.

src/coreclr/jit/emit.h Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Show resolved Hide resolved
src/coreclr/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegencommon.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegencommon.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegencommon.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegencommon.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Outdated Show resolved Hide resolved
@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label May 10, 2022
@BruceForstall
Copy link
Member

@Wraith2 I am curious if you pursued leaving the jump isntr dscs in place but magically shrinking them to be zero bytes long

I agree about considering this. Possibly bash them from jumps to NOPs, if necessary?

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label May 10, 2022
@Wraith2
Copy link
Contributor Author

Wraith2 commented May 10, 2022

Just curious: what is an example of the BB that doesn't generate instructions?

This is the original example, as noted in the first post this particular one no longer occurs because block ordering has recently changed which eliminates it. jitdump.txt

To quote @AndyAyersMS from the discord discussion about it:

I see what you're saying, BB03 is not technically empty; it contains a mov rax, rax which we decide we don't need to emit. Hence BB02 can fall through to BB04 instead of jumping there.

@Wraith2
Copy link
Contributor Author

Wraith2 commented May 10, 2022

One thing that bothers me is that in the void emitter::emitIns_J(instruction ins, BasicBlock* dst, int instrCount /* = 0 */, bool isJmpAlways /* = false */) sig the name isJmpAlways feels not descriptive enough but I can't think of a better name. Any suggestions?

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Just to summarize my individual comments:

  1. Use emitIsLastInsCall() to determine which jumps are worth removing. That would eliminate the logic of checking if last instruction was a call/align inside emitRemoveJumpToNextInst() method.
  2. Have a dedicated list of emitUncondJumpList that will reduce the number of jumps you will have to iterate in emitRemoveJumpToNextInst().
  3. Process all the jumps in emitUncondJumpList in a single pass inside emitRemoveJumpToNextInst().
  4. During genUpdateLiveRangesForTruncatedIGs(), just iterate the emitUncondJumpList() and get the corresponding IGs that might have modified ICount. (I am not sure of overall concept of live range update and would get review done by @BruceForstall ).
  5. During outputting, make sure that we don't see a jump that branches to the next instruction but was not visited in earlier phases (using some DEBUG only variable or equivalent).

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels May 10, 2022
@kunalspathak
Copy link
Member

The code is only active on amd64

Ah, right. Why is it not TARGET_XARCH?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jun 3, 2022

Ah, right. Why is it not TARGET_XARCH?

My limited capability and resources. I had enough trouble getting it working for x64 that I didn't want to try and include other targets like arm or x86. In theory it should work on x86 if I change that to TARGET_XARCH.

How about we let this CI run to completion and see if the regression you mentioned earlier is still present and if it is address that problem. Once everything looks ready on the x64 side then I'll flip the define and we'll see what happens on x86?

@kunalspathak
Copy link
Member

I'll flip the define and we'll see what happens on x86?

Sounds good.

@kunalspathak
Copy link
Member

I still see the regression in benchmarks. You should be easily able to get jitdump of it using superpmi

Top method regressions (bytes):
          12 (0.15 % of base) : 13400.dasm - Jil.Deserialize.Methods:_ReadISO8601DateWithOffset(System.IO.TextReader,System.Char[]):System.DateTimeOffset

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jun 4, 2022

I just had a look at it and I'm not sure why it's counted as a regression, it seems better overall with the only odd thing being the lack of it printing alignment
13400.zip

Debugging through the disasm code I think the new behaviour is correct and the old was not. Specifically looking at IG05 from the dumps attached the current jit produces this disasm:

12422_IG05:        ; gcrefRegs=00000040 {rsi}, byrefRegs=00000080 {rdi}, byref, isz
       test     r14b, r14b
       jne      G_M12422_IG143
       lea      r15d, [rbx-1]
       mov      r14d, 1
       jmp      SHORT G_M12422_IG07
       align    [0 bytes for IG23]
						;; size=21 bbWeight=0.50 PerfScore 2.00

note that the top line where we see the ig flags there is no align flag so the group is not aligned. I think it was originally going to be aligned so an instruction was added but the ig flag was removed. So in this case it is correct to not print the alignment.

The code responsible for this is:

if (ins == INS_align)
{
sz = sizeof(instrDescAlign);
// IG can be marked as not needing alignment after emitting align instruction
// In such case, skip outputting alignment.
if (ig->endsWithAlignInstr())
{
dst = emitOutputAlign(ig, id, dst);
}
#ifdef DEBUG
else
{
// If the IG is not marked as need alignment, then the code size
// should be zero i.e. no padding needed.
assert(id->idCodeSize() == 0);
}
#endif
break;

Where ig->endsWithAlignInstr() returns true if the align flag is set on the group. For IG05 it isn't, so we don't get the alignment printed and the assertion that the codesize ==0 is true and everything is ok.

Why is the current jit displaying alignment when it shouldn't? I don't know and I don't think I changed/fixed this.


#ifdef DEBUG
if (emitComp->opts.disAsm || emitComp->verbose)
if ((emitComp->opts.disAsm || emitComp->verbose) && !emitInstHasNoCode(id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some debugging of your branch, found out that the addition of !emitInstHasNoCode(id) is the reason why we stopped displaying align instruction in the disassembly and that is the reason we were seeing the 3 bytes difference.

image

You will have to separate out the conditions for jmp and align and use the combination only at common places but use the jmp condition in jump specific code-paths.

Here is something that you might want to consider: kunalspathak@aaa8f70

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. pushed.

If the group isn't marked as align and there zero alignment bytes to be allocated what is the purpose of printing the instruction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retaining or removing of align instructions are heuristic sensitive vs. jmp instructions removal which is correctness sensitive. During perf investigation, it is often useful to see a sensitive loop that was marked for alignment but was later decided to not do it by spotting those align [0 bytes] in disassembly. It turns out useful to spot such cases and adjust the heuristics to make them align. We also add NOP compensation (if needed) until the last IG that needs alignment. By printing the zero size align, it help us to reason out the purpose of compensation instructions added.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jun 4, 2022

CI with the suggested change has run. The regression is still present but it is now including the alignment. 13400.zip . The sizes of some groups show up in the diffs so I extracted the ig sizes and summed them.

file reported size calculated size reported prolog size
old 7827 7868 65
new 7839 7839 67

So the new file is making some sense but if you use the old total value then the new one looks like a regression.
The prolog group in the old file was the same as the new file, in both it reports the size as 67 bytes.
I'm tempted to say the regression isn't real because the numbers in the base are wrong.

The 32bit diff was also run. I haven't looked into the regressions it contains yet.

@kunalspathak
Copy link
Member

Some of the regressions are expected because removal of jump might force change the alignment decisions and we would add more alignment where previously, we didn't add one.

image

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Some minor comment changes.

@BruceForstall, I think this is ready to be merged. Do you want to take another look? If not, I will merge this once the comments are addressed and CI is green.

src/coreclr/jit/codegenarmarch.cpp Show resolved Hide resolved
src/coreclr/jit/codegenlinear.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegenlinear.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emitxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emitxarch.cpp Show resolved Hide resolved
src/coreclr/jit/emitxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Show resolved Hide resolved
@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jun 6, 2022
@kunalspathak
Copy link
Member

Nice diffs on windows/x64:

image

@kunalspathak
Copy link
Member

/azp run runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your efforts.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jun 7, 2022

Woo! thanks. It's been a learning experience. 😁

@Wraith2 Wraith2 deleted the remove-jmp2 branch June 7, 2022 21:23
shushanhf added a commit to shushanhf/runtime that referenced this pull request Jun 10, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RyuJIT doesn't eliminate a jump to the next instruction
6 participants