
[RyuJIT] Emit shlx, sarx, shrx on x64 #67182

Merged 15 commits on May 20, 2022.
28 changes: 28 additions & 0 deletions src/coreclr/jit/codegenxarch.cpp
@@ -4403,6 +4403,34 @@ void CodeGen::genCodeForShift(GenTree* tree)
inst_RV_SH(ins, size, tree->GetRegNum(), shiftByValue);
}
}
#if defined(TARGET_64BIT)
Member: same here... why is it only for x64?

Member Author: Handling for x64 only for now.

else if (tree->OperIsShift() && compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2))
{
// Try to emit shlx, sarx, shrx if BMI2 is available instead of mov+shl, mov+sar, mov+shr.
switch (tree->OperGet())
{
case GT_LSH:
ins = INS_shlx;
break;

case GT_RSH:
ins = INS_sarx;
break;

case GT_RSZ:
ins = INS_shrx;
break;

default:
unreached();
}

regNumber shiftByReg = shiftBy->GetRegNum();
emitAttr size = emitTypeSize(tree);
// The order of operandReg and shiftByReg are swapped to follow shlx, sarx and shrx encoding spec.
GetEmitter()->emitIns_R_R_R(ins, size, tree->GetRegNum(), shiftByReg, operandReg);
Member (@tannergooding, Mar 27, 2022): Might be worth a note that we don't currently support the contained form due to more complex changes being needed in the emitter.

Ideally we'd fix up the emitter and use inst_RV_RV_TT instead so that we can emit shlx r32a, r/m32, r32b. Someone would need to walk through the relevant IF_RWR_RRD_*RD formats and ensure that it's all handled correctly (noting that technically the format is IF_RWR_*RD_RRD, but that should be the same as IF_RWR_RRD_*RD with op1/op2 swapped, like we do for a couple of other BMI2 instructions, namely bextr and bzhi).

Member Author: Opened #67314.

Member: As @tannergooding said, let's have a comment about containment, and also specify here that the operands are swapped because of the way operands are encoded in these three instructions.

Member Author: Added comments.

}
#endif
else
{
// We must have the number of bits to shift stored in ECX, since we constrained this node to
43 changes: 40 additions & 3 deletions src/coreclr/jit/emitxarch.cpp
@@ -749,6 +749,9 @@ bool emitter::TakesRexWPrefix(instruction ins, emitAttr attr)
case INS_pdep:
case INS_pext:
case INS_rorx:
case INS_shlx:
Member: As per the changes at other places, we have the code supported only for TARGET_64BIT, but here it is under TARGET_ARM64. What is the difference, and should it be consistent?

Member Author: It is under #ifdef TARGET_AMD64, not ARM64. So I believe it is correct.

case INS_sarx:
case INS_shrx:
return true;
default:
return false;
@@ -987,17 +990,32 @@ unsigned emitter::emitOutputRexOrVexPrefixIfNeeded(instruction ins, BYTE* dst, c
case INS_rorx:
case INS_pdep:
case INS_mulx:
// TODO: Unblock when enabled for x86
#ifdef TARGET_AMD64
case INS_shrx:
Member: Is this code path also for x86?

Member Author: Added #ifdef TARGET_64BIT.

#endif
{
vexPrefix |= 0x03;
break;
}

case INS_pext:
// TODO: Unblock when enabled for x86
#ifdef TARGET_AMD64
case INS_sarx:
#endif
{
vexPrefix |= 0x02;
break;
}

// TODO: Unblock when enabled for x86
#ifdef TARGET_AMD64
case INS_shlx:
{
vexPrefix |= 0x01;
break;
}
#endif
default:
{
vexPrefix |= 0x00;
@@ -1484,6 +1502,11 @@ bool emitter::emitInsCanOnlyWriteSSE2OrAVXReg(instrDesc* id)
case INS_pextrw:
case INS_pextrw_sse41:
case INS_rorx:
#ifdef TARGET_AMD64
case INS_shlx:
case INS_sarx:
case INS_shrx:
#endif
{
// These SSE instructions write to a general purpose integer register.
return false;
@@ -9519,9 +9542,13 @@ void emitter::emitDispIns(
assert(IsThreeOperandAVXInstruction(ins));
regNumber reg2 = id->idReg2();
regNumber reg3 = id->idReg3();
if (ins == INS_bextr || ins == INS_bzhi)
if (ins == INS_bextr || ins == INS_bzhi
#ifdef TARGET_AMD64
|| ins == INS_shrx || ins == INS_shlx || ins == INS_sarx
#endif
)
{
// BMI bextr and bzhi encodes the reg2 in VEX.vvvv and reg3 in modRM,
// BMI bextr,bzhi, shrx, shlx and sarx encode the reg2 in VEX.vvvv and reg3 in modRM,
Member: Suggested change:

- // BMI bextr,bzhi, shrx, shlx and sarx encode the reg2 in VEX.vvvv and reg3 in modRM,
+ // BMI bextr, bzhi, shrx, shlx and sarx encode the reg2 in VEX.vvvv and reg3 in modRM,

We need to have a similar comment where we swap the operands, as I pointed out.

Member Author: Added comment.

Member: nit: You have already added a comment in codegenxarch.cpp. No need here; the above comment covers it.

Member Author: Removed.

// which is different from most of other instructions
regNumber tmp = reg2;
reg2 = reg3;
@@ -16323,6 +16350,16 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
break;
}

#ifdef TARGET_AMD64
case INS_shlx:
case INS_sarx:
case INS_shrx:
{
result.insLatency += PERFSCORE_LATENCY_1C;
result.insThroughput = PERFSCORE_THROUGHPUT_2X;
break;
}
#endif
default:
// unhandled instruction insFmt combination
perfScoreUnhandledInstruction(id, &result);
5 changes: 5 additions & 0 deletions src/coreclr/jit/instrsxarch.h
@@ -605,6 +605,11 @@ INST3(pdep, "pdep", IUM_WR, BAD_CODE, BAD_CODE,
INST3(pext, "pext", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0xF5), INS_Flags_IsDstDstSrcAVXInstruction) // Parallel Bits Extract
INST3(bzhi, "bzhi", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0xF5), Resets_OF | Writes_SF | Writes_ZF | Undefined_AF | Undefined_PF | Writes_CF | INS_Flags_IsDstDstSrcAVXInstruction) // Zero High Bits Starting with Specified Bit Position
INST3(mulx, "mulx", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0xF6), INS_Flags_IsDstDstSrcAVXInstruction) // Unsigned Multiply Without Affecting Flags
#ifdef TARGET_AMD64
INST3(shlx, "shlx", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0xF7), INS_Flags_IsDstDstSrcAVXInstruction) // Shift Logical Left Without Affecting Flags
INST3(sarx, "sarx", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF3, 0x0F, 0x38, 0xF7), INS_Flags_IsDstDstSrcAVXInstruction) // Shift Arithmetic Right Without Affecting Flags
INST3(shrx, "shrx", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF2, 0x0F, 0x38, 0xF7), INS_Flags_IsDstDstSrcAVXInstruction) // Shift Logical Right Without Affecting Flags
#endif

INST3(LAST_BMI_INSTRUCTION, "LAST_BMI_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_FLAGS_None)

2 changes: 1 addition & 1 deletion src/coreclr/jit/lowerxarch.cpp
@@ -4850,7 +4850,7 @@ void Lowering::ContainCheckShiftRotate(GenTreeOp* node)
assert(source->OperGet() == GT_LONG);
MakeSrcContained(node, source);
}
#endif // !TARGET_X86
#endif
Member: Revert this change.

Member Author: This has already been removed.

Member: I mean put it back (revert the deletion of the comment // !TARGET_X86).


GenTree* shiftBy = node->gtOp2;
if (IsContainableImmed(node, shiftBy) && (shiftBy->AsIntConCommon()->IconValue() <= 255) &&
11 changes: 11 additions & 0 deletions src/coreclr/jit/lsraxarch.cpp
@@ -925,6 +925,17 @@ int LinearScan::BuildShiftRotate(GenTree* tree)
{
assert(shiftBy->OperIsConst());
}
#if defined(TARGET_64BIT)
Member: Is this instruction only applicable for x64? I thought it is also valid for x86? @tannergooding?

Member Author: I wonder why it worked for x86 without changing this part of the code.

Contributor: Isn't InstructionSet_BMI2 the 32-bit version, and for 64-bit you'd want InstructionSet_BMI2_X64? If so, and you checked 64-bit correctly elsewhere and then 32-bit here, that would explain it working for 32-bit.

Member Author: Double-checked that my change handles only the x64 case. Will open an issue to handle x86.

else if (tree->OperIsShift() && !tree->isContained() &&
compiler->compOpportunisticallyDependsOn(InstructionSet_BMI2))
{
// It handles all register forms, but it does not handle contained form for memory operand.
srcCount += BuildOperandUses(source, srcCandidates);
srcCount += BuildOperandUses(shiftBy, srcCandidates);
BuildDef(tree, dstCandidates);
return srcCount;
}
#endif
else
{
srcCandidates = allRegs(TYP_INT) & ~RBM_RCX;