Some minor cleanup post the addition of TYP_SIMD64 and ZMM support - P1 #83044

Merged · 6 commits · Mar 7, 2023
11 changes: 5 additions & 6 deletions docs/design/coreclr/botr/vectors-and-intrinsics.md
@@ -15,15 +15,15 @@ Most hardware intrinsics support is tied to the use of various Vector apis. Ther

- The fixed length float vectors. `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI, and interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between machines or with different environment variable settings read at startup of the process. The `T` type variable may be the following types (`System.Byte`, `System.SByte`, `System.Int16`, `System.UInt16`, `System.Int32`, `System.UInt32`, `System.Int64`, `System.UInt64`, `System.Single`, and `System.Double`), and allows use of integer or floating-point data within a vector. The length and alignment of `Vector<T>` is unknown to the developer at compile time (although discoverable at runtime by using the `Vector<T>.Count` api), and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations if the `Vector<T>.IsHardwareAccelerated` api returns true.
- `Vector64<T>`, `Vector128<T>`, and `Vector256<T>` represent fixed-sized vectors that closely resemble the fixed-sized vectors available in C++. These structures can be used in any code that runs, but very few features are supported directly on these types other than creation. They are used primarily in the processor specific hardware intrinsics apis.
- `Vector64<T>`, `Vector128<T>`, `Vector256<T>`, and `Vector512<T>` represent fixed-sized vectors that closely resemble the fixed-sized vectors available in C++. These structures can be used in any code that runs, but very few features are supported directly on these types other than creation. They are used primarily in the processor specific hardware intrinsics apis.
- Processor specific hardware intrinsics apis such as `System.Runtime.Intrinsics.X86.Ssse3`. These apis map directly to individual instructions or short instruction sequences that are specific to a particular hardware instruction. These apis are only usable on hardware that supports the particular instruction. See https://github.com/dotnet/designs/blob/master/accepted/2018/platform-intrinsics.md for the design of these.
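
To make the families above concrete, here is a minimal C# sketch (an illustration added for this writeup, not part of the PR's diff; the `Vector512.Create` call assumes the `Vector512<T>` surface this change is building out):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

class VectorFamilies
{
    static void Main()
    {
        // Fixed length float vector: same layout as a struct of 4 floats.
        Vector4 v4 = new Vector4(1f, 2f, 3f, 4f);

        // Variable length vector: Count is fixed for the life of the process
        // but only discoverable at runtime.
        Console.WriteLine($"Vector<int>.Count = {Vector<int>.Count}");

        // Fixed-size vectors, used mostly with the hardware intrinsics apis.
        Vector128<float> v128 = Vector128.Create(2f);
        Vector256<float> v256 = Vector256.Create(3f);
        Vector512<float> v512 = Vector512.Create(4f); // assumes the in-progress Vector512<T> api

        Console.WriteLine($"{v4} {v128} {v256} {v512}");
    }
}
```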

# How to use intrinsics apis

There are 3 models for use of intrinsics apis.

1. Usage of `Vector2`, `Vector3`, `Vector4`, and `Vector<T>`. For these, it's always safe to just use the types. The jit will generate code that is as optimal as it can for the logic, and will do so unconditionally.
2. Usage of `Vector64<T>`, `Vector128<T>`, and `Vector256<T>`. These types may be used unconditionally, but are only truly useful when also using the platform specific hardware intrinsics apis.
2. Usage of `Vector64<T>`, `Vector128<T>`, `Vector256<T>`, and `Vector512<T>`. These types may be used unconditionally, but are only truly useful when also using the platform specific hardware intrinsics apis.
3. Usage of platform intrinsics apis. All usage of these apis should be wrapped in an `IsSupported` check of the appropriate kind. Then, within the `IsSupported` check the platform specific api may be used. If multiple instruction sets are used, then the application developer must have checks for each of the instruction sets used.
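
A hedged sketch of models 2 and 3 together (illustrative only, not part of the diff; the `BitHelpers` type is invented for this example). The platform specific api is reached only behind its own `IsSupported` check, with a portable fallback elsewhere:

```csharp
using System.Runtime.Intrinsics.X86;

static class BitHelpers
{
    public static int PopCount(uint value)
    {
        // Model 3: the platform specific api sits behind its own check;
        // the jit folds the branch away once the target hardware is known.
        if (Popcnt.IsSupported)
        {
            return (int)Popcnt.PopCount(value);
        }

        // Portable bit-twiddling fallback, valid on every platform.
        value -= (value >> 1) & 0x55555555u;
        value = (value & 0x33333333u) + ((value >> 2) & 0x33333333u);
        return (int)((((value + (value >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }
}
```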

# Effect of usage of hardware intrinsics on how code is generated
@@ -142,7 +142,7 @@ public class BitOperations
#### Crossgen implementation rules
- Any code which uses an intrinsic from the `System.Runtime.Intrinsics.Arm` or `System.Runtime.Intrinsics.X86` namespace will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
- Any code which uses `Vector<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_SIMD_NGEN_DISALLOWED`)
- Any code which uses `Vector64<T>`, `Vector128<T>` or `Vector256<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
- Any code which uses `Vector64<T>`, `Vector128<T>`, `Vector256<T>`, or `Vector512<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
- Non-platform intrinsics which require more hardware support than the minimum supported hardware capability will not take advantage of that capability. In particular the code generated for Vector2/3/4 is sub-optimal. MethodImplOptions.AggressiveOptimization may be used to disable compilation of this sub-par code.

#### Characteristics which result from rules
@@ -160,10 +160,10 @@ There are 2 sets of instruction sets known to the compiler.
- The baseline instruction set, which defaults to (Sse, Sse2) but may be adjusted via compiler option.
- The optimistic instruction set, which defaults to (Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, and Lzcnt).

Code will be compiled using the optimistic instruction set to drive compilation, but any use of an instruction set beyond the baseline instruction set will be recorded, as will any attempt to use an instruction set beyond the optimistic set if that attempted use has a semantic effect. If the baseline instruction set includes `Avx2` then the size and characteristics of `Vector<T>` are known. Any other decisions about ABI may also be encoded. For instance, it is likely that the ABI of `Vector256<T>` will vary based on the presence/absence of `Avx` support.
Code will be compiled using the optimistic instruction set to drive compilation, but any use of an instruction set beyond the baseline instruction set will be recorded, as will any attempt to use an instruction set beyond the optimistic set if that attempted use has a semantic effect. If the baseline instruction set includes `Avx2` then the size and characteristics of `Vector<T>` are known. Any other decisions about ABI may also be encoded. For instance, it is likely that the ABI of `Vector256<T>` and `Vector512<T>` will vary based on the presence/absence of `Avx` support.

- Any code which uses `Vector<T>` will not be compiled AOT unless the size of `Vector<T>` is known.
- Any code which passes a `Vector256<T>` as a parameter on a Linux or Mac machine will not be compiled AOT unless the support for the `Avx` instruction set is known.
- Any code which passes a `Vector256<T>` or `Vector512<T>` as a parameter on a Linux or Mac machine will not be compiled AOT unless the support for the `Avx` instruction set is known.
- Non-platform intrinsics which require more hardware support than the optimistic supported hardware capability will not take advantage of that capability. MethodImplOptions.AggressiveOptimization may be used to disable compilation of this sub-par code.
- Code which takes advantage of instruction sets in the optimistic set will not be used on a machine which only supports the baseline instruction set.
- Code which attempts to use instruction sets outside of the optimistic set will generate code that will not be used on machines with support for the instruction set.
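
Where the sub-par AOT body matters, the `MethodImplOptions.AggressiveOptimization` escape hatch mentioned above is applied per method; a small sketch (the `VectorMath` type is invented for illustration):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class VectorMath
{
    // Opt this method out of the precompiled body: the runtime jit
    // compiles it directly, with full knowledge of the actual hardware.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static float Dot(Vector4 a, Vector4 b) => Vector4.Dot(a, b);
}
```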
@@ -194,7 +194,6 @@ While the above api exists, it is not expected that general purpose code within
|`compExactlyDependsOn(isa)`| Use when making a decision to use or not use an instruction set when the decision will affect the semantics of the generated code. Should never be used in an assert. | Return whether or not an instruction set is supported. Calls notifyInstructionSetUsage with the result of that computation.
|`compOpportunisticallyDependsOn(isa)`| Use when making an opportunistic decision to use or not use an instruction set. Use when the instruction set usage is a "nice to have optimization opportunity", but do not use when a false result may change the semantics of the program. Should never be used in an assert. | Return whether or not an instruction set is supported. Calls notifyInstructionSetUsage if the instruction set is supported.
|`compIsaSupportedDebugOnly(isa)` | Use to assert whether or not an instruction set is supported | Return whether or not an instruction set is supported. Does not report anything. Only available in debug builds.
|`getSIMDSupportLevel()`| Use when determining what codegen to generate for code that operates on `Vector<T>`, `Vector2`, `Vector3` or `Vector4`.| Queries the instruction sets supported using `compOpportunisticallyDependsOn`, and finds a set of instructions available to use for working with the platform agnostic vector types.
|`getSIMDVectorType()`| Use to get the TYP of the `Vector<T>` type. | Determine the TYP of the `Vector<T>` type. If the TYP may vary on the architecture depending on the supported instruction sets, this function will make sufficient use of the `notifyInstructionSetUsage` api to ensure that the TYP is consistent between compile time and runtime.
|`getSIMDVectorRegisterByteLength()` | Use to get the size of a `Vector<T>` value. | Determine the size of the `Vector<T>` type. If the size may vary on the architecture depending on the supported instruction sets, this function will make sufficient use of the `notifyInstructionSetUsage` api to ensure that the size is consistent between compile time and runtime.
|`maxSIMDStructBytes()`| Get the maximum number of bytes that might be used in a SIMD type during this compilation. | Query the set of instruction sets supported, and determine the largest simd type supported. Use `compOpportunisticallyDependsOn` to perform the queries so that the maximum size needed is the only one recorded.
20 changes: 15 additions & 5 deletions src/coreclr/jit/codegencommon.cpp
@@ -1731,6 +1731,7 @@ void CodeGen::genGenerateMachineCode()

printf(" for ");

#if defined(TARGET_X86)
if (compiler->info.genCPU == CPU_X86)
{
printf("generic X86 CPU");
@@ -1739,9 +1740,14 @@
{
printf("Pentium 4");
}
else if (compiler->info.genCPU == CPU_X64)
#elif defined(TARGET_AMD64)
if (compiler->info.genCPU == CPU_X64)
{
if (compiler->canUseVexEncoding())
if (compiler->canUseEvexEncoding())
{
printf("X64 CPU with AVX512");
}
else if (compiler->canUseVexEncoding())
{
printf("X64 CPU with AVX");
}
@@ -1750,18 +1756,22 @@
printf("X64 CPU with SSE2");
}
}
else if (compiler->info.genCPU == CPU_ARM)
#elif defined(TARGET_ARM)
if (compiler->info.genCPU == CPU_ARM)
{
printf("generic ARM CPU");
}
else if (compiler->info.genCPU == CPU_ARM64)
#elif defined(TARGET_ARM64)
if (compiler->info.genCPU == CPU_ARM64)
{
printf("generic ARM64 CPU");
}
else if (compiler->info.genCPU == CPU_LOONGARCH64)
#elif defined(TARGET_LOONGARCH64)
if (compiler->info.genCPU == CPU_LOONGARCH64)
{
printf("generic LOONGARCH64 CPU");
}
#endif
else
{
printf("unknown architecture");
1 change: 0 additions & 1 deletion src/coreclr/jit/codegenxarch.cpp
@@ -10709,7 +10709,6 @@ void CodeGen::genZeroInitFrameUsingBlockInit(int untrLclHi, int untrLclLo, regNu
assert(compiler->compGeneratingProlog);
assert(genUseBlockInit);
assert(untrLclHi > untrLclLo);
assert(compiler->getSIMDSupportLevel() >= SIMD_SSE2_Supported);
Member: Replacing this assert is not needed? Is the idea that if AVX2 is not supported, then the default is SSE2 now?

Member (Author): The default has been SSE2 for many years (always for x64 and since around .NET Core 2.1 for x86). We require it and consider it part of the baseline ISA, so there is no need to check for its existence outside of supporting the DOTNET_EnableSSE2=0 switch for HWIntrinsic importation.


emitter* emit = GetEmitter();
regNumber frameReg = genFramePointerReg();
5 changes: 1 addition & 4 deletions src/coreclr/jit/compiler.cpp
@@ -2238,11 +2238,8 @@ void Compiler::compSetProcessor()
info.genCPU = CPU_X86_PENTIUM_4;
else
info.genCPU = CPU_X86;

#elif defined(TARGET_LOONGARCH64)

info.genCPU = CPU_LOONGARCH64;

info.genCPU = CPU_LOONGARCH64;
#endif

//
79 changes: 24 additions & 55 deletions src/coreclr/jit/compiler.h
@@ -8323,34 +8323,6 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
}
#endif // DEBUG

// Get highest available level for SIMD codegen
SIMDLevel getSIMDSupportLevel()
{
#if defined(TARGET_XARCH)
if (compOpportunisticallyDependsOn(InstructionSet_AVX2))
{
if (compOpportunisticallyDependsOn(InstructionSet_Vector512))
{
return SIMD_Vector512_Supported;
}

return SIMD_AVX2_Supported;
}

if (compOpportunisticallyDependsOn(InstructionSet_SSE42))
{
return SIMD_SSE4_Supported;
}

// min bar is SSE2
return SIMD_SSE2_Supported;
#else
assert(!"Available instruction set(s) for SIMD codegen is not defined for target arch");
unreached();
return SIMD_Not_Supported;
#endif
}

bool isIntrinsicType(CORINFO_CLASS_HANDLE clsHnd)
{
return info.compCompHnd->isIntrinsicType(clsHnd);
@@ -8781,16 +8753,14 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
var_types getSIMDVectorType()
{
#if defined(TARGET_XARCH)
// TODO-XArch-AVX512 : Return TYP_SIMD64 once Vector<T> supports AVX512.
if (getSIMDSupportLevel() >= SIMD_AVX2_Supported)
if (compOpportunisticallyDependsOn(InstructionSet_AVX2))
{
// TODO-XArch-AVX512 : Return TYP_SIMD64 once Vector<T> supports AVX512.
return TYP_SIMD32;
}
else
{
// Verify and record that AVX2 isn't supported
compVerifyInstructionSetUnusable(InstructionSet_AVX2);
assert(getSIMDSupportLevel() >= SIMD_SSE2_Supported);
return TYP_SIMD16;
}
#elif defined(TARGET_ARM64)
@@ -8823,16 +8793,14 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
unsigned getSIMDVectorRegisterByteLength()
{
#if defined(TARGET_XARCH)
// TODO-XArch-AVX512 : Return ZMM_REGSIZE_BYTES once Vector<T> supports AVX512.
if (getSIMDSupportLevel() >= SIMD_AVX2_Supported)
if (compOpportunisticallyDependsOn(InstructionSet_AVX2))
{
// TODO-XArch-AVX512 : Return ZMM_REGSIZE_BYTES once Vector<T> supports AVX512.
return YMM_REGSIZE_BYTES;
}
else
{
// Verify and record that AVX2 isn't supported
compVerifyInstructionSetUnusable(InstructionSet_AVX2);
assert(getSIMDSupportLevel() >= SIMD_SSE2_Supported);
return XMM_REGSIZE_BYTES;
}
#elif defined(TARGET_ARM64)
@@ -8847,9 +8815,11 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

// maxSIMDStructBytes
// The maximum SIMD size supported by System.Numerics.Vectors or System.Runtime.Intrinsics
// SSE: 16-byte Vector<T> and Vector128<T>
// AVX: 32-byte Vector256<T> (Vector<T> is 16-byte)
// AVX2: 32-byte Vector<T> and Vector256<T>
// Arm.AdvSimd: 16-byte Vector<T> and Vector128<T>
// X86.SSE: 16-byte Vector<T> and Vector128<T>
// X86.AVX: 16-byte Vector<T> and Vector256<T>
// X86.AVX2: 32-byte Vector<T> and Vector256<T>
// X86.AVX512F: 32-byte Vector<T> and Vector512<T>
unsigned int maxSIMDStructBytes()
{
#if defined(FEATURE_HW_INTRINSICS) && defined(TARGET_XARCH)
@@ -8859,17 +8829,22 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
{
return ZMM_REGSIZE_BYTES;
}
return YMM_REGSIZE_BYTES;
else
{
compVerifyInstructionSetUnusable(InstructionSet_AVX512F);
return YMM_REGSIZE_BYTES;
}
}
else
{
// Verify and record that AVX2 isn't supported
compVerifyInstructionSetUnusable(InstructionSet_AVX2);
assert(getSIMDSupportLevel() >= SIMD_SSE2_Supported);
compVerifyInstructionSetUnusable(InstructionSet_AVX);
return XMM_REGSIZE_BYTES;
}
#elif defined(TARGET_ARM64)
return FP_REGSIZE_BYTES;
#else
return getSIMDVectorRegisterByteLength();
assert(!"maxSIMDStructBytes() unimplemented on target arch");
unreached();
#endif
}

@@ -9134,13 +9109,10 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#endif
}

#ifdef TARGET_XARCH
bool canUseVexEncoding() const
{
#ifdef TARGET_XARCH
return compOpportunisticallyDependsOn(InstructionSet_AVX);
#else
return false;
#endif
}

//------------------------------------------------------------------------
@@ -9151,8 +9123,6 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
//
bool canUseEvexEncoding() const
{
#ifdef TARGET_XARCH

#ifdef DEBUG
if (JitConfig.JitForceEVEXEncoding())
{
Expand All @@ -9161,9 +9131,6 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#endif // DEBUG

return compOpportunisticallyDependsOn(InstructionSet_AVX512F);
#else
return false;
#endif
}

//------------------------------------------------------------------------
@@ -9174,7 +9141,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
//
bool DoJitStressEvexEncoding() const
{
#if defined(TARGET_XARCH) && defined(DEBUG)
#ifdef DEBUG
// Using JitStressEVEXEncoding flag will force instructions which would
// otherwise use VEX encoding but can be EVEX encoded to use EVEX encoding
// This requires AVX512VL support. JitForceEVEXEncoding forces this encoding, thus
@@ -9184,14 +9151,16 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
{
return true;
}

if (JitConfig.JitStressEvexEncoding() && compOpportunisticallyDependsOn(InstructionSet_AVX512F_VL))
{
return true;
}
#endif // TARGET_XARCH && DEBUG
#endif // DEBUG

return false;
}
#endif // TARGET_XARCH

/*
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
13 changes: 6 additions & 7 deletions src/coreclr/jit/emit.cpp
@@ -2609,17 +2609,16 @@ void emitter::emitSetFrameRangeArgs(int offsLo, int offsHi)

/*****************************************************************************
*
* A conversion table used to map an operand size value (in bytes) into its
* small encoding (0 through 3), and vice versa.
* A conversion table used to map an operand size value (in bytes) into its emitAttr
*/

const emitter::opSize emitter::emitSizeEncode[] = {
emitter::OPSZ1, emitter::OPSZ2, emitter::OPSZ4, emitter::OPSZ8, emitter::OPSZ16, emitter::OPSZ32, emitter::OPSZ64,
const emitAttr emitter::emitSizeDecode[emitter::OPSZ_COUNT] = {
EA_1BYTE, EA_2BYTE, EA_4BYTE, EA_8BYTE, EA_16BYTE,
#if defined(TARGET_XARCH)
EA_32BYTE, EA_64BYTE,
#endif // TARGET_XARCH
};

const emitAttr emitter::emitSizeDecode[emitter::OPSZ_COUNT] = {EA_1BYTE, EA_2BYTE, EA_4BYTE, EA_8BYTE,
EA_16BYTE, EA_32BYTE, EA_64BYTE};

/*****************************************************************************
*
* Allocate an instruction descriptor for an instruction that uses both
Expand Down