Adding Avx10v1 to the runtime #99784
Conversation
@dotnet/avx512-contrib @BruceForstall Is there a similar group for 'arch-avx10' that we can add? Don't think we have permission to add the 'arch-avx10' label.
src/coreclr/inc/clrconfigvalues.h
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAVX10v1_V256, W("EnableAVX10v1_V256"), 1, "Allows AVX10v1_V256+ hardware intrinsics to be disabled")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAVX10v1_V512, W("EnableAVX10v1_V512"), 1, "Allows AVX10v1_V512+ hardware intrinsics to be disabled")
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_Avx10MaxVectorLength, W("Avx10MaxVectorLength"), 0, "The max vector length supported", CLRConfig::LookupOptions::ParseIntegerAsBase10)
I don't think we need these three.

In general these DOTNET_Enable{Isa} knobs are primarily there for testing purposes. That is, they exist so that developers with newer hardware can still test downlevel code paths without having to recompile their applications. Such code is not typically meant for use in actual production scenarios.

To that end, we typically expose one knob per CPUID ISA bit, and so you can disable Avx but not Avx.X64. Avx512F.VL was somewhat an exception and, in hindsight, probably a knob we don't actually need to let users control. I expect that the V256 and V512 nested classes fit into the same bucket, where users don't need the ability to enable/disable them directly and independently of Avx10 itself.

Instead, users who need to control access to the used vector bit width can utilize the DOTNET_PreferredVectorBitWidth knob and the corresponding Vector###.IsHardwareAccelerated checks, which is how they're expected to detect this in existing code paths.
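For instance, a library that wants the widest accelerated path would key off those checks rather than off any per-ISA knob; a minimal C# sketch (the helper name is illustrative, not part of this PR):

using System.Runtime.Intrinsics;

static class VectorWidthHelper
{
    // Illustrative only: pick the widest vector width the runtime will accelerate.
    // DOTNET_PreferredVectorBitWidth feeds into these IsHardwareAccelerated checks,
    // so user code never needs a separate enable/disable knob per nested ISA class.
    public static int PreferredWidthInBits()
    {
        if (Vector512.IsHardwareAccelerated) return 512;
        if (Vector256.IsHardwareAccelerated) return 256;
        if (Vector128.IsHardwareAccelerated) return 128;
        return 0; // scalar fallback
    }
}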
-- @BruceForstall, there should be no issue in removing a config knob like the various DOTNET_EnableAvx512*_VL entries, right?

We can simply remove them and document that users should utilize DOTNET_PreferredVectorBitWidth and DOTNET_EnableAvx512* instead? This would allow a consistent user story here and reduce the overall complexity we need to support/consider.
I don't see any problem with removing the DOTNET_EnableAvx512*_VL configs. They are available in Release and have been shipped, and use the EXTERNAL prefix, which indicates (or at least historically indicated) a documented config. (Maybe we should have used the UNSUPPORTED prefix?) So technically removing them is a breaking change, but it doesn't seem like a problem.
Thanks for the feedback.

Please correct me if I am wrong: we want to remove DOTNET_EnableAvx10v1_V256/512 and DOTNET_Avx10MaxVectorLength, and keep DOTNET_EnableAvx10v1, letting it plus DOTNET_PreferredVectorBitWidth control the emit behavior of the Vector256/512 APIs. (Which I suppose won't be covered in this PR.)

As for the VL vars, do we want to handle them in this PR, or separately?
so we want to

Yep!

Which I suppose won't be covered in this PR

This should already be somewhat implicit based on the existing checks we have in the JIT, so I don't expect we need to do anything special. Users will still see Avx10v1.V512 report supported if the hardware actually supports it, and they would simply check Vector512.IsHardwareAccelerated if they want to limit 512-bit usage to the user-selected behavior (which is controlled by PreferredVectorBitWidth).

As for the VL vars, do we want to handle it in this PR, or separately?

I'll cover it in a separate PR.
src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt
Removed separate env vars to control Avx10v1_V256/512; now Avx10v1 instructions with different vector lengths will all be controlled by EnableAvx10v1 alone.
public const int Avx10v1 = 20000000;
public const int Avx10v1_v256 = 40000000;
public const int Avx10v1_v512 = 50000000;
These should be hex, and the last one should be 8, not 5.
Suggested change:
public const int Avx10v1 = 0x20000000;
public const int Avx10v1_v256 = 0x40000000;
public const int Avx10v1_v512 = 0x80000000;
public const int Avx10v1_v512 = 0x80000000;

This line will raise an error that 0x80000000 cannot be implicitly converted to int. It can be fixed by explicitly defining it as public const int Avx10v1_v512 = -2147483648;, but I wonder if this is a proper fix (I suppose it should be fine in terms of the number itself, as it is just a flag bit), considering we have reached the maximum number of ISAs this design can hold, and we are expecting more ISAs in the future, e.g. higher versions of Avx10, APX, etc.
Either public const int Avx10v1_v512 = -2147483648; or public const int Avx10v1_v512 = unchecked((int)0x80000000); should be fine.

Changing it to uint or ulong to hold more bits might also be appropriate; it's probably a better question for @MichalStrehovsky on how he wants it handled long term for this part of the codebase.
Changing it to uint or ulong to hold more bits might also be appropriate, it's probably a better question for @MichalStrehovsky on how he wants it handled long term for this part of the codebase.
I don't have an opinion; whatever works. At this pace we may run out of ulong bits too. I don't have a good understanding of why we need so many bits: the VL suffixes look redundant; the various VectorT bits are also redundant with other flags we set at the same time, etc.

I was involved in the design for this up to AVX2/BMI but haven't really thought about it since. The old design doesn't scale if we use bits at this pace, because we can't make SSE3/SSE41/etc. baseline yet (and free up bits that way), but we may already need new bits. It would be better for someone who actually understands this to design a shape that will work.

There is an advantage if this can fit in 32 bits (the run-time check when IsSupported is evaluated dynamically at runtime uses the same enum, and 32 bits are simpler), but we can also introduce yet another enum for those purposes that doesn't waste bits, or have two ints, or something like that. If we can reclaim those 7 bits, it might be enough breathing room to keep using this enum and 32 bits.
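If the storage ever does need to grow past 32 bits, one purely hypothetical shape (not something this PR does; the type and member names below are illustrative) would be to widen the flag constants to 64 bits while keeping the one-bit-per-ISA scheme:

// Hypothetical sketch only: 64-bit flag storage with the same single-bit-per-ISA idea.
[System.Flags]
public enum XArchIntrinsicFlags : ulong
{
    Avx10v1      = 1UL << 29,
    Avx10v1_V256 = 1UL << 30,
    Avx10v1_V512 = 1UL << 31,
    // 32 additional bits of headroom for future ISAs (later AVX10 versions, APX, ...).
}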
Debug.Assert(InstructionSet.X64_AVX10v1_V256 == InstructionSet.X86_AVX10v1_V256);
if (supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V256))
{
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX2));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_FMA));
}

Debug.Assert(InstructionSet.X64_AVX10v1_V512 == InstructionSet.X86_AVX10v1_V512);
if (supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V512))
{
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V256));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512F));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512F_VL));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512BW));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512BW_VL));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512CD));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512CD_VL));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512DQ));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512DQ_VL));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512VBMI));
    Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512VBMI_VL));
}
I'm not sure we need these.

The main intent of these paths is to add optimisticInstructionSet entries, that is, ones where a cached CPUID check is beneficial for AOT code. I don't think that's worth doing for Avx10v1 here, since it would only be applicable if Avx512F was there, and that itself requires VL support.

The Debug.Assert calls are then there to set up the expectation that Avx512F was only supported when BW+CD+DQ+VL were also supported, even though they are technically supersets of Avx512F. We don't have a similar consideration here, so no Debug.Assert is needed.
src/native/minipal/cpufeatures.c
const int versionMask = 0xFF; // [7:0]
if((cpuidInfo[CPUID_EBX] & versionMask) >= 1) // version higher than 1
{
    result |= XArchIntrinsicConstants_Avx10v1;
nit: We should explicitly check that EBX[16] is set to indicate AVX10/V128. The bit exists, so we should be explicit about checking it.
src/native/minipal/cpufeatures.c
const int vector256Mask = (1 << 17);
const int vector512Mask = (1 << 18);
if((cpuidInfo[CPUID_EBX] & vector256Mask) != 0)
{
    result |= XArchIntrinsicConstants_Avx10v1_V256;
}

if((cpuidInfo[CPUID_EBX] & vector512Mask) != 0)
{
    result |= XArchIntrinsicConstants_Avx10v1_V512;
}
nit: These can only be considered if AVX10/V128 is set and if the version was greater than 1, so we should have them appropriately nested.
src/native/minipal/cpufeatures.c
    result |= XArchIntrinsicConstants_Avx10v1;
}

const int vector256Mask = (1 << 17);
nit: None of the other checks use a const int like this; they inline the (1 << 17) into the if statement and then add a comment indicating what bit it is checking.
I can change it to the commonly used form; it does look unnecessary to define it this way, as the bit-shift form should be easy enough to read in this case.
1. Make sure the version check and length check are correctly nested. 2. Remove unnecessary debug asserts in managed code.
LGTM.
CC @dotnet/jit-contrib for secondary review and merge.
Please resolve the merge conflict with the JIT GUID
    CPUCompileFlags.Set(InstructionSet_AVX10v1);
}

if (((cpuFeatures & XArchIntrinsicConstants_Avx10v1_V256) != 0))
Doesn't this also need && CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_EnableAVX10v1)? And the Avx10v1_V512 case below?

That is, shouldn't all the AVX10 instruction sets be subject to the same configuration flag? (Since you didn't add separate flags for each.)
Please see the discussion here.

We intended for the implications that Avx10v1_V512 guarantees Avx10v1_V256, and Avx10v1_V256 guarantees Avx10v1, to handle the invalid case (the implications are confirmed by the Avx10 spec doc). That is, the function EnsureValidInstructionSetSupport would cancel any illegal flag combination set during the flow.

I can put some comments there if needed.
Comments would be useful.
So on an AVX10 + V512 machine, with DOTNET_EnableAVX10v1=0, we'll leave this code with InstructionSet_AVX10v1_V256 and InstructionSet_AVX10v1_V512 set but InstructionSet_AVX10v1 unset.

Then, we call CPUCompileFlags.EnsureValidInstructionSetSupport(); to "clean it up" in EnsureInstructionSetFlagsAreValid, which has this new code:
if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V512) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1_V256))
resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V512);
if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V256) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1))
resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V256);
if (resultflags.HasInstructionSet(InstructionSet_AVX10v1) && !resultflags.HasInstructionSet(InstructionSet_AVX2))
resultflags.RemoveInstructionSet(InstructionSet_AVX10v1);
The first case won't hit, but the others will. That seems to mean that InstructionSet_AVX10v1_V512 will not be disabled.
If the code ran in a different order, it would work (i.e., check from deepest in the dependency tree first).
I see, thanks for pointing that out. In that case, if we want to keep the idea of only checking EnableAvx10v1 once, I can make Avx10v1_V512 imply both Avx10v1 and Avx10v1_V256, so there will be an extra check like:

if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V512) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1))
    resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V512);

This should be able to avoid the described case.

I presume changing the order of the checks could fix the issue, but it might not be a solid solution: if the order changes somehow in the future, this "implicit" bug could be hard to find.
I guess it makes sense to add:
implication ,X86 ,AVX10v1_V512 ,AVX10v1
and the auto-generated code will do the rest?
Seems like either implication handling could be improved to handle transitive implications better, or, implications are not expected to be transitive. It would be helpful if the documentation/comments made it clear.
I agree, I will document the conclusion in the code.
I think the issue may just be in the order the implications are listed in the doc. We typically want the newest ISAs listed at the bottom
If this PR is no longer in "Draft", then please mark it "Ready for review".
I am not sure how I can resolve this conflict. Shall I just rebase the branch and overwrite the GUID with the one in my branch, or do I need to rebase and then re-generate the ID?
I believe "merge" is preferred over "rebase". In any case, simply overwrite the GUID with yours. No need to create a new GUID, since yours is already new and hasn't been used. |
1. update the ISA implication
src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt
adjust the order of ISA implication.
Failures look irrelevant.
Overview

As discussed in #98069, we will expose the Avx10 converged vector ISA in the runtime. This PR adds the CPUID detection logic for the first version. Following the spec doc (C1 1.2), availability of Avx10v1 is checked using the paradigm given there.

We also exposed the following environment variables to control the ISA support:

Avx10MaxVectorLength: an integer that specifies the max vector length expected.
EnableAvx10v1: covers all the instructions operating on vector length 128.
EnableAvx10v1_V256: covers all the instructions operating on vector length 256.
EnableAvx10v1_V512: covers all the instructions operating on vector length 512.

Avx10MaxVectorLength has priority over the other three vars, e.g. when Avx10MaxVectorLength < 512 and EnableAvx10v1_V512 = 1, instructions under Avx10v1_V512 will not be available.
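For completeness, a small managed-side sketch of how application code would be expected to probe the new surface once this lands (assuming the Avx10v1 and Avx10v1.V512 intrinsic classes discussed in #98069):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class Avx10Probe
{
    static void Main()
    {
        // Hardware/runtime support for the 128/256-bit AVX10.1 surface.
        Console.WriteLine($"Avx10v1:      {Avx10v1.IsSupported}");
        // 512-bit surface; per the discussion above, this reports hardware support
        // independently of the user-selected preferred vector width.
        Console.WriteLine($"Avx10v1.V512: {Avx10v1.V512.IsSupported}");
        // Whether 512-bit codegen is actually preferred (DOTNET_PreferredVectorBitWidth).
        Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");
    }
}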