Avx512 extract most significant bits #82731

anthonycanino · 2023-02-27T19:59:11Z

PR addresses #80820.

This adds a new type, TYP_MASK which represents Intel opmask k registers on xarch platform. Enough infrastructure was added so that Vector512.ExtractMostSignificantBits could be hardware accelerated on AVX512 capable platforms, which requires the use of the k-mask registers.

A future PR will implement mask registers in full, which depends upon VectorMask API proposal. At the moment, only a single mask register will ever be used, and is used as an internal register for the ExtractMostSignificantBits method. Follow up PRs will extend this to the Vector512 comparison operators, which require a similar approach.

As an example, the following code...

ulong x = v1.ExtractMostSignificantBits(); // v1 is `Vector512<long>`

roughly lowers to

vmovups zmm0, zmmword ptr[rcx]
vpmovq2m k1, zmm0
kmovb rax, k1

ghost · 2023-02-27T19:59:24Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

PR addresses #80820.

This adds a new type, TYP_MASK which represents Intel opmask k registers on xarch platform. Enough infrastructure was added so that Vector512.ExtractMostSignificantBits could be hardware accelerated on AVX512 capable platforms, which requires the use of the k-mask registers.

A future PR will implement mask registers in full, which depends upon VectorMask API proposal. At the moment, only a single mask register will ever be used, and is used as an internal register for the ExtractMostSignificantBits method. Follow up PRs will extend this to the Vector512 comparison operators, which require a similar approach.

As an example, the following code...

ulong x = v1.ExtractMostSignificantBits(); // v1 is `Vector512<long>`

roughly lowers to

vmovups zmm0, zmmword ptr[rcx]
vpmovq2m k1, zmm0
kmovb rax, k1

Author:	anthonycanino
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

kunalspathak

Made the 1st pass. I have reviewed just the portion that is not part of #80960

src/coreclr/jit/vartype.h

src/coreclr/jit/lsra.cpp

src/coreclr/jit/lsrabuild.cpp

src/coreclr/jit/hwintrinsiclistxarch.h

src/coreclr/jit/targetx86.h

anthonycanino · 2023-03-03T00:28:30Z

@dotnet/avx512-contrib can we get a rerun? Looks like some stuff was timed out?

anthonycanino · 2023-03-04T00:53:00Z

Looks like everything is passing, I have made the requested changes if you would like to give it another pass.

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

kunalspathak · 2023-03-04T03:22:59Z

Changes LGTM. Once #80960 is merged, let us do a final superpmi measurement before merging.

BruceForstall · 2023-03-06T06:05:34Z

@anthonycanino Please resolve the conflicts.

anthonycanino · 2023-03-06T17:21:35Z

I just pulled the changes onto a fresh branch based on main after its dependent PR was merged.

kunalspathak · 2023-03-06T20:53:59Z

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

anthonycanino · 2023-03-06T20:55:28Z

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

Yes will do shortly.

src/coreclr/vm/threadsuspend.cpp

src/coreclr/jit/typelist.h

BruceForstall · 2023-03-06T22:36:22Z

src/coreclr/jit/typelist.h

@@ -64,6 +64,7 @@ DEF_TP(SIMD16   ,"simd16" , TYP_SIMD16,  TI_STRUCT,16,16, 16,   4,16, VTF_S|VTF_
 DEF_TP(SIMD32   ,"simd32" , TYP_SIMD32,  TI_STRUCT,32,32, 32,   8,16, VTF_S|VTF_VEC)
 DEF_TP(SIMD64   ,"simd64" , TYP_SIMD64,  TI_STRUCT,64,64, 64,  16,16, VTF_S|VTF_VEC)
 #endif // TARGET_XARCH
+DEF_TP(MASK     ,"mask"   , TYP_MASK,    TI_STRUCT,8, 8,  8,    2,8, VTF_S)


Should the type be VTF_S|VTF_VEC? That would let varTypeIsSIMD() be true; do we want that? E.g., it would mean varTypeUsesFloatReg returns true, which we probably don't want since TYP_MASK uses a new register type. @tannergooding

Also, I wonder if it should be VTF_S. It's not really a struct, is it, like the other SIMD types?

Maybe it should be VTF_ANY so it doesn't have any default handling. E.g., we probably don't want varTypeIsStruct() to return true.

It is a struct and will eventually map to the VectorMask<T> type(s).

Under the hood we have just k0-k7 which are 64-bits and while the actual hardware manuals don't differentiate further than k#, the .NET and C++ intrinsics do. -- .NET has VectorMask#<T> while C++ has __mmask8/16/32/64. So VectorMask256<byte> is equivalent to __mmask16 since it has 16 entries.

Given that, is just MASK sufficient or do we need MASK8/16/32/64? If we have more than one, presumably we'd have this be VTF_S|VTF_MSK. -- I could see us going either way with there being different tradeoffs for each approach.

src/coreclr/jit/register.h

src/coreclr/jit/vartype.h

src/coreclr/jit/lsra.h

src/coreclr/jit/emitxarch.cpp

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

anthonycanino · 2023-03-07T13:47:51Z

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

FYI, tpdff with no changes...

Overall (+0.32% to +0.52%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.47%
aspnet_block.run.windows.x64.checked.mch	+0.52%
benchmarks.run.windows.x64.checked.mch	+0.34%
coreclr_tests.run.windows.x64.checked.mch	+0.40%
libraries.crossgen2.windows.x64.checked.mch	+0.43%
libraries.pmi.windows.x64.checked.mch	+0.40%
libraries_tests.pmi.windows.x64.checked.mch	+0.32%

MinOpts (+0.53% to +1.18%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.98%
aspnet_block.run.windows.x64.checked.mch	+0.90%
benchmarks.run.windows.x64.checked.mch	+0.74%
coreclr_tests.run.windows.x64.checked.mch	+0.60%
libraries.crossgen2.windows.x64.checked.mch	+1.18%
libraries.pmi.windows.x64.checked.mch	+0.82%
libraries_tests.pmi.windows.x64.checked.mch	+0.53%

FullOpts (+0.23% to +0.43%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.37%
aspnet_block.run.windows.x64.checked.mch	+0.40%
benchmarks.run.windows.x64.checked.mch	+0.34%
coreclr_tests.run.windows.x64.checked.mch	+0.23%
libraries.crossgen2.windows.x64.checked.mch	+0.43%
libraries.pmi.windows.x64.checked.mch	+0.40%
libraries_tests.pmi.windows.x64.checked.mch	+0.31%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	137,122,434,940	137,770,274,733	+0.47%
aspnet_block.run.windows.x64.checked.mch	28,196,864,957	28,342,998,178	+0.52%
benchmarks.run.windows.x64.checked.mch	55,715,871,918	55,907,406,382	+0.34%
coreclr_tests.run.windows.x64.checked.mch	820,707,835,085	823,964,475,471	+0.40%
libraries.crossgen2.windows.x64.checked.mch	123,754,250,291	124,288,544,069	+0.43%
libraries.pmi.windows.x64.checked.mch	234,167,937,468	235,100,838,193	+0.40%
libraries_tests.pmi.windows.x64.checked.mch	506,101,763,851	507,705,699,987	+0.32%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	22,723,762,098	22,946,475,537	+0.98%
aspnet_block.run.windows.x64.checked.mch	6,648,732,841	6,708,641,851	+0.90%
benchmarks.run.windows.x64.checked.mch	427,800,839	430,954,243	+0.74%
coreclr_tests.run.windows.x64.checked.mch	367,123,809,127	369,344,596,817	+0.60%
libraries.crossgen2.windows.x64.checked.mch	1,712,929	1,733,155	+1.18%
libraries.pmi.windows.x64.checked.mch	1,337,999,031	1,348,970,414	+0.82%
libraries_tests.pmi.windows.x64.checked.mch	5,373,609,215	5,402,007,210	+0.53%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	114,398,672,842	114,823,799,196	+0.37%
aspnet_block.run.windows.x64.checked.mch	21,548,132,116	21,634,356,327	+0.40%
benchmarks.run.windows.x64.checked.mch	55,288,071,079	55,476,452,139	+0.34%
coreclr_tests.run.windows.x64.checked.mch	453,584,025,958	454,619,878,654	+0.23%
libraries.crossgen2.windows.x64.checked.mch	123,752,537,362	124,286,810,914	+0.43%
libraries.pmi.windows.x64.checked.mch	232,829,938,437	233,751,867,779	+0.40%
libraries_tests.pmi.windows.x64.checked.mch	500,728,154,636	502,303,692,777	+0.31%

tpdiff with most of IsKInstruction check commented out...

Overall (+0.17% to +0.35%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.32%
aspnet_block.run.windows.x64.checked.mch	+0.33%
benchmarks.run.windows.x64.checked.mch	+0.26%
coreclr_tests.run.windows.x64.checked.mch	+0.17%
libraries.crossgen2.windows.x64.checked.mch	+0.35%
libraries.pmi.windows.x64.checked.mch	+0.30%
libraries_tests.pmi.windows.x64.checked.mch	+0.22%

MinOpts (+0.20% to +0.86%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.49%
aspnet_block.run.windows.x64.checked.mch	+0.42%
benchmarks.run.windows.x64.checked.mch	+0.32%
coreclr_tests.run.windows.x64.checked.mch	+0.22%
libraries.crossgen2.windows.x64.checked.mch	+0.86%
libraries.pmi.windows.x64.checked.mch	+0.39%
libraries_tests.pmi.windows.x64.checked.mch	+0.20%

FullOpts (+0.14% to +0.35%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.28%
aspnet_block.run.windows.x64.checked.mch	+0.30%
benchmarks.run.windows.x64.checked.mch	+0.26%
coreclr_tests.run.windows.x64.checked.mch	+0.14%
libraries.crossgen2.windows.x64.checked.mch	+0.35%
libraries.pmi.windows.x64.checked.mch	+0.30%
libraries_tests.pmi.windows.x64.checked.mch	+0.22%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	137,122,327,070	137,555,517,439	+0.32%
aspnet_block.run.windows.x64.checked.mch	28,196,936,039	28,290,027,003	+0.33%
benchmarks.run.windows.x64.checked.mch	55,716,216,346	55,859,859,330	+0.26%
coreclr_tests.run.windows.x64.checked.mch	820,707,397,894	822,143,099,586	+0.17%
libraries.crossgen2.windows.x64.checked.mch	123,754,441,178	124,186,483,562	+0.35%
libraries.pmi.windows.x64.checked.mch	234,169,440,084	234,865,933,264	+0.30%
libraries_tests.pmi.windows.x64.checked.mch	506,106,782,715	507,212,774,247	+0.22%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	22,723,762,125	22,835,457,571	+0.49%
aspnet_block.run.windows.x64.checked.mch	6,648,733,217	6,676,854,359	+0.42%
benchmarks.run.windows.x64.checked.mch	427,800,890	429,168,439	+0.32%
coreclr_tests.run.windows.x64.checked.mch	367,123,738,248	367,915,727,518	+0.22%
libraries.crossgen2.windows.x64.checked.mch	1,712,930	1,727,636	+0.86%
libraries.pmi.windows.x64.checked.mch	1,337,998,813	1,343,217,279	+0.39%
libraries_tests.pmi.windows.x64.checked.mch	5,373,613,051	5,384,456,794	+0.20%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	114,398,564,945	114,720,059,868	+0.28%
aspnet_block.run.windows.x64.checked.mch	21,548,202,822	21,613,172,644	+0.30%
benchmarks.run.windows.x64.checked.mch	55,288,415,456	55,430,690,891	+0.26%
coreclr_tests.run.windows.x64.checked.mch	453,583,659,646	454,227,372,068	+0.14%
libraries.crossgen2.windows.x64.checked.mch	123,752,728,248	124,184,755,926	+0.35%
libraries.pmi.windows.x64.checked.mch	232,831,441,271	233,522,715,985	+0.30%
libraries_tests.pmi.windows.x64.checked.mch	500,733,169,664	501,828,317,453	+0.22%

Seems like it was having an impact.

Let's see what the tpdiff looks like now --- most everything was behind proper ifdefs before, but availableMaskRegs instance variable was not, as well as TYP_MASK which would affect arm slightly. I've since moved.

anthonycanino · 2023-03-07T17:49:07Z

@kunalspathak Looks better.

@BruceForstall I mostly addressed your comments, with one caveat --- right now, TYP_MASK is defined if FEATURE_SIMD is present and for XARCH only, consistent with SIMD32 and SIMD64. This creates a bit of ifdef dependency.

BruceForstall · 2023-03-07T18:19:57Z

@anthonycanino Where is the TP regression coming from?

anthonycanino · 2023-03-07T20:14:47Z

@anthonycanino Where is the TP regression coming from?

Investigating with pin now.

src/coreclr/jit/lsra.cpp

src/coreclr/jit/lsra.h

BruceForstall · 2023-03-08T16:48:07Z

Diffs

Throughput impact on Windows x64

The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better.

linux arm64

Overall (-0.02% to -0.01%)

Collection	PDIFF
benchmarks.run.linux.arm64.checked.mch	-0.01%
coreclr_tests.run.linux.arm64.checked.mch	-0.01%
libraries.crossgen2.linux.arm64.checked.mch	-0.02%
libraries.pmi.linux.arm64.checked.mch	-0.01%
libraries_tests.pmi.linux.arm64.checked.mch	-0.01%

MinOpts (-0.08% to -0.01%)

Collection	PDIFF
benchmarks.run.linux.arm64.checked.mch	-0.06%
coreclr_tests.run.linux.arm64.checked.mch	-0.01%
libraries.crossgen2.linux.arm64.checked.mch	-0.08%
libraries.pmi.linux.arm64.checked.mch	-0.03%
libraries_tests.pmi.linux.arm64.checked.mch	-0.01%

FullOpts (-0.02% to -0.00%)

Collection	PDIFF
benchmarks.run.linux.arm64.checked.mch	-0.01%
libraries.crossgen2.linux.arm64.checked.mch	-0.02%
libraries.pmi.linux.arm64.checked.mch	-0.01%
libraries_tests.pmi.linux.arm64.checked.mch	-0.01%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm64.checked.mch	71,829,708,758	71,825,058,259	-0.01%
coreclr_tests.run.linux.arm64.checked.mch	1,120,118,011,248	1,120,045,044,886	-0.01%
libraries.crossgen2.linux.arm64.checked.mch	126,407,564,245	126,385,333,940	-0.02%
libraries.pmi.linux.arm64.checked.mch	239,547,008,704	239,517,121,722	-0.01%
libraries_tests.pmi.linux.arm64.checked.mch	573,413,222,714	573,370,564,634	-0.01%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm64.checked.mch	1,185,157,334	1,184,421,151	-0.06%
coreclr_tests.run.linux.arm64.checked.mch	464,245,669,694	464,200,751,165	-0.01%
libraries.crossgen2.linux.arm64.checked.mch	1,959,736	1,958,112	-0.08%
libraries.pmi.linux.arm64.checked.mch	1,644,582,602	1,644,028,948	-0.03%
libraries_tests.pmi.linux.arm64.checked.mch	7,106,434,127	7,105,485,508	-0.01%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm64.checked.mch	70,644,551,424	70,640,637,108	-0.01%
coreclr_tests.run.linux.arm64.checked.mch	655,872,341,554	655,844,293,721	-0.00%
libraries.crossgen2.linux.arm64.checked.mch	126,405,604,509	126,383,375,828	-0.02%
libraries.pmi.linux.arm64.checked.mch	237,902,426,102	237,873,092,774	-0.01%
libraries_tests.pmi.linux.arm64.checked.mch	566,306,788,587	566,265,079,126	-0.01%

osx arm64

Overall (-0.02% to -0.01%)

Collection	PDIFF
benchmarks.run.osx.arm64.checked.mch	-0.01%
coreclr_tests.run.osx.arm64.checked.mch	-0.01%
libraries.crossgen2.osx.arm64.checked.mch	-0.02%
libraries.pmi.osx.arm64.checked.mch	-0.01%
libraries_tests.pmi.osx.arm64.checked.mch	-0.01%

MinOpts (-0.08% to -0.01%)

Collection	PDIFF
benchmarks.run.osx.arm64.checked.mch	-0.03%
coreclr_tests.run.osx.arm64.checked.mch	-0.01%
libraries.crossgen2.osx.arm64.checked.mch	-0.08%
libraries.pmi.osx.arm64.checked.mch	-0.03%
libraries_tests.pmi.osx.arm64.checked.mch	-0.01%

FullOpts (-0.02% to -0.00%)

Collection	PDIFF
benchmarks.run.osx.arm64.checked.mch	-0.01%
libraries.crossgen2.osx.arm64.checked.mch	-0.02%
libraries.pmi.osx.arm64.checked.mch	-0.01%
libraries_tests.pmi.osx.arm64.checked.mch	-0.01%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.osx.arm64.checked.mch	38,164,974,081	38,162,491,846	-0.01%
coreclr_tests.run.osx.arm64.checked.mch	859,418,196,995	859,356,913,472	-0.01%
libraries.crossgen2.osx.arm64.checked.mch	133,045,192,209	133,021,837,042	-0.02%
libraries.pmi.osx.arm64.checked.mch	248,363,814,282	248,331,667,102	-0.01%
libraries_tests.pmi.osx.arm64.checked.mch	546,442,041,072	546,399,704,570	-0.01%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.osx.arm64.checked.mch	574,094,530	573,941,898	-0.03%
coreclr_tests.run.osx.arm64.checked.mch	461,565,891,532	461,521,207,939	-0.01%
libraries.crossgen2.osx.arm64.checked.mch	2,111,547	2,109,807	-0.08%
libraries.pmi.osx.arm64.checked.mch	1,699,673,406	1,699,089,772	-0.03%
libraries_tests.pmi.osx.arm64.checked.mch	7,623,536,291	7,622,404,286	-0.01%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.osx.arm64.checked.mch	37,590,879,551	37,588,549,948	-0.01%
coreclr_tests.run.osx.arm64.checked.mch	397,852,305,463	397,835,705,533	-0.00%
libraries.crossgen2.osx.arm64.checked.mch	133,043,080,662	133,019,727,235	-0.02%
libraries.pmi.osx.arm64.checked.mch	246,664,140,876	246,632,577,330	-0.01%
libraries_tests.pmi.osx.arm64.checked.mch	538,818,504,781	538,777,300,284	-0.01%

linux x64

Overall (+0.13% to +0.26%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.13%
coreclr_tests.run.linux.x64.checked.mch	+0.26%
libraries.crossgen2.linux.x64.checked.mch	+0.16%
libraries.pmi.linux.x64.checked.mch	+0.16%
libraries_tests.pmi.linux.x64.checked.mch	+0.14%

MinOpts (+0.42% to +0.67%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.67%
coreclr_tests.run.linux.x64.checked.mch	+0.45%
libraries.crossgen2.linux.x64.checked.mch	+0.64%
libraries.pmi.linux.x64.checked.mch	+0.58%
libraries_tests.pmi.linux.x64.checked.mch	+0.42%

FullOpts (+0.12% to +0.16%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.13%
coreclr_tests.run.linux.x64.checked.mch	+0.12%
libraries.crossgen2.linux.x64.checked.mch	+0.16%
libraries.pmi.linux.x64.checked.mch	+0.16%
libraries_tests.pmi.linux.x64.checked.mch	+0.14%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.x64.checked.mch	69,973,416,200	70,067,485,678	+0.13%
coreclr_tests.run.linux.x64.checked.mch	870,208,033,751	872,470,252,016	+0.26%
libraries.crossgen2.linux.x64.checked.mch	92,833,460,634	92,985,284,214	+0.16%
libraries.pmi.linux.x64.checked.mch	223,281,109,753	223,646,234,128	+0.16%
libraries_tests.pmi.linux.x64.checked.mch	512,114,946,866	512,853,456,023	+0.14%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.x64.checked.mch	931,862,432	938,085,561	+0.67%
coreclr_tests.run.linux.x64.checked.mch	372,511,404,092	374,175,448,655	+0.45%
libraries.crossgen2.linux.x64.checked.mch	1,455,829	1,465,121	+0.64%
libraries.pmi.linux.x64.checked.mch	1,323,192,173	1,330,840,704	+0.58%
libraries_tests.pmi.linux.x64.checked.mch	6,268,867,376	6,294,950,342	+0.42%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.x64.checked.mch	69,041,553,768	69,129,400,117	+0.13%
coreclr_tests.run.linux.x64.checked.mch	497,696,629,659	498,294,803,361	+0.12%
libraries.crossgen2.linux.x64.checked.mch	92,832,004,805	92,983,819,093	+0.16%
libraries.pmi.linux.x64.checked.mch	221,957,917,580	222,315,393,424	+0.16%
libraries_tests.pmi.linux.x64.checked.mch	505,846,079,490	506,558,505,681	+0.14%

windows arm64

Overall (-0.02% to -0.01%)

Collection	PDIFF
benchmarks.run.windows.arm64.checked.mch	-0.01%
coreclr_tests.run.windows.arm64.checked.mch	-0.01%
libraries.crossgen2.windows.arm64.checked.mch	-0.02%
libraries.pmi.windows.arm64.checked.mch	-0.01%
libraries_tests.pmi.windows.arm64.checked.mch	-0.01%

MinOpts (-0.08% to -0.01%)

Collection	PDIFF
benchmarks.run.windows.arm64.checked.mch	-0.03%
coreclr_tests.run.windows.arm64.checked.mch	-0.01%
libraries.crossgen2.windows.arm64.checked.mch	-0.08%
libraries.pmi.windows.arm64.checked.mch	-0.03%
libraries_tests.pmi.windows.arm64.checked.mch	-0.02%

FullOpts (-0.02% to -0.00%)

Collection	PDIFF
benchmarks.run.windows.arm64.checked.mch	-0.01%
libraries.crossgen2.windows.arm64.checked.mch	-0.02%
libraries.pmi.windows.arm64.checked.mch	-0.01%
libraries_tests.pmi.windows.arm64.checked.mch	-0.01%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.arm64.checked.mch	54,201,662,021	54,198,199,687	-0.01%
coreclr_tests.run.windows.arm64.checked.mch	1,051,165,432,682	1,051,097,135,354	-0.01%
libraries.crossgen2.windows.arm64.checked.mch	143,600,959,958	143,576,062,944	-0.02%
libraries.pmi.windows.arm64.checked.mch	251,844,611,656	251,813,458,793	-0.01%
libraries_tests.pmi.windows.arm64.checked.mch	551,090,992,404	551,050,520,240	-0.01%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.arm64.checked.mch	545,947,019	545,795,218	-0.03%
coreclr_tests.run.windows.arm64.checked.mch	465,525,194,605	465,479,805,506	-0.01%
libraries.crossgen2.windows.arm64.checked.mch	2,118,887	2,117,147	-0.08%
libraries.pmi.windows.arm64.checked.mch	1,733,086,668	1,732,517,304	-0.03%
libraries_tests.pmi.windows.arm64.checked.mch	5,936,698,580	5,935,782,255	-0.02%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.arm64.checked.mch	53,655,715,002	53,652,404,469	-0.01%
coreclr_tests.run.windows.arm64.checked.mch	585,640,238,077	585,617,329,848	-0.00%
libraries.crossgen2.windows.arm64.checked.mch	143,598,841,071	143,573,945,797	-0.02%
libraries.pmi.windows.arm64.checked.mch	250,111,524,988	250,080,941,489	-0.01%
libraries_tests.pmi.windows.arm64.checked.mch	545,154,293,824	545,114,737,985	-0.01%

windows x64

Overall (+0.11% to +0.27%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.22%
aspnet_block.run.windows.x64.checked.mch	+0.25%
benchmarks.run.windows.x64.checked.mch	+0.11%
coreclr_tests.run.windows.x64.checked.mch	+0.27%
libraries.crossgen2.windows.x64.checked.mch	+0.16%
libraries.pmi.windows.x64.checked.mch	+0.15%
libraries_tests.pmi.windows.x64.checked.mch	+0.14%

MinOpts (+0.43% to +0.66%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.66%
aspnet_block.run.windows.x64.checked.mch	+0.64%
benchmarks.run.windows.x64.checked.mch	+0.58%
coreclr_tests.run.windows.x64.checked.mch	+0.47%
libraries.crossgen2.windows.x64.checked.mch	+0.66%
libraries.pmi.windows.x64.checked.mch	+0.62%
libraries_tests.pmi.windows.x64.checked.mch	+0.43%

FullOpts (+0.11% to +0.16%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.12%
aspnet_block.run.windows.x64.checked.mch	+0.13%
benchmarks.run.windows.x64.checked.mch	+0.11%
coreclr_tests.run.windows.x64.checked.mch	+0.11%
libraries.crossgen2.windows.x64.checked.mch	+0.16%
libraries.pmi.windows.x64.checked.mch	+0.15%
libraries_tests.pmi.windows.x64.checked.mch	+0.13%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	140,813,652,137	141,129,728,577	+0.22%
aspnet_block.run.windows.x64.checked.mch	28,217,820,407	28,288,450,930	+0.25%
benchmarks.run.windows.x64.checked.mch	55,793,824,417	55,856,927,078	+0.11%
coreclr_tests.run.windows.x64.checked.mch	821,229,287,058	823,456,845,113	+0.27%
libraries.crossgen2.windows.x64.checked.mch	123,855,428,806	124,049,862,818	+0.16%
libraries.pmi.windows.x64.checked.mch	234,485,615,873	234,837,654,741	+0.15%
libraries_tests.pmi.windows.x64.checked.mch	507,142,444,192	507,833,290,175	+0.14%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	27,529,857,925	27,712,507,266	+0.66%
aspnet_block.run.windows.x64.checked.mch	6,648,727,281	6,691,462,037	+0.64%
benchmarks.run.windows.x64.checked.mch	427,797,145	430,288,887	+0.58%
coreclr_tests.run.windows.x64.checked.mch	367,134,396,157	368,860,772,214	+0.47%
libraries.crossgen2.windows.x64.checked.mch	1,712,897	1,724,206	+0.66%
libraries.pmi.windows.x64.checked.mch	1,337,982,864	1,346,240,014	+0.62%
libraries_tests.pmi.windows.x64.checked.mch	5,373,442,496	5,396,620,061	+0.43%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
aspnet.run.windows.x64.checked.mch	113,283,794,212	113,417,221,311	+0.12%
aspnet_block.run.windows.x64.checked.mch	21,569,093,126	21,596,988,893	+0.13%
benchmarks.run.windows.x64.checked.mch	55,366,027,272	55,426,638,191	+0.11%
coreclr_tests.run.windows.x64.checked.mch	454,094,890,901	454,596,072,899	+0.11%
libraries.crossgen2.windows.x64.checked.mch	123,853,715,909	124,048,138,612	+0.16%
libraries.pmi.windows.x64.checked.mch	233,147,633,009	233,491,414,727	+0.15%
libraries_tests.pmi.windows.x64.checked.mch	501,769,001,696	502,436,670,114	+0.13%

Throughput impact on Windows x86

The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better.

linux arm

Overall (-0.01% to -0.00%)

Collection	PDIFF
libraries.crossgen2.linux.arm.checked.mch	-0.01%

MinOpts (-0.03% to -0.00%)

Collection	PDIFF
benchmarks.run.linux.arm.checked.mch	-0.02%
libraries.crossgen2.linux.arm.checked.mch	-0.03%
libraries.pmi.linux.arm.checked.mch	-0.01%

FullOpts (-0.01% to -0.00%)

Collection	PDIFF
libraries.crossgen2.linux.arm.checked.mch	-0.01%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm.checked.mch	88,925,509,972	88,923,283,376	-0.00%
coreclr_tests.run.linux.arm.checked.mch	1,077,827,852,065	1,077,805,328,402	-0.00%
libraries.crossgen2.linux.arm.checked.mch	88,933,986,476	88,928,563,121	-0.01%
libraries.pmi.linux.arm.checked.mch	243,437,368,111	243,426,162,137	-0.00%
libraries_tests.pmi.linux.arm.checked.mch	508,087,880,991	508,072,032,434	-0.00%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm.checked.mch	1,661,375,568	1,660,964,746	-0.02%
coreclr_tests.run.linux.arm.checked.mch	443,043,133,437	443,030,285,490	-0.00%
libraries.crossgen2.linux.arm.checked.mch	2,144,707	2,144,083	-0.03%
libraries.pmi.linux.arm.checked.mch	1,656,826,419	1,656,621,067	-0.01%
libraries_tests.pmi.linux.arm.checked.mch	7,321,475,376	7,321,117,500	-0.00%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.linux.arm.checked.mch	87,264,134,404	87,262,318,630	-0.00%
coreclr_tests.run.linux.arm.checked.mch	634,784,718,628	634,775,042,912	-0.00%
libraries.crossgen2.linux.arm.checked.mch	88,931,841,769	88,926,419,038	-0.01%
libraries.pmi.linux.arm.checked.mch	241,780,541,692	241,769,541,070	-0.00%
libraries_tests.pmi.linux.arm.checked.mch	500,766,405,615	500,750,914,934	-0.00%

windows x86

Overall (+0.06% to +0.10%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.07%
coreclr_tests.run.windows.x86.checked.mch	+0.10%
libraries.crossgen2.windows.x86.checked.mch	+0.08%
libraries.pmi.windows.x86.checked.mch	+0.08%
libraries_tests.pmi.windows.x86.checked.mch	+0.06%

MinOpts (+0.15% to +0.28%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.23%
coreclr_tests.run.windows.x86.checked.mch	+0.20%
libraries.crossgen2.windows.x86.checked.mch	+0.28%
libraries.pmi.windows.x86.checked.mch	+0.24%
libraries_tests.pmi.windows.x86.checked.mch	+0.15%

FullOpts (+0.05% to +0.08%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.06%
coreclr_tests.run.windows.x86.checked.mch	+0.05%
libraries.crossgen2.windows.x86.checked.mch	+0.08%
libraries.pmi.windows.x86.checked.mch	+0.08%
libraries_tests.pmi.windows.x86.checked.mch	+0.06%

Details

All contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.x86.checked.mch	63,484,806,822	63,526,339,468	+0.07%
coreclr_tests.run.windows.x86.checked.mch	869,224,259,905	870,134,426,998	+0.10%
libraries.crossgen2.windows.x86.checked.mch	129,233,207,301	129,339,015,895	+0.08%
libraries.pmi.windows.x86.checked.mch	263,408,433,055	263,621,679,397	+0.08%
libraries_tests.pmi.windows.x86.checked.mch	547,530,257,762	547,879,527,062	+0.06%

MinOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.x86.checked.mch	413,217,394	414,153,173	+0.23%
coreclr_tests.run.windows.x86.checked.mch	309,589,082,449	310,208,578,985	+0.20%
libraries.crossgen2.windows.x86.checked.mch	2,103,307	2,109,257	+0.28%
libraries.pmi.windows.x86.checked.mch	1,313,693,745	1,316,901,905	+0.24%
libraries_tests.pmi.windows.x86.checked.mch	5,375,510,184	5,383,557,156	+0.15%

FullOpts contexts:

Collection	Base # instructions	Diff # instructions	PDIFF
benchmarks.run.windows.x86.checked.mch	63,071,589,428	63,112,186,295	+0.06%
coreclr_tests.run.windows.x86.checked.mch	559,635,177,456	559,925,848,013	+0.05%
libraries.crossgen2.windows.x86.checked.mch	129,231,103,994	129,336,906,638	+0.08%
libraries.pmi.windows.x86.checked.mch	262,094,739,310	262,304,777,492	+0.08%
libraries_tests.pmi.windows.x86.checked.mch	542,154,747,578	542,495,969,906	+0.06%

src/coreclr/jit/lsra.cpp

tannergooding · 2023-03-08T17:17:35Z

src/coreclr/jit/lsra.cpp

+        else if (thisType == TYP_MASK)
+        {
+            availableRegs[i] = &availableMaskRegs;
+        }


I think this is a general case like the other where integers are the most common, but they're at the bottom of the checked cases for the loop.

Presumably changing this to either a switch or to prioritize setting availableIntRegs as the first path would be beneficial so we don't have "one extra check" before hitting the path.

Actually, looking at this, this entire thing is just very unnecessarily expensive. We could track the classification in the typelist.h so we could do:

#define DEF_TP(tn, nm, jitType, verType, sz, sze, asze, st, al, tf, ar) availableRegs[static_cast<int>(TYP_##tn)] = &ar; #include "typelist.h" #undef DEF_TP TYP_COUNT

This would involve no loops, no complex checks as to type kind/etc.

@kunalspathak, thoughts? Probably not that impactful overall, since its just part of the LSRA constructor; but still seems "meaningful" given we have around 25 TYP_* defines and this is doing 4-5 checks to initialize 17 of them and 3-4 checks to initialize 6 others.

-- Not saying this needs to be handled in this PR, its something we can do as a follow up if its something we want to do.

I have updated #83109 to take care of it.

tannergooding · 2023-03-08T17:24:34Z

src/coreclr/jit/lsra.h

+    PhasedVar<regMaskTP> availableFloatRegs;
+    PhasedVar<regMaskTP> availableDoubleRegs;


@kunalspathak, it's not clear why we have availableFloatRegs and availableDoubleRegs. Could you elaborate a bit?

Given that availableDoubleRegs is used to account for both double and TYP_SIMD and that float/double/simd are all the "same registers", it's not immediately clear why we have the split.

Even on Arm32 where you have 1x double register taking up 2x float registers, it seems like it'd be simpler/easier to track this as a single "register file" and for double to simply take up 2 bits.

-- Not something to handle in this PR, more making observations and asking questions as this code is being touched in general and looking for potential opportunities to reduce the impact of changes like this, particularly given how impactful LSRA changes are to overall TP.

Honestly, I don't know historical reason behind having 2 separate variables for floats and double although we have #define RBM_ALLDOUBLE RBM_ALLFLOAT in all targets except ARM. I will have to spend some time in understanding the implication if we unify them.

anthonycanino · 2023-03-10T17:17:55Z

Does this look good now? The throughput numbers seem pretty good.

BruceForstall · 2023-03-10T18:28:32Z

Diffs -- looks good.

BruceForstall

LGTM. Would like threadsuspend.cpp removed from the change.

Would like @tannergooding to also review.

src/coreclr/vm/threadsuspend.cpp

src/coreclr/jit/instrsxarch.h

anthonycanino · 2023-03-10T23:04:36Z

I think I've addressed all issues?

tannergooding · 2023-03-10T23:30:32Z

Will get this reviewed a bit later tonight and should be able to get it merged at that time.

tannergooding · 2023-03-12T15:53:18Z

src/coreclr/jit/emitxarch.cpp

+// Technically, K instructions would be considered under the VEX encoding umbrella, but due to
+// the instruction table encoding had to be pulled out with the rest of the `INST5` definitions.


Comment needs to be updated slightly since KInstructions are part of the INST3 group now.

tannergooding · 2023-03-12T15:53:26Z

src/coreclr/jit/emitxarch.cpp

@@ -46,7 +53,7 @@ bool emitter::IsSSEOrAVXInstruction(instruction ins)
 bool emitter::IsAvx512OrPriorInstruction(instruction ins)
 {
    // TODO-XArch-AVX512: Fix check once AVX512 instructions are added.
-    return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);
+    return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION));


nit:

Suggested change

return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION));

return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);

Stale review

ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 27, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 27, 2023

kunalspathak previously requested changes Feb 28, 2023

View reviewed changes

ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 28, 2023

build-analysis bot mentioned this pull request Mar 2, 2023

Tracking issue for CI build timeouts #76454

Closed

DeepakRajendrakumaran mentioned this pull request Mar 3, 2023

Initial support for zmm in .NET #80960

Merged

kunalspathak reviewed Mar 4, 2023

View reviewed changes

src/coreclr/jit/hwintrinsiccodegenxarch.cpp Outdated Show resolved Hide resolved

anthonycanino added 5 commits March 6, 2023 08:59

Add TYP_MASK and Vector512.ExtractMostSignificantBits.

f61cbd2

Rebase / rename error fix.

ff0ac75

Review edits.

e663eb4

Formatting.

dcbc6a9

Review edits.

40a46eb

anthonycanino force-pushed the avx512-extract-most-significant-bits branch from 6cf682d to 40a46eb Compare March 6, 2023 17:21

BruceForstall reviewed Mar 6, 2023

View reviewed changes

Review cleanup.

2468b10

Build fixes.

6340763

build-analysis bot mentioned this pull request Mar 7, 2023

NuGet failing with Response status code does not indicate success: 503 (Service Unavailable) dotnet/arcade#11723

Open

5 tasks

tannergooding reviewed Mar 7, 2023

View reviewed changes

src/coreclr/jit/lsra.cpp Outdated Show resolved Hide resolved

tannergooding reviewed Mar 7, 2023

View reviewed changes

src/coreclr/jit/lsra.h Outdated Show resolved Hide resolved

kunalspathak mentioned this pull request Mar 7, 2023

Segregate the register iterations based on intType/floatType #83109

Closed

2 tasks

Address throughput issues pertaining to availableRegCount.

41c4b14

tannergooding reviewed Mar 8, 2023

View reviewed changes

src/coreclr/jit/lsra.cpp Outdated Show resolved Hide resolved

tannergooding reviewed Mar 8, 2023

View reviewed changes

src/coreclr/jit/lsra.cpp Outdated Show resolved Hide resolved

tannergooding reviewed Mar 8, 2023

View reviewed changes

kmov RR refactor.

3fb7f78

BruceForstall approved these changes Mar 10, 2023

View reviewed changes

src/coreclr/vm/threadsuspend.cpp Outdated Show resolved Hide resolved

src/coreclr/jit/instrsxarch.h Outdated Show resolved Hide resolved

anthonycanino added 3 commits March 10, 2023 10:45

Split kmov into kmov_msk and kmov_gpr.

1c19de3

Fix thread.

45f4b90

Review edits.

46c9e27

BruceForstall added the avx512 Related to the AVX-512 architecture label Mar 10, 2023

tannergooding reviewed Mar 12, 2023

View reviewed changes

tannergooding approved these changes Mar 12, 2023

View reviewed changes

tannergooding merged commit 8599161 into dotnet:main Mar 12, 2023

teo-tsirpanis mentioned this pull request Mar 12, 2023

Implement light-weight K register ExtractMostSignificantBits for Vector512 #80820

Closed

EgorBo mentioned this pull request Mar 13, 2023

jitstress-random: Assertion failed 'node->OperIsLIR()' #83332

Closed

kotlarmilos mentioned this pull request Mar 14, 2023

[Perf] Linux/x64: 1 Regression on 3/12/2023 3:57:11 PM dotnet/perf-autofiling-issues#13993

Closed

tannergooding mentioned this pull request Mar 15, 2023

Change LSRA to more efficiently initialize the availableRegs array #83479

Merged

kotlarmilos mentioned this pull request Mar 23, 2023

[Perf] Linux/arm64: 2 Regressions on 3/12/2023 3:57:11 PM dotnet/perf-autofiling-issues#14286

Closed

ghost locked as resolved and limited conversation to collaborators Apr 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avx512 extract most significant bits #82731

Avx512 extract most significant bits #82731

anthonycanino commented Feb 27, 2023

ghost commented Feb 27, 2023

kunalspathak left a comment

anthonycanino commented Mar 3, 2023

anthonycanino commented Mar 4, 2023

kunalspathak commented Mar 4, 2023

BruceForstall commented Mar 6, 2023

anthonycanino commented Mar 6, 2023

kunalspathak commented Mar 6, 2023

anthonycanino commented Mar 6, 2023

BruceForstall Mar 6, 2023

tannergooding Mar 7, 2023

anthonycanino commented Mar 7, 2023

anthonycanino commented Mar 7, 2023

BruceForstall commented Mar 7, 2023

anthonycanino commented Mar 7, 2023

BruceForstall commented Mar 8, 2023

tannergooding Mar 8, 2023

tannergooding Mar 8, 2023

kunalspathak Mar 8, 2023

tannergooding Mar 8, 2023

tannergooding Mar 8, 2023 •

edited

Loading

kunalspathak Mar 8, 2023

anthonycanino commented Mar 10, 2023

BruceForstall commented Mar 10, 2023

BruceForstall left a comment

anthonycanino commented Mar 10, 2023

tannergooding commented Mar 10, 2023

tannergooding Mar 12, 2023

tannergooding Mar 12, 2023

		PhasedVar<regMaskTP> availableFloatRegs;
		PhasedVar<regMaskTP> availableDoubleRegs;

		// Technically, K instructions would be considered under the VEX encoding umbrella, but due to
		// the instruction table encoding had to be pulled out with the rest of the `INST5` definitions.

	return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION));
	return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);

Avx512 extract most significant bits #82731

Avx512 extract most significant bits #82731

Conversation

anthonycanino commented Feb 27, 2023

ghost commented Feb 27, 2023

kunalspathak left a comment

Choose a reason for hiding this comment

anthonycanino commented Mar 3, 2023

anthonycanino commented Mar 4, 2023

kunalspathak commented Mar 4, 2023

BruceForstall commented Mar 6, 2023

anthonycanino commented Mar 6, 2023

kunalspathak commented Mar 6, 2023

anthonycanino commented Mar 6, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anthonycanino commented Mar 7, 2023

anthonycanino commented Mar 7, 2023

BruceForstall commented Mar 7, 2023

anthonycanino commented Mar 7, 2023

BruceForstall commented Mar 8, 2023

Throughput impact on Windows x64

linux arm64

osx arm64

linux x64

windows arm64

windows x64

Throughput impact on Windows x86

linux arm

windows x86

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding Mar 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anthonycanino commented Mar 10, 2023

BruceForstall commented Mar 10, 2023

BruceForstall left a comment

Choose a reason for hiding this comment

anthonycanino commented Mar 10, 2023

tannergooding commented Mar 10, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding Mar 8, 2023 •

edited

Loading