A few optimizations for the gcinfodecoder construction #96150
Conversation
src/coreclr/inc/gcinfotypes.h
DWORD lzcountCeil;
_BitScanReverse(&lzcountCeil, (unsigned long)x);
#else // _MSC_VER
UINT32 lzcountCeil = (UINT32)__builtin_clz((unsigned int)x);
I believe __builtin_clz already encodes the subtract from BITS_PER_SIZE_T within the intrinsic function unlike _BitScanReverse
Right, I've just realized that even though lzcnt is encoded similarly to bsr on x64, the results are offsets from opposite ends of the word.
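To illustrate the "opposite ends" point, here is a minimal sketch (the helper below is hypothetical, not the gcinfotypes.h code): `_BitScanReverse` reports the position of the highest set bit measured up from bit 0, while `__builtin_clz` reports how many zero bits precede it measured down from the most significant bit, so one value has to be converted to match the other.

```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical helper: index of the highest set bit of a non-zero 32-bit value.
inline uint32_t HighestSetBitIndex(uint32_t x)
{
#ifdef _MSC_VER
    unsigned long index;
    _BitScanReverse(&index, x);              // counted up from bit 0
    return (uint32_t)index;
#else
    return 31u - (uint32_t)__builtin_clz(x); // clz counts down from bit 31
#endif
}
```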
To measure the impact I used the following microbenchmark compiled with NativeAOT. The numbers are averaged GC Gen0 pauses in milliseconds; lower is better. I see a ~20% improvement.
==== Before the change:
==== After the change:
@@ -275,6 +276,7 @@ class BitStreamReader
     m_InitialRelPos = other.m_InitialRelPos;
     m_pCurrent = other.m_pCurrent;
     m_RelPos = other.m_RelPos;
+    m_current = other.m_current;
On 64-bit, one native word can act as a "buffer" for quite a few reads when each read takes only a few bits. This change reduces the number of indirect reads from the bit stream and may allow the compiler to enregister the "buffer".
 // NOTE: This routine is perf-critical
 __forceinline size_t ReadOneFast()
 {
     SUPPORTS_DAC;

-    size_t result = (*m_pCurrent) & (((size_t)1) << m_RelPos);
+    size_t result = m_current & 1;
+    m_current >>= 1;
The point of this change is to use a fixed-size shift, which is typically faster than a variable-sized shift. The same applies to Read(int numBits) when we read a fixed-size nibble.
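A minimal sketch of the buffered fast path (field names follow the diff above; the refill logic shown here is my reading of the change, not a verbatim copy of the PR):

```cpp
// Sketch: m_current caches the word at *m_pCurrent, so each read is a constant
// shift/mask on a register instead of an indirect read plus a variable-count
// shift to build the mask.
__forceinline size_t ReadOneFast()
{
    SUPPORTS_DAC;

    size_t result = m_current & 1;
    m_current >>= 1;

    if (++m_RelPos == BITS_PER_SIZE_T)
    {
        // refill the one-word "buffer"; rare (once per 64 reads on 64-bit)
        m_RelPos = 0;
        m_current = *(++m_pCurrent);
    }

    return result;
}
```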
Yes, but you pay for it with extra memory writes. Is it really a win in the end?
For example, here are timings for Intel 11th gen from
https://www.agner.org/optimize/instruction_tables.pdf (page 350):
| SHR SHL SAR r,i | 1 | 1 |
| SHR SHL SAR m,i | 4 | 4 |
| SHR SHL SAR r,cl | 2 | 2 |
| SHR SHL SAR m,cl | 5 | 6 |
Constant shift of memory is twice the cost of variable shift of register.
I will have to recheck the original codegen, but I think what was happening is that we would do an indirect read and then apply a mask that was constructed via a variable shift of 1.
I guess that was because we need the result in a register and do not want to change the bit stream, and the way m_pCurrent and m_RelPos were changing did not allow the compiler to hoist/CSE/enregister either the result of the indirect read or the computed mask.
Interestingly, for Zen3 the table gives no difference whatsoever between the immediate and CL shift versions:
SHL, SHR, SAR r,i/CL | 1 | 1 | 0.5
yet the profiler finds CL shift operations relatively expensive.
I often see that in other code, so I always assumed that variable shifts are somewhat costly.
Maybe it is not the instruction itself but the whole sequence of dependent instructions - load 1, load CL, make a mask, do a read, apply the mask - and the sampling profiler just attributes most of that to the shift.
The code does get measurably faster after the change. I had a few other changes that did not improve anything, so I did not include them, but this one definitely helped.
The C++ compiler inlines and interleaves statement parts quite aggressively, so it is a bit hard to see what belongs where, but I can see that a few "expensive" CL shifts are gone.
Here is an example of what happened to a loop like:
for(UINT32 i = 0; i < numSlots; i++)
{
    if(m_Reader.ReadOneFast())
        numCouldBeLiveSlots++;
}
It is not an example of expensive/hot code, just something that is easier to read and see how codegen changed.
==== original code
00007FF700CE9C4E test r13d,r13d
00007FF700CE9C51 je GcInfoDecoder::EnumerateLiveSlots+898h (07FF700CE9C98h)
00007FF700CE9C53 mov r8d,r13d
00007FF700CE9C56 nop word ptr [rax+rax]
00007FF700CE9C60 mov ecx,r15d
00007FF700CE9C63 mov rax,r11 ; r11 has `1` in it, assigned outside of the loop
00007FF700CE9C66 shl rax,cl ; depends on 2 instructions above
00007FF700CE9C69 and rax,qword ptr [rdx] ; depends on instruction above + read (L1 is 3-5 cycles)
00007FF700CE9C6C inc r15d
00007FF700CE9C6F mov dword ptr [rdi+18h],r15d
00007FF700CE9C73 cmp r15d,40h
00007FF700CE9C77 jne GcInfoDecoder::EnumerateLiveSlots+88Bh (07FF700CE9C8Bh)
00007FF700CE9C79 add rdx,8 ; moving to the next word, relatively rare (1 in 64 reads)
00007FF700CE9C7D mov dword ptr [rdi+18h],0
00007FF700CE9C84 mov qword ptr [rdi+10h],rdx
00007FF700CE9C88 xor r15d,r15d
00007FF700CE9C8B test rax,rax
00007FF700CE9C8E je GcInfoDecoder::EnumerateLiveSlots+893h (07FF700CE9C93h)
00007FF700CE9C90 inc r12d
00007FF700CE9C93 sub r8,r11
vs.
==== new
00007FF688D6A454 test r12d,r12d
00007FF688D6A457 je GcInfoDecoder::EnumerateLiveSlots+0C04h (07FF688D6A4B4h)
00007FF688D6A459 mov rax,r11
00007FF688D6A45C mov edx,r12d
00007FF688D6A45F nop
00007FF688D6A460 mov rcx,rax
00007FF688D6A463 inc r8d ; can run together or even before the previous instruction
00007FF688D6A466 shr rax,1 ; this and the next can run at the same time
00007FF688D6A469 and ecx,1 ; both only depend on "mov rcx,rax"
00007FF688D6A46C mov qword ptr [rbx+20h],rax ; we do writes, but we do not need to wait for them
00007FF688D6A470 mov dword ptr [rbx+18h],r8d
00007FF688D6A474 mov qword ptr [rsp+48h],rax ; not sure what is stored here, in other cases we have just 2 writes
00007FF688D6A479 cmp r8d,40h
00007FF688D6A47D jne GcInfoDecoder::EnumerateLiveSlots+0BEBh (07FF688D6A49Bh)
00007FF688D6A47F add qword ptr [rbx+10h],8 ; moving to the next word, relatively rare (1 in 64 reads)
00007FF688D6A484 mov rax,qword ptr [rbx+10h]
00007FF688D6A488 xor r8d,r8d
00007FF688D6A48B mov rax,qword ptr [rax]
00007FF688D6A48E mov qword ptr [rbx+20h],rax
00007FF688D6A492 mov qword ptr [rsp+48h],rax
00007FF688D6A497 mov dword ptr [rbx+18h],r8d
00007FF688D6A49B test rcx,rcx
00007FF688D6A49E je GcInfoDecoder::EnumerateLiveSlots+0BF3h (07FF688D6A4A3h)
00007FF688D6A4A0 inc r13d
00007FF688D6A4A3 sub rdx,1
bool m_IsInterruptible;
bool m_IsVarArg;
bool m_GenericSecretParamIsMD;
bool m_GenericSecretParamIsMT;
These are derived from masking the header flags. Masking is cheap and we may not even be asked for these, so we can do masking in the accessors.
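A sketch of the accessor-masking idea; the flag names and bit values below are made-up placeholders, not the real gcinfo header layout:

```cpp
#include <cstdint>

// Hypothetical flag bits (illustrative only).
enum : uint32_t
{
    FLAG_IS_VARARG          = 1u << 0,
    FLAG_IS_INTERRUPTIBLE   = 1u << 1,
    FLAG_SECRET_PARAM_IS_MD = 1u << 2,
    FLAG_SECRET_PARAM_IS_MT = 1u << 3,
};

class HeaderFlagsSketch
{
    uint32_t m_headerFlags = 0;   // captured once while reading the header

public:
    // Masking is cheap and callers may never ask, so do it lazily in the accessors
    // instead of storing one bool per property during construction.
    bool GetIsVarArg() const            { return (m_headerFlags & FLAG_IS_VARARG) != 0; }
    bool IsInterruptible() const        { return (m_headerFlags & FLAG_IS_INTERRUPTIBLE) != 0; }
    bool GenericSecretParamIsMD() const { return (m_headerFlags & FLAG_SECRET_PARAM_IS_MD) != 0; }
    bool GenericSecretParamIsMT() const { return (m_headerFlags & FLAG_SECRET_PARAM_IS_MT) != 0; }
};
```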
// Use flag mask to bail out early if we already decoded all the pieces that caller requested
int remainingFlags = flags == DECODE_EVERYTHING ? ~0 : flags;

if (!slimHeader)
Predecoding a fat header could be much more involved compared to slim headers, so let's deal with fat headers in a separate helper.
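Roughly, the shape of the change is sketched below; the names, flag values, and early-out points are assumptions for illustration and differ in detail from the PR's actual code.

```cpp
// Illustrative-only sketch of "decode only what was asked for" plus the
// slim/fat header split; none of these names are real gcinfodecoder symbols.
enum DecodeFlagsSketch : int
{
    SK_DECODE_RETURN_KIND = 1 << 0,
    SK_DECODE_VARARG      = 1 << 1,
    SK_DECODE_EVERYTHING  = ~0,
};

static void PredecodeFatHeaderSketch(int /*remainingFlags*/)
{
    // the more involved fat-header predecoding lives out of line,
    // keeping the common constructor path small
}

static void PredecodeHeaderSketch(int flags, bool slimHeader)
{
    // clear bits as pieces are decoded; stop once the caller has everything it asked for
    int remainingFlags = (flags == SK_DECODE_EVERYTHING) ? ~0 : flags;

    // ... decode a piece that is always read ...
    remainingFlags &= ~SK_DECODE_RETURN_KIND;
    if (remainingFlags == 0)
        return;

    if (slimHeader)
    {
        // slim headers encode only a handful of fields; a few cheap reads suffice inline
    }
    else
    {
        PredecodeFatHeaderSketch(remainingFlags);
    }
}
```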
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
I think this is ready for review.
src/coreclr/vm/gcinfodecoder.cpp
@@ -383,54 +415,81 @@ bool GcInfoDecoder::IsSafePoint(UINT32 codeOffset)
    if(m_NumSafePoints == 0)
        return false;

#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
Removal of this ifdef is changing behavior for Linux x86. I guess that it is a bug fix. I do not see a reason why safepoint handling should be different for Linux x86.
cc @gbalykov
I assumed we can't get here with x86 since we use a different gcinfo encoder. If that is not the case and we can on Linux, I will put this back.
Overall, I think this -1 hack is trying to solve a problem that does not exist, and it would be better if we could stop doing/undoing this artificial adjustment, but that is not relevant to this PR. I just thought that the #if defined is always true here.
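For reference, the pattern being discussed is approximately the following; this is paraphrased, and only the #if condition is quoted from the diff above.

```cpp
#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
    // Safepoint offsets were stored with an artificial -1 adjustment, so the lookup
    // undoes it here; dropping the #if makes the adjustment unconditional, which is
    // the Linux x86 behavior change noted above.
    codeOffset--;
#endif
```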
@jkotas thanks for sharing this. Currently almost all CLR tests fail on Linux x86 on the main branch, so there are other issues that need to be investigated first. We'll probably take a look at this change after that.
LGTM. Thank you!
It looks like the linux-arm64 Release NativeAOT_Libraries is stuck in a loop repeating the same:
It is not happening in my newer test run (#85694), so perhaps it is something that got fixed recently.
Thanks!!!
Constructing a gcinfodecoder performs some initial decoding. While the time spent in initial decoding is typically less than the time spent enumerating live slots, it is not insignificant. There are some opportunities to make the initial decoding a bit cheaper.