-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Enable a faster mod/GetCurrentProcessorId #27588
Conversation
Any performance numbers? |
Moved public static int GetCurrentProcessorId()
{
// TODO: Implement correctly
return Environment.CurrentManagedThreadId;
} So the change it would need is dropping that method and adding public static int GetCurrentProcessorNumber()
{
// TODO: Implement correctly
return -1;
} And it will pickup the benefit from the caching of the mod value /cc @filipnavara @marek-safar |
if (currentProcessorId < 0) currentProcessorId = Environment.CurrentManagedThreadId; | ||
|
||
// Ensure the Id is in range of the ProcessorCount | ||
currentProcessorId = (int)((uint)currentProcessorId % (uint)Environment.ProcessorCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe in a VM proc Id could be above ProcessorCount. Do we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you as to whether its accurate vs in range 0-ProcCount?
The first means a mod is needed for each use; the second a mod every 5000 calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another concern is - is this cheap? The number of cores could be pretty odd nowdays.
I generally round up the number of stripes to the next ^2 - so that I could use binary masking which is much cheaper.
In that case some slots may not be used due to the oddness of the proc number (or due to affinity), but that is never a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the mod will be changed to not be a mod at Tier1 as the divisor will be a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the second a mod every 5000 calls.
@VSadov has done a number of experiments around this and we believe that this limit should be <50.
On more recent hardware, none of this TLS caching should be needed and this should just call RDPID
instruction that is both fastest and most accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the exact coreid is important over an in range value it would be better to expose RDPID
or similar via intrinsics to expose that? Then you can check the .IsSupported
flags, do something different for Arm vs Intel etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CurrentProcessorId
is platform neutral concept. I do not think we need to have yet another set of hardware intrinsics for it.
it also currently adds 100 to it
I would be ok with dropping this and making this API to return the underlying Processor ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to push the mod bypasses out into the callsites rather than in GetCurrentProcessorId
CurrentProcessorId
is platform neutral concept.
However, my point is it doesn't return a platform neutral value; its platform specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could drop the +100
part.
It is not a big problem for reducer though. Until you have 100+ cores in the system any method of reducing - modding, masking, hashing, should work the same.
After 100+ cores, yes, it could be a bit of a problem for those who use masking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just changed the call sites in #27600 if that is preferred?
Good catch with the missing Mono implementation. I will look into it. |
It returns -1 for some Unixes also; this change means Mono will have the same caching behaviour as coreclr |
Using #27600 in preference |
Ensure
GetCurrentProcessorId()
is in range 0 - ProcCount as with #27543 ProcCount is now a stable value.In
TimerQueueTimer
useGetCurrentProcessorId()
directly as it no longer needs mod.In
TlsOverPerCoreLockedStacksArrayPool
use a pattern that can be elided by Tier1 Jitting and skip mod if ProcCount < 64Originally I was aiming to use
FastMod
; however I realised the Jit would do it itself at Tier1 as thereadonly static
s would becomeconst
s/cc @stephentoub @jkotas @VSadov