Avoid mod operator when fast alternative available #27299
src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs
Seems you tried various @benaadams, what range of results did you get (second hit when I searched for the hex number!) |
😄 They were mostly all terrible, as they kept the changes in the same place (the least significant bits); the exception was Crc32, which mixed them up. Reversing the bits also worked well, but there is no fast way to do that on x86. Summary: modulo chops off all the most significant bits, so it emphasises small changes, but large changes disappear. fastrange chops off the least significant bits, so it emphasises large changes, but small changes disappear. .NET simple hashes (like Crc32 is a mixer and scrambles the bits; as a bonus the reflected poly also reverses them (though that doesn't seem to be significant); and it's fast (1 cycle, 3 cycle latency); so that also works. |
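To illustrate the summary above, here is a minimal sketch of the two bucket mappings. It is not the code from this PR; the class and method names are hypothetical, and the fastrange form shown is the usual "high half of a 64-bit product" formulation.

```csharp
using System;

// Hypothetical helper, for illustration only (not HashHelpers from the PR).
public static class BucketMappingSketch
{
    // Classic mapping: remainder of the hash divided by the bucket count.
    // Hashes that differ by a multiple of bucketCount land in the same bucket.
    public static uint ModuloBucket(uint hash, uint bucketCount) => hash % bucketCount;

    // fastrange: the high 32 bits of the 64-bit product hash * bucketCount.
    // Small hashes (for example sequential integers) all land in bucket 0,
    // because only the upper bits of the hash influence the result.
    public static uint FastRangeBucket(uint hash, uint bucketCount) =>
        (uint)(((ulong)hash * bucketCount) >> 32);
}
```

With 7 buckets, the keys 0, 1, 2, 3 map to buckets 0, 1, 2, 3 under modulo but all map to bucket 0 under fastrange, which is why sequential integer keys need a mixing step (such as Crc32) before fastrange.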
Note: good hashcodes e.g. Marvin (randomised string) and System.HashCode (xxHash32) don't need the Crc32 to work with fastrange; however branching based on hashcode type would take longer than just using Crc32. (also wouldn't detect the type for the infinite variety of user hashcodes) |
I guess to cover the modulo path we will rely on the ARM test runs for now. https://github.com/dotnet/corefx/issues/36113 |
You ran all the dictionary perf tests I guess? I suppose Insert is essentially the same. @adamsitnik any reservations about our perf test coverage of dictionary? |
Yes, also added a new set dotnet/performance#938 as @AndyAyersMS pointed out strictly sequential keys weren't being tested; which is problematic since the bucketing was being changed and sequential integers were one of the issues for fastrange by itself. |
One caveat with this change from @GrabYourPitchforks is #27149 (comment)
|
We also need to run perf tests with tiered JITing off to make sure that we are not regressing Bing and other folks who run in this config. We may need to do some work in crossgen/crossgen2 to handle this pattern well and avoid regressions in this config. |
src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs
a16535d to 1f00260
Needs more investigation; gets worse when there are 1M (int) keys
Can I get a [no-merge] label? |
@benaadams Are you basically comparing sequential memory access (base) to random memory access (diff)? Sequential memory access will always be faster due to hardware prefetching. Try looking up pseudorandom keys for both 'base' and 'diff'. |
@benaadams did you try Fibonacci hashing too? I found it a pretty insightful blog https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ Fastrange by itself really isn't great for the reason you described: it's throwing away the lower bits. Fastrange + crc is more expensive than Fibonacci; Fibonacci has some issue in the last upper bit (so it sort of mirrors modulo's upper-bits behavior better); and finally Fibonacci should actually get better with larger tables. Worth a shot? (Implementation https://github.com/skarupke/flat_hash_map/blob/812aede752d789033a7c439e8fd4f8d81522a642/flat_hash_map.hpp#L1270-L1273) |
This is not comparing apples to apples. If the size of the hash table is not a power of two, then Fibonacci alone won't work. And if the size is a power of two, Fastrange is not needed. |
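For readers unfamiliar with the technique discussed in the two comments above, a minimal sketch of Fibonacci hashing for 32-bit hashes follows. It assumes a power-of-two bucket count, which is exactly the restriction raised in the reply; the class and method names are hypothetical and this is not code from the PR or from the linked flat_hash_map.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class FibonacciHashingSketch
{
    // 2^32 divided by the golden ratio, rounded to an odd integer (0x9E3779B9).
    private const uint GoldenRatioMultiplier = 2654435769u;

    // Maps a hash to a bucket in a table with 2^log2BucketCount buckets.
    // Assumes 1 <= log2BucketCount <= 31. The multiply wraps modulo 2^32 and
    // the top log2BucketCount bits of the product select the bucket.
    public static uint Bucket(uint hash, int log2BucketCount)
    {
        if (log2BucketCount < 1 || log2BucketCount > 31)
            throw new ArgumentOutOfRangeException(nameof(log2BucketCount));

        return unchecked(hash * GoldenRatioMultiplier) >> (32 - log2BucketCount);
    }
}
```

With 16 buckets (`log2BucketCount` of 4), the sequential keys 1, 2, 3 map to buckets 9, 3 and 13, so neighbouring keys are spread across the table without a separate mixing step.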
1f00260 to ab30be9
Could change to fastmod which has the advantage of being the exact same behaviour as modulo. Disadvantage is it adds a |
Cc @tannergooding for 128 bit multiply ask. |
Ben already opened an issue for that here: https://github.com/dotnet/corefx/issues/41822. We should hopefully get it marked |
Doesn't need the |
ff3bbda to ee71bd6
K, back on track
I am assuming dotnet/corefx#41822 will solve the issue with the intrinsic test in the R2R when tiered JITing is disabled as it will become a normal multiply? |
src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Besides the issue in Resize, changes look good to me. This should benefit most applications, though may lead to increased memory consumption in applications creating lots of small Dictionary instances.
We did remove a similar sized field last release (a ref for SyncRoot). I guess it cancels out... |
src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Co-Authored-By: Jan Kotas <jkotas@microsoft.com>
Thank you - nice work!
If there is concern about memory usage, one thing you could do is to move rarely-accessed data into a separate instance. For example:
public class Dictionary<TKey, TValue>
{
    private KeyCollection _keys; // REMOVE me
    private ValueCollection _values; // REMOVE me
    private RareMembers _rareMembers; // INTRODUCE me
    private sealed class RareMembers
    {
        internal KeyCollection Keys;
        internal ValueCollection Values;
    }
}
It saves 1 reference in any |
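A possible shape for the accessors under that suggestion is sketched below. This is only an illustration under assumptions: the type `RareMembersSketch`, the `HasRareMembers` property, and the use of `List<T>` as a stand-in for `KeyCollection`/`ValueCollection` are all hypothetical, and the real `Dictionary` is not structured this way.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a dictionary-like type, for illustration only.
public class RareMembersSketch<TKey, TValue>
{
    // One reference field in every instance instead of two.
    private RareMembers _rareMembers;

    private sealed class RareMembers
    {
        internal List<TKey> Keys;     // stand-in for KeyCollection
        internal List<TValue> Values; // stand-in for ValueCollection
    }

    // True once any rarely-used member has been requested.
    public bool HasRareMembers => _rareMembers != null;

    // The holder object and the collections are allocated only on first use.
    public List<TKey> Keys => (_rareMembers ??= new RareMembers()).Keys ??= new List<TKey>();

    public List<TValue> Values => (_rareMembers ??= new RareMembers()).Values ??= new List<TValue>();
}
```

The trade-off is an extra allocation and an extra indirection for callers that do use the rare members, in exchange for a smaller object for callers that never do.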
Use fastmod when 128bit multiply is available. (by @lemire https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/)
API for review dotnet/corefx#41822 to enable 128bit multiply and make it more efficient.
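For context, a minimal sketch of the fastmod idea from the linked blog post, for 32-bit values and divisors. It is not the implementation in this PR: the class and method names are hypothetical, and it assumes a runtime where `Math.BigMul(ulong, ulong, out ulong)` is available to obtain the high 64 bits of a 128-bit product.

```csharp
using System;

// Hypothetical helper, for illustration only (not HashHelpers from the PR).
public static class FastModSketch
{
    // Precomputed once per divisor: floor((2^64 - 1) / divisor) + 1.
    // For divisor == 1 this wraps to 0, which still yields a remainder of 0.
    public static ulong GetMultiplier(uint divisor)
    {
        if (divisor == 0) throw new DivideByZeroException();
        return unchecked(ulong.MaxValue / divisor + 1);
    }

    // Returns value % divisor using two multiplies and no division.
    public static uint FastMod(uint value, uint divisor, ulong multiplier)
    {
        // Low 64 bits of multiplier * value (wraps modulo 2^64).
        ulong lowBits = unchecked(multiplier * value);

        // High 64 bits of the 128-bit product lowBits * divisor is the remainder.
        return (uint)Math.BigMul(lowBits, (ulong)divisor, out ulong _);
    }
}
```

The multiplier would be recomputed whenever the bucket count changes (for example on resize) and stored next to the buckets, so each lookup pays only for the two multiplies.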
Other half of PR from #27149 (first half was #27195)
KNucleotide
/cc @jkotas @AntonLapounov @GrabYourPitchforks @AndyAyersMS @danmosemsft