Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Avoid mod operator when fast alternative available #27299

Merged
merged 12 commits into from
Oct 26, 2019

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Oct 19, 2019

Use fastmod when 128bit multiply is available. (by @lemire https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/)

Api for review dotnet/corefx#41822 to enable 128bit multiply and make it more efficient.

Other half of PR from #27149 (first half was #27195)

|                       Method |   Toolchain |      Mean | Ratio |
|----------------------------- |------------ |----------:|------:|
|    TryGetValue_17_Int_Object |        base |  8.427 ns |  1.00 |
|    TryGetValue_17_Int_Object |        diff |  7.070 ns |  0.84 |
|    TryGetValue_3k_Int_Object |        base |  7.611 ns |  1.00 |
|    TryGetValue_3k_Int_Object |        diff |  6.635 ns |  0.87 |
|    TryGetValue_1M_Int_Object |        base |  7.743 ns |  1.00 |
|    TryGetValue_1M_Int_Object |        diff |  6.803 ns |  0.88 |
|                              |             |           |       |
| TryGetValue_17_String_Object |        base | 16.233 ns |  1.00 |
| TryGetValue_17_String_Object |        diff | 14.775 ns |  0.91 |
| TryGetValue_3k_String_Object |        base | 22.777 ns |  1.00 |
| TryGetValue_3k_String_Object |        diff | 21.387 ns |  0.94 |
| TryGetValue_1M_String_Object |        base | 74.274 ns |  1.00 |
| TryGetValue_1M_String_Object |        diff | 72.219 ns |  0.97 |
|                              |             |           |       |
| TryGetValue_17_Object_Object |        base | 17.341 ns |  1.00 |
| TryGetValue_17_Object_Object |        diff | 17.396 ns |  1.01 |
| TryGetValue_3k_Object_Object |        base | 24.407 ns |  1.00 |
| TryGetValue_3k_Object_Object |        diff | 22.979 ns |  0.94 |
| TryGetValue_1M_Object_Object |        base | 96.327 ns |  1.00 |
| TryGetValue_1M_Object_Object |        diff | 88.482 ns |  0.92 |

KNucleotide

|        Method | Toolchain |     Mean | Ratio |
|-------------- |---------- |---------:|------:|
| KNucleotide_1 |      base | 378.8 ms |  1.00 |
| KNucleotide_1 |      diff | 352.4 ms |  0.93 |

/cc @jkotas @AntonLapounov @GrabYourPitchforks @AndyAyersMS @danmosemsft

@danmoseley
Copy link
Member

Seems you tried various @benadams, what range of results did you get
https://gist.github.com/benaadams/b3ba5e45c1f0a13b649296a9660ea960

(second hit when I searched for the hex number!)

@benaadams
Copy link
Member Author

😄 They were mostly all terrible as they kept the same order of where the changes were happening (least significant bits); except Crc32 which mixed them up. Reversing the bits also worked well, but there is no fast way to do that on x86.

Summary is:

modulo is chopping off all the most significant bits; so it emphasises small changes; but large changes dissappear.

fastrange is chopping of the least significant bits; so it emphasises large changes; but small changes dissappear.

.NET simple hashes (like int where its 1:1) are small changes (sequential numbers); so to get similar behaviour to modulo from fastrange we need to reverse the bit order. That's not very fast (except on arm), however...

Crc32 is a mixer and scrambles the bits; as a bonus the reflected poly also reverses them (though that doesn't seem to be significant); and its fast (1 cycle, 3 cycle latency); so that also works.

@benaadams
Copy link
Member Author

benaadams commented Oct 19, 2019

Note: good hashcodes e.g. Marvin (randomised string) and System.HashCode (xxHash32) don't need the Crc32 to work with fastrange; however branching based on hashcode type would take longer than just using Crc32. (also wouldn't detect the type for the infinite variety of user hashcodes)

danmoseley
danmoseley previously approved these changes Oct 19, 2019
@danmoseley
Copy link
Member

I guess to cover the modulo path we will rely on the ARM test runs for now. https://github.com/dotnet/corefx/issues/36113

@danmoseley
Copy link
Member

You ran all the dictionary perf tests I guess? I suppose Insert is essentially the same. @adamsitnik any reservations about our perf test coverage of dictionary?

@danmoseley danmoseley requested a review from jkotas October 19, 2019 05:21
@benaadams
Copy link
Member Author

Yes, also added a new set dotnet/performance#938 as @AndyAyersMS pointed out strictly sequential keys weren't being tested; which is problematic since the bucketing was being changed and sequential integers was one of the issues for fastrange by itself.

@benaadams
Copy link
Member Author

One caveat with this change from @GrabYourPitchforks is #27149 (comment)

We'd need to run the CRC32 logic by the crypto board to make sure that we're not improperly compressing the range of the hash function. The final step of CRC32 is a finite field polynomial division, which likely introduces a slight bias toward zero. It's probably safe but needs an extra set of eyes from somebody more steeped in this field.

@jkotas
Copy link
Member

jkotas commented Oct 19, 2019

You ran all the dictionary perf tests I guess?

We also need to run perf tests with tiered JITing off to make sure that we are not regressing Bing and other folks who run in this config. We may need to do a work in crossgen/crossgen2 to handle this pattern well and avoid regressions in this config.

@benaadams benaadams force-pushed the dict-divisor branch 2 times, most recently from a16535d to 1f00260 Compare October 19, 2019 11:22
@danmoseley danmoseley dismissed their stale review October 19, 2019 13:14

Need measurements mentioned by Jan

@benaadams
Copy link
Member Author

Needs more investigation; gets worse when there are 1M (int) keys

|                    Method | Toolchain |      Mean | Ratio |
|-------------------------- |---------- |----------:|------:|
| TryGetValue_17_Int_Object |      base |  8.160 ns |  1.00 |
| TryGetValue_17_Int_Object |      diff |  5.763 ns |  0.71 |
|                           |           |           |       |
| TryGetValue_3k_Int_Object |      base |  8.213 ns |  1.00 |
| TryGetValue_3k_Int_Object |      diff |  7.236 ns |  0.88 |
|                           |           |           |       |
| TryGetValue_1M_Int_Object |      base |  8.323 ns |  1.00 |
| TryGetValue_1M_Int_Object |      diff | 25.507 ns |  3.07 |

Can I get a [no-merge] label?

@stephentoub stephentoub added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 19, 2019
@AntonLapounov
Copy link
Member

AntonLapounov commented Oct 20, 2019

TryGetValue_1M_Int_Object

@benaadams Are you basically comparing sequential memory access (base) to random memory access (diff)? Sequential memory access will always be faster due to hardware prefetching. Try looking up pseudorandom keys for both 'base' and 'diff'.

@NinoFloris
Copy link

@benaadams did you try fibonacci hashing too? I found it a pretty insightful blog https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

Fastrange by itself really isn't great for the reason you described, it's throwing away lower bits. Fastrange + crc is more expensive than fibonacci, fibonacci has some issue in the last upper bit (so sort of mirrors modulos upper bits behavior better) and finally fibonacci should actually get better with larger tables.

Worth a shot?

(Implementation https://github.com/skarupke/flat_hash_map/blob/812aede752d789033a7c439e8fd4f8d81522a642/flat_hash_map.hpp#L1270-L1273)

@AntonLapounov
Copy link
Member

Fastrange + crc is more expensive than fibonacci

This is not comparing apples to apples. If the size of the hash table is not a power of two, then Fibonacci alone won't work. And if the size is a power of two, Fastrange is not needed.

@benaadams
Copy link
Member Author

Could change to fastmod which has the advantage of being the exact same behviour as modulo.

Disadvantage is it adds a ulong to Dictionary; and more problematic here is we can still only make use of it behind an intrinsic check as Math. doesn't expose a 64bit * 64bit => 128bit multiply https://github.com/dotnet/corefx/issues/41822

@danmoseley
Copy link
Member

Cc @tannergooding for 128 bit multiply ask.

@tannergooding
Copy link
Member

Cc @tannergooding for 128 bit multiply ask.

Ben already opened an issue for that here: https://github.com/dotnet/corefx/issues/41822. We should hopefully get it marked ready-for-review shortly :)

@benaadams
Copy link
Member Author

Doesn't need the No Merge tag any more as not changing the behaviour (whether we want to merge it is a different thing of course 😉 )

@benaadams
Copy link
Member Author

K, back on track

|                       Method | Toolchain |      Mean | Ratio |
|----------------------------- |---------- |----------:|------:|
|    TryGetValue_17_Int_Object |      base |  8.938 ns |  1.00 |
|    TryGetValue_17_Int_Object |      diff |  6.229 ns |  0.69 |
|                              |           |           |       |
|    TryGetValue_3k_Int_Object |      base |  7.657 ns |  1.00 |
|    TryGetValue_3k_Int_Object |      diff |  5.727 ns |  0.75 |
|                              |           |           |       |
|    TryGetValue_1M_Int_Object |      base |  7.721 ns |  1.00 |
|    TryGetValue_1M_Int_Object |      diff |  6.831 ns |  0.86 |
|                              |           |           |       |
| TryGetValue_17_String_Object |      base | 16.263 ns |  1.00 |
| TryGetValue_17_String_Object |      diff | 13.354 ns |  0.82 |
|                              |           |           |       |
| TryGetValue_3k_String_Object |      base | 22.804 ns |  1.00 |
| TryGetValue_3k_String_Object |      diff | 19.498 ns |  0.86 |
|                              |           |           |       |
| TryGetValue_1M_String_Object |      base | 79.410 ns |  1.00 |
| TryGetValue_1M_String_Object |      diff | 69.598 ns |  0.88 |
|                              |           |           |       |
| TryGetValue_17_Object_Object |      base | 17.572 ns |  1.00 |
| TryGetValue_17_Object_Object |      diff | 15.245 ns |  0.87 |
|                              |           |           |       |
| TryGetValue_3k_Object_Object |      base | 24.880 ns |  1.00 |
| TryGetValue_3k_Object_Object |      diff | 22.257 ns |  0.89 |
|                              |           |           |       |
| TryGetValue_1M_Object_Object |      base | 96.808 ns |  1.00 |
| TryGetValue_1M_Object_Object |      diff | 87.274 ns |  0.90 |

@benaadams
Copy link
Member Author

|        Method | Toolchain |     Mean | Ratio |
|-------------- |---------- |---------:|------:|
| KNucleotide_1 |      base | 369.8 ms |  1.00 |
| KNucleotide_1 |      diff | 337.5 ms |  0.91 |
| KNucleotide_9 |      base | 91.43 ms |  1.00 |
| KNucleotide_9 |      diff | 88.44 ms |  0.97 |

@benaadams
Copy link
Member Author

I am assuming dotnet/corefx#41822 will solve the issue with the intrinsic test in the R2R when Tiered Jitting is disabled as it will become a normal multiply?

Copy link
Member

@AntonLapounov AntonLapounov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the issue in Resize, changes look good to me. This should benefit most applications, though may lead to increased memory consumption in applications creating lots of small Dictionary instances.

@danmoseley
Copy link
Member

increased memory consumption in applications creating lots of small Dictionary instances.

We did remove a similar sized field last release (a ref for SyncRoot). I guess it cancels out...

Co-Authored-By: Jan Kotas <jkotas@microsoft.com>
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - nice work!

@jkotas jkotas merged commit e532bf6 into dotnet:master Oct 26, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@benaadams benaadams deleted the dict-divisor branch October 26, 2019 14:22
jkotas pushed a commit to dotnet/corert that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
marek-safar pushed a commit to mono/mono that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Oct 26, 2019
)

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@GrabYourPitchforks
Copy link
Member

If there is concern about memory usage, one thing you could do is to move rarely-accessed data into a separate instance. For example:

public class Dictionary<TKey, TValue>
{
    private KeyCollection _keys; // REMOVE me
    private ValueCollection _values; // REMOVE me

    private RareMembers _rareMembers; // INTRODUCE me

    private sealed class RareMembers
    {
        internal KeyCollection Keys;
        internal ValueCollection Values;
    }
}

It saves 1 reference in any Dictionary<TKey, TValue> instance where the Keys or Values property isn't accessed. (I'd wager that describes most dictionary instances.) The trade-off is that if you do access either of those members there's a bit more indirection and a bit more memory usage.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants