-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize HexConverter.EncodeToUtf16 using SSSE3 #44111
Conversation
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label. |
FWIW, I believe the common use cases for this API are when the input is <= 32 bytes, and occasionally 64 bytes. The number of use cases past that point are vanishingly small. Egor's table doesn't include entries for 32 or 64 bytes, but interpolating the data suggests SSSE3-intrinsicified should see 3x throughput compared to baseline. |
Co-authored-by: Adeel Mujahid <3840695+am11@users.noreply.github.com>
Does it look good? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left some really low-pri comments, feel free to ignore as you see fit. Thanks for driving this. :)
@stephentoub @jkotas can this be merged? |
What are the number for these public APIs? |
|
||
// The high bytes (0x00) of the chars have also been converted | ||
// to ascii hex '0', so clear them out. | ||
hex = Sse2.And(hex, Vector128.Create((ushort)0xFF).AsByte()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea for micro-optimization.
Vector.Create
creates the constant, that needs to be read from memory. With the other "constants" used here, it's likely they will be brought together into L1D, so the cost should be low.
But it still evicts a cacheline from L1D's set, so one could avoid this load by using bit-shifting for the masking. Something like
// (pseudo code style)
tmp = Sse2.ShiftLeftLogical(vec, 8)
Sse2.ShiftRightLogical(tmp, 8)
Latency and throughput for the shift is good, but this introduces a register dependency.
In micro-benchmarks the L1D will be hot, so it won't harm here, but in real usages that's likely not the case, so avoiding the load may be a plus.
TBH I don't know if it's worth it and so far it's more theory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea! However, I don't see any differences in benchmarks (even slightly slower).
Sse2.And(hex, Vector128.Create((ushort)0xFF))
is emitted as vpand xmm0, xmm0, xmmword ptr [reloc @RWD00]
(memory load without an additional register)
It probably makes sense to hoist them from loop too but it also doesn't affect the benchmarks (almost).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't affect the benchmarks
L1D is hot, so I expected this.
For real world usage I doubt the benchmarks will show a measurable difference too.
But according the theory a different cacheline could be kept in the sets of L1D, as it's not evicted, which could be good in the overall.
To show this in a benchmark one would need the address of the @RWD00
so the cache set is known (with cpu data from cpu-z, etc.). Then load from the address and other data(s), that "fight" for the same cache set, so this set is evicted constantly. But this is almost impossible to do reliable in a benchmark.
The main goal was to optimize the new APIs, I couldn't find any noticeable difference for others,
|
@jeffhandley did you mean to assign this to me? @EgorBo are we waiting for anything else before merge? |
@jeffhandley @GrabYourPitchforks it's finished from my end |
Were you just waiting for another signoff then, @EgorBo? |
Ah, I never merged non-mono related PRs 🙂, can I merge it now then? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, cool. Yeah, go for it, @EgorBo.
Based on my dotnet/aspnetcore#18406 (comment) (and improved by @benaadams)
Self-contained benchmark: https://gist.github.com/EgorBo/b0da2cd604f713a7767df907d5a3dfa6
Some of the directly affected APIs:
System.Net.Http.AuthenticationHelper.ComputeHash(string data, string algorithm)
System.Converter.ToHexString(ReadOnlySpan<byte> bytes) -- new API
System.Converter.ToHexString(byte[] inArray) -- new API
Will add more test-cases to
ConvertToHexStringTests