-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Improve performance of span-based ToUpper and related APIs #20275
Improve performance of span-based ToUpper and related APIs #20275
Conversation
{ | ||
*b++ = ToLowerAsciiInvariant(*a++); | ||
length++; | ||
goto NonAsciiSkipTwoChars; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind these gotos is to keep execution flow within the central loop as tight as possible. In the common case (all ASCII data), the CPU just executes through the instructions sequentially; in the uncommon case (non-ASCII data), the CPU evaluates and follows a jcc instruction.
Putting logic like currIdx += 2;
inside the if block would've introduced jmp statements in the common case, which I tried to avoid.
ChangeCaseCommon<TConversion>(ref MemoryMarshal.GetReference(source), ref MemoryMarshal.GetReference(destination), source.Length); | ||
} | ||
|
||
private void ChangeCaseCommon<TConversion>(ref char source, ref char destination, int charCount) where TConversion : struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 4 parameters (this
+ 3 explicit) allows x64 calling convention to enregister all parameters without stack-spilling.
src/System.Private.CoreLib/shared/Internal/Runtime/CompilerServices/Unsafe.cs
Outdated
Show resolved
Hide resolved
internal static partial class Utf16Utility | ||
{ | ||
/// <summary> | ||
/// Returns true iff the DWORD represents two ASCII UTF-16 characters in machine endianness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DWORD
is Windowsism. This should say uint or UInt32.
} | ||
|
||
fixed (char* pSource = &MemoryMarshal.GetReference(source)) | ||
fixed (char* pResult = &MemoryMarshal.GetReference(destination)) | ||
if (IsAsciiCasingSameAsInvariant) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's a desire to do so, we can refactor this "Boolean that might not be initialized" logic out into a standalone type. In an ideal world, the codegen would look like the following.
cmp byte ptr [ptr_to_field], 1
jl LABEL_NotTrue
LABEL_True:
.. logic where value evaluated to true ..
LABEL_NotTrue:
cmp byte ptr [ptr_to_field], 0
jl LABEL_NeedToEvaluate
LABEL_False:
.. logic where value evaluated to false ..
LABEL_NeedToEvaluate:
call Evaluate
cmp byte ptr [ptr_to_field], 0
ja LABEL_True
jmp LABEL_False
This means that in the common case of the value being already-evaluated and equal to true, no jumps at all are taken.
In the less common case of the value being already-evaluated and equal to false, only a single jcc is taken.
And in the uncommon case of the value not yet being evaluated, multiple jumps are taken.
} | ||
|
||
fixed (char* pSource = &MemoryMarshal.GetReference(source)) | ||
fixed (char* pResult = &MemoryMarshal.GetReference(destination)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the minor saving from avoiding the pinning really worth it here? The byref-arithmetic is very hard to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't affect benchmark perf significantly. This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.
Is the concern the syntax? We could add an internal method similar to the following, which would make the call sites look much nicer, while minimizing the number of pinned objects.
// new internal API on Unsafe class
static internal ref T ReadUnaligned<T>(ref T @base, nuint index, nuint displacementBytes) { /* ... */ }
// old call site
// mov tmp, dword ptr [source + 2 * currIdx + 4]
tempValue = Unsafe.ReadUnaligned<uint>(ref Unsafe.As<char, byte>(ref Unsafe.AddByteOffset(ref Unsafe.Add(ref source, (nint)currIdx), 4)));
// new call site
// mov tmp, dword ptr [source + 2 * currIdx + 4]
tempValue = Unsafe.ReadUnaligned<char, uint>(ref source, (nint)currIdx, 4);
Alternatively, if we're ok with the GC impact of pinning objects more than necessary, we can just go back to raw pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.
Short-term pinning is not really a problem for the GC. We have it everywhere - the GC is tuned to deal with it well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As expected, the benchmark doesn't change when we go back to using pointers.
I spoke briefly with Maoni and asked for guidance on when it would make sense to use pointers vs. when it would make sense to avoid pinning. She gave a good rule of thumb: try to avoid pinning if you're going to be running your logic for an extended period of time, and try to avoid pinning if you're working with potentially large data sets.
Since this method is quick and since we expect input strings to be small, I agree with your assessment that avoiding pinning doesn't really buy us much. Will revert.
#else // BIT64 | ||
using nuint = System.UInt32; | ||
using nint = System.Int32; | ||
#endif // BIT64 | ||
|
||
namespace System.Globalization | ||
{ | ||
public partial class TextInfo : ICloneable, IDeserializationCallback | ||
{ | ||
private enum Tristate : byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the underlying type rather be sbyte
since you are taking advantage of the negative value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per your other feedback, no longer using the negative value, so this is moot.
src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs
Outdated
Show resolved
Hide resolved
False, | ||
True = 1, | ||
False = 0, | ||
NotInitialized = 0x80, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need to change NonInitalized to be 0x80? It is kind of nice for NonInitalized to be 0 for flags like these.
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
internal static bool DWordAllCharsAreAscii(uint value) | ||
{ | ||
return (value & ~0x007F007Fu) == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There aren't any endianness issues here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - there are no endianness issues here. The reason is that the underlying char
is already stored as machine-endian.
Consider a string with the UTF-16 code units [ U+AABB U+CCDD ]
. On a big-endian machine, this will be stored in memory as the bytes 0xAA 0xBB 0xCC 0xDD, and reading this as a uint returns 0xAABBCCDD. On a little-endian machine, this will be stored in memory as the bytes 0xBB 0xAA 0xDD 0xCC, and reading this as a uint returns 0xCCDDAABB. Both of these work with the mask provided in this method.
@@ -909,5 +1009,11 @@ private static bool IsLetterCategory(UnicodeCategory uc) | |||
|| uc == UnicodeCategory.ModifierLetter | |||
|| uc == UnicodeCategory.OtherLetter); | |||
} | |||
|
|||
// A dummy struct that is used for 'ToUpper' in generic parameters | |||
private readonly struct ToUpperConversion { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also do this by using aliases for existing types:
using ToUpperConversion = System.UInt32;
using ToLowerConversion = System.Int32;
It is a micro-optimization. Feel free to ignore this if you think explicit types are better.
This PR improves the performance of the span-based
ToUpper
/ToUpperInvariant
/ToLower
/ToLowerInvariant
methods. Perf results are provided in the table below.The testbed generates a random all-ASCII string of the specified length (using a predictable RNG with a constant seed), then calls the span-based
ToUpperInvariant
to write the result into a scratch buffer. The random string has a mix of lowercase and uppercase characters. I also ran the test forToUpper(..., <tr-TR culture>)
andToUpperInvariant(<string containing non-ASCII chars>, ...)
and saw no noticeable difference from the existing in-box implementation, as expected. The logic continues to fall back to the native localization tables in those cases.Various notes for this PR that will make reviewing easier:
TConversion
leverages the JIT's generic specialization logic to output two different codegens: one forToUpper
, one forToLower
. This pattern allows us to get away with having only one copy of this logic in source.IsAsciiCasingSameAsInvariant
having already been computed is inlined into the caller. This saves a method call and some other work in the common case.ref T Unsafe.Add<T>(ref T, nint elementOffset)
(not exposed via the reference assemblies). Having this specific overload ofAdd<T>
makes certain scenarios easier and prevents us from having to perform theAddByteOffset
calculation manually.Utf16Utility
type is meant for bitwise inspection of UTF-16 data. It's a partial port of the type from thefeature/utf8string
feature branch. I considered putting these APIs on an existing type, but no type really stood out as a good candidate. Future PRs will add more APIs to this type. (Thestring.ToUpper
andstring.GetHashCode
PRs add methods to this.)No APIs are introduced by this PR. I do not expect a corresponding corefx PR other than the standard auto-mirror.
There is some slight refactoring introduced here. Other coming PRs (
string.ToUpper
/string.GetHashCode
) ride on top of this refactoring. Much of the upcoming UTF-8 logic also rides on top of this refactoring.