Improve performance of span-based ToUpper and related APIs #20275

GrabYourPitchforks · 2018-10-05T01:15:31Z

This PR improves the performance of the span-based ToUpper / ToUpperInvariant / ToLower / ToLowerInvariant methods. Perf results are provided in the table below.

Method	Toolchain	StringLength	Mean	Error	StdDev	Scaled	ScaledSD
ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	0	26.05 ns	0.5008 ns	0.4684 ns	1.00	0.00
ToUpperInvariant_Ascii	local	0	27.16 ns	0.3857 ns	0.3608 ns	1.04	0.02

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	4	36.21 ns	0.5745 ns	0.5092 ns	1.00	0.00
ToUpperInvariant_Ascii	local	4	28.94 ns	0.4919 ns	0.4360 ns	0.80	0.02

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	5	36.27 ns	0.4766 ns	0.3980 ns	1.00	0.00
ToUpperInvariant_Ascii	local	5	29.20 ns	0.4267 ns	0.3783 ns	0.81	0.01

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	6	36.40 ns	0.3273 ns	0.3061 ns	1.00	0.00
ToUpperInvariant_Ascii	local	6	29.13 ns	0.2871 ns	0.2397 ns	0.80	0.01

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	8	38.01 ns	0.8254 ns	1.3092 ns	1.00	0.00
ToUpperInvariant_Ascii	local	8	30.76 ns	0.7869 ns	1.0504 ns	0.81	0.04

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	12	42.06 ns	0.9044 ns	0.8882 ns	1.00	0.00
ToUpperInvariant_Ascii	local	12	35.21 ns	0.5008 ns	0.4684 ns	0.84	0.02

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	100	133.87 ns	2.9730 ns	6.1398 ns	1.00	0.00
ToUpperInvariant_Ascii	local	100	75.62 ns	1.3335 ns	1.1821 ns	0.57	0.03

ToUpperInvariant_Ascii	3.0.0-preview1-27004-04	1000	1,255.32 ns	24.6229 ns	38.3349 ns	1.00	0.00
ToUpperInvariant_Ascii	local	1000	488.50 ns	5.0715 ns	4.7439 ns	0.39	0.01

The testbed generates a random all-ASCII string of the specified length (using a predictable RNG with a constant seed), then calls the span-based ToUpperInvariant to write the result into a scratch buffer. The random string has a mix of lowercase and uppercase characters. I also ran the test for ToUpper(..., <tr-TR culture>) and ToUpperInvariant(<string containing non-ASCII chars>, ...) and saw no noticeable difference from the existing in-box implementation, as expected. The logic continues to fall back to the native localization tables in those cases.

Various notes for this PR that will make reviewing easier:

TConversion leverages the JIT's generic specialization logic to output two different codegens: one for ToUpper, one for ToLower. This pattern allows us to get away with having only one copy of this logic in source.
The case of IsAsciiCasingSameAsInvariant having already been computed is inlined into the caller. This saves a method call and some other work in the common case.
This adds ref T Unsafe.Add<T>(ref T, nint elementOffset) (not exposed via the reference assemblies). Having this specific overload of Add<T> makes certain scenarios easier and prevents us from having to perform the AddByteOffset calculation manually.
The internal Utf16Utility type is meant for bitwise inspection of UTF-16 data. It's a partial port of the type from the feature/utf8string feature branch. I considered putting these APIs on an existing type, but no type really stood out as a good candidate. Future PRs will add more APIs to this type. (The string.ToUpper and string.GetHashCode PRs add methods to this.)

No APIs are introduced by this PR. I do not expect a corresponding corefx PR other than the standard auto-mirror.

There is some slight refactoring introduced here. Other coming PRs (string.ToUpper / string.GetHashCode) ride on top of this refactoring. Much of the upcoming UTF-8 logic also rides on top of this refactoring.

GrabYourPitchforks · 2018-10-05T01:21:35Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

                        {
-                            *b++ = ToLowerAsciiInvariant(*a++);
-                            length++;
+                            goto NonAsciiSkipTwoChars;


The idea behind these gotos is to keep execution flow within the central loop as tight as possible. In the common case (all ASCII data), the CPU just executes through the instructions sequentially; in the uncommon case (non-ASCII data), the CPU evaluates and follows a jcc instruction.

Putting logic like currIdx += 2; inside the if block would've introduced jmp statements in the common case, which I tried to avoid.

GrabYourPitchforks · 2018-10-05T01:23:24Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

+            ChangeCaseCommon<TConversion>(ref MemoryMarshal.GetReference(source), ref MemoryMarshal.GetReference(destination), source.Length);
+        }
+
+        private void ChangeCaseCommon<TConversion>(ref char source, ref char destination, int charCount) where TConversion : struct


Using 4 parameters (this + 3 explicit) allows x64 calling convention to enregister all parameters without stack-spilling.

src/System.Private.CoreLib/shared/Internal/Runtime/CompilerServices/Unsafe.cs

jkotas · 2018-10-05T01:28:13Z

src/System.Private.CoreLib/shared/System/Text/Utf16Utility.cs

+    internal static partial class Utf16Utility
+    {
+        /// <summary>
+        /// Returns true iff the DWORD represents two ASCII UTF-16 characters in machine endianness.


DWORD is Windowsism. This should say uint or UInt32.

GrabYourPitchforks · 2018-10-05T01:32:30Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

            }

-            fixed (char* pSource = &MemoryMarshal.GetReference(source))
-            fixed (char* pResult = &MemoryMarshal.GetReference(destination))
+            if (IsAsciiCasingSameAsInvariant)


If there's a desire to do so, we can refactor this "Boolean that might not be initialized" logic out into a standalone type. In an ideal world, the codegen would look like the following.

cmp byte ptr [ptr_to_field], 1 jl LABEL_NotTrue LABEL_True: .. logic where value evaluated to true .. LABEL_NotTrue: cmp byte ptr [ptr_to_field], 0 jl LABEL_NeedToEvaluate LABEL_False: .. logic where value evaluated to false .. LABEL_NeedToEvaluate: call Evaluate cmp byte ptr [ptr_to_field], 0 ja LABEL_True jmp LABEL_False

This means that in the common case of the value being already-evaluated and equal to true, no jumps at all are taken.
In the less common case of the value being already-evaluated and equal to false, only a single jcc is taken.
And in the uncommon case of the value not yet being evaluated, multiple jumps are taken.

jkotas · 2018-10-05T01:39:43Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

            }

-            fixed (char* pSource = &MemoryMarshal.GetReference(source))
-            fixed (char* pResult = &MemoryMarshal.GetReference(destination))


Is the minor saving from avoiding the pinning really worth it here? The byref-arithmetic is very hard to read.

It doesn't affect benchmark perf significantly. This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.

Is the concern the syntax? We could add an internal method similar to the following, which would make the call sites look much nicer, while minimizing the number of pinned objects.

// new internal API on Unsafe class static internal ref T ReadUnaligned<T>(ref T @base, nuint index, nuint displacementBytes) { /* ... */ } // old call site // mov tmp, dword ptr [source + 2 * currIdx + 4] tempValue = Unsafe.ReadUnaligned<uint>(ref Unsafe.As<char, byte>(ref Unsafe.AddByteOffset(ref Unsafe.Add(ref source, (nint)currIdx), 4))); // new call site // mov tmp, dword ptr [source + 2 * currIdx + 4] tempValue = Unsafe.ReadUnaligned<char, uint>(ref source, (nint)currIdx, 4);

Alternatively, if we're ok with the GC impact of pinning objects more than necessary, we can just go back to raw pointers.

This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.

Short-term pinning is not really a problem for the GC. We have it everywhere - the GC is tuned to deal with it well.

As expected, the benchmark doesn't change when we go back to using pointers.

I spoke briefly with Maoni and asked for guidance on when it would make sense to use pointers vs. when it would make sense to avoid pinning. She gave a good rule of thumb: try to avoid pinning if you're going to be running your logic for an extended period of time, and try to avoid pinning if you're working with potentially large data sets.

Since this method is quick and since we expect input strings to be small, I agree with your assessment that avoiding pinning doesn't really buy us much. Will revert.

jkotas · 2018-10-05T01:43:01Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

+#else // BIT64
+using nuint = System.UInt32;
+using nint = System.Int32;
+#endif // BIT64

 namespace System.Globalization
 {
    public partial class TextInfo : ICloneable, IDeserializationCallback
    {
        private enum Tristate : byte


Should the underlying type rather be sbyte since you are taking advantage of the negative value?

Per your other feedback, no longer using the negative value, so this is moot.

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

jkotas · 2018-10-05T01:49:01Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

-            False,
+            True = 1,
+            False = 0,
+            NotInitialized = 0x80,


Do you really need to change NonInitalized to be 0x80? It is kind of nice for NonInitalized to be 0 for flags like these.

stephentoub · 2018-10-05T13:57:18Z

src/System.Private.CoreLib/shared/System/Text/Utf16Utility.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        internal static bool DWordAllCharsAreAscii(uint value)
+        {
+            return (value & ~0x007F007Fu) == 0;


There aren't any endianness issues here?

Correct - there are no endianness issues here. The reason is that the underlying char is already stored as machine-endian.

Consider a string with the UTF-16 code units [ U+AABB U+CCDD ]. On a big-endian machine, this will be stored in memory as the bytes 0xAA 0xBB 0xCC 0xDD, and reading this as a uint returns 0xAABBCCDD. On a little-endian machine, this will be stored in memory as the bytes 0xBB 0xAA 0xDD 0xCC, and reading this as a uint returns 0xCCDDAABB. Both of these work with the mask provided in this method.

jkotas · 2018-10-05T22:30:46Z

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs

@@ -909,5 +1009,11 @@ private static bool IsLetterCategory(UnicodeCategory uc)
                 || uc == UnicodeCategory.ModifierLetter
                 || uc == UnicodeCategory.OtherLetter);
        }
+
+        // A dummy struct that is used for 'ToUpper' in generic parameters
+        private readonly struct ToUpperConversion { }


You can also do this by using aliases for existing types:

using ToUpperConversion = System.UInt32; using ToLowerConversion = System.Int32;

It is a micro-optimization. Feel free to ignore this if you think explicit types are better.

src/System.Private.CoreLib/shared/System/Text/Utf16Utility.cs

)

Improve performance of span-based ToUpper and related APIs

dd70225

GrabYourPitchforks commented Oct 5, 2018

View reviewed changes

jkotas reviewed Oct 5, 2018

View reviewed changes

src/System.Private.CoreLib/shared/Internal/Runtime/CompilerServices/Unsafe.cs Outdated Show resolved Hide resolved

jkotas reviewed Oct 5, 2018

View reviewed changes

GrabYourPitchforks commented Oct 5, 2018

View reviewed changes

jkotas reviewed Oct 5, 2018

View reviewed changes

src/System.Private.CoreLib/shared/System/Globalization/TextInfo.cs Outdated Show resolved Hide resolved

jkotas reviewed Oct 5, 2018

View reviewed changes

stephentoub reviewed Oct 5, 2018

View reviewed changes

Respond to PR feedback

6f3647f

jkotas reviewed Oct 5, 2018

View reviewed changes

jkotas approved these changes Oct 5, 2018

View reviewed changes

GrabYourPitchforks merged commit e16ff0b into dotnet:master Oct 6, 2018

This was referenced Oct 8, 2018

Improve performance of String.ToUpper and friends #20304

Merged

Improve performance of OrdinalIgnoreCase hash code calculation #20309

Merged

AntonLapounov reviewed Oct 11, 2018

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/Utf16Utility.cs Show resolved Hide resolved

A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018

Improve performance of span-based ToUpper and related APIs (dotnet#20275

682c898

)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of span-based ToUpper and related APIs #20275

Improve performance of span-based ToUpper and related APIs #20275

GrabYourPitchforks commented Oct 5, 2018 •

edited

Loading

GrabYourPitchforks Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

stephentoub Oct 5, 2018

GrabYourPitchforks Oct 5, 2018

jkotas Oct 5, 2018

Improve performance of span-based ToUpper and related APIs #20275

Improve performance of span-based ToUpper and related APIs #20275

Conversation

GrabYourPitchforks commented Oct 5, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrabYourPitchforks commented Oct 5, 2018 •

edited

Loading