This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve performance of span-based ToUpper and related APIs #20275

Merged
merged 2 commits into from
Oct 6, 2018

@GrabYourPitchforks (Member) commented Oct 5, 2018

This PR improves the performance of the span-based ToUpper / ToUpperInvariant / ToLower / ToLowerInvariant methods. Perf results are provided in the table below.

| Method | Toolchain | StringLength | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---:|---:|---:|---:|---:|---:|
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 0 | 26.05 ns | 0.5008 ns | 0.4684 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 0 | 27.16 ns | 0.3857 ns | 0.3608 ns | 1.04 | 0.02 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 4 | 36.21 ns | 0.5745 ns | 0.5092 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 4 | 28.94 ns | 0.4919 ns | 0.4360 ns | 0.80 | 0.02 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 5 | 36.27 ns | 0.4766 ns | 0.3980 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 5 | 29.20 ns | 0.4267 ns | 0.3783 ns | 0.81 | 0.01 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 6 | 36.40 ns | 0.3273 ns | 0.3061 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 6 | 29.13 ns | 0.2871 ns | 0.2397 ns | 0.80 | 0.01 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 8 | 38.01 ns | 0.8254 ns | 1.3092 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 8 | 30.76 ns | 0.7869 ns | 1.0504 ns | 0.81 | 0.04 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 12 | 42.06 ns | 0.9044 ns | 0.8882 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 12 | 35.21 ns | 0.5008 ns | 0.4684 ns | 0.84 | 0.02 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 100 | 133.87 ns | 2.9730 ns | 6.1398 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 100 | 75.62 ns | 1.3335 ns | 1.1821 ns | 0.57 | 0.03 |
| ToUpperInvariant_Ascii | 3.0.0-preview1-27004-04 | 1000 | 1,255.32 ns | 24.6229 ns | 38.3349 ns | 1.00 | 0.00 |
| ToUpperInvariant_Ascii | local | 1000 | 488.50 ns | 5.0715 ns | 4.7439 ns | 0.39 | 0.01 |

The testbed generates a random all-ASCII string of the specified length (using a predictable RNG with a constant seed), then calls the span-based ToUpperInvariant to write the result into a scratch buffer. The random string has a mix of lowercase and uppercase characters. I also ran the test for ToUpper(..., <tr-TR culture>) and ToUpperInvariant(<string containing non-ASCII chars>, ...) and saw no noticeable difference from the existing in-box implementation, as expected. The logic continues to fall back to the native localization tables in those cases.
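
For anyone who wants to reproduce the setup, the testbed described above might look roughly like the sketch below. This is not the original benchmark code: the class name, the method names, the seed value, and the letters-only alphabet are all assumptions made for illustration. The only real API used is the span-based `MemoryExtensions.ToUpperInvariant`.

```csharp
using System;

public static class ToUpperTestbed
{
    // Hypothetical helper: builds a random all-ASCII string with a mix of
    // lowercase and uppercase letters. The constant seed (an arbitrary value
    // chosen here) keeps the input the same from run to run.
    public static string CreateRandomAsciiString(int length, int seed = 12345)
    {
        var rng = new Random(seed);
        var chars = new char[length];
        for (int i = 0; i < chars.Length; i++)
        {
            char letter = (char)('a' + rng.Next(26));
            chars[i] = rng.Next(2) == 0 ? letter : char.ToUpperInvariant(letter);
        }
        return new string(chars);
    }

    // The operation under measurement: span-based ToUpperInvariant writing
    // into a caller-provided scratch buffer. Returns the number of chars
    // written, or -1 if the scratch buffer is too small.
    public static int Convert(string input, char[] scratch)
    {
        return input.AsSpan().ToUpperInvariant(scratch);
    }
}
```

A benchmark harness would call `Convert` repeatedly with a preallocated scratch buffer so that allocation does not show up in the measurement.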

Various notes for this PR that will make reviewing easier:

  • TConversion leverages the JIT's generic specialization logic to output two different codegens: one for ToUpper, one for ToLower. This pattern allows us to get away with having only one copy of this logic in source.
  • The case of IsAsciiCasingSameAsInvariant having already been computed is inlined into the caller. This saves a method call and some other work in the common case.
  • This adds ref T Unsafe.Add<T>(ref T, nint elementOffset) (not exposed via the reference assemblies). Having this specific overload of Add<T> makes certain scenarios easier and prevents us from having to perform the AddByteOffset calculation manually.
  • The internal Utf16Utility type is meant for bitwise inspection of UTF-16 data. It's a partial port of the type from the feature/utf8string feature branch. I considered putting these APIs on an existing type, but no type really stood out as a good candidate. Future PRs will add more APIs to this type. (The string.ToUpper and string.GetHashCode PRs add methods to this.)
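
The TConversion pattern from the first note can be sketched as follows. This is a simplified illustration, not the PR's code: `CaseChanger` and `ChangeCaseAscii` are hypothetical names, and the body handles a single ASCII char instead of a whole buffer.

```csharp
using System;

public static class CaseChanger
{
    // Marker types: they carry no data and exist only to select a code path.
    public readonly struct ToUpperConversion { }
    public readonly struct ToLowerConversion { }

    public static char ChangeCaseAscii<TConversion>(char c) where TConversion : struct
    {
        // For struct type arguments the JIT compiles a separate method body
        // per instantiation, so this typeof comparison is a constant within
        // each body and the branch that is not taken can be eliminated.
        if (typeof(TConversion) == typeof(ToUpperConversion))
        {
            return (c >= 'a' && c <= 'z') ? (char)(c - 0x20) : c;
        }
        else
        {
            return (c >= 'A' && c <= 'Z') ? (char)(c + 0x20) : c;
        }
    }
}
```

Call sites pick the variant through the type argument, for example `CaseChanger.ChangeCaseAscii<CaseChanger.ToUpperConversion>('a')`, while the source contains only one copy of the logic.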

No APIs are introduced by this PR. I do not expect a corresponding corefx PR other than the standard auto-mirror.

There is some slight refactoring introduced here. Other coming PRs (string.ToUpper / string.GetHashCode) ride on top of this refactoring. Much of the upcoming UTF-8 logic also rides on top of this refactoring.

{
*b++ = ToLowerAsciiInvariant(*a++);
length++;
goto NonAsciiSkipTwoChars;
Member Author

The idea behind these gotos is to keep execution flow within the central loop as tight as possible. In the common case (all ASCII data), the CPU just executes through the instructions sequentially; in the uncommon case (non-ASCII data), the CPU evaluates and follows a jcc instruction.

Putting logic like currIdx += 2; inside the if block would have introduced jmp instructions in the common case, which I tried to avoid.

ChangeCaseCommon<TConversion>(ref MemoryMarshal.GetReference(source), ref MemoryMarshal.GetReference(destination), source.Length);
}

private void ChangeCaseCommon<TConversion>(ref char source, ref char destination, int charCount) where TConversion : struct
Member Author

Using 4 parameters (this + 3 explicit) allows the x64 calling convention to pass every parameter in a register, with no stack spilling.

internal static partial class Utf16Utility
{
/// <summary>
/// Returns true iff the DWORD represents two ASCII UTF-16 characters in machine endianness.
Member

DWORD is a Windowsism. This should say uint or UInt32.

}

fixed (char* pSource = &MemoryMarshal.GetReference(source))
fixed (char* pResult = &MemoryMarshal.GetReference(destination))
if (IsAsciiCasingSameAsInvariant)
Member Author

If there's a desire to do so, we can refactor this "Boolean that might not be initialized" logic out into a standalone type. In an ideal world, the codegen would look like the following.

cmp byte ptr [ptr_to_field], 1
jl LABEL_NotTrue
LABEL_True:
.. logic where value evaluated to true ..

LABEL_NotTrue:
cmp byte ptr [ptr_to_field], 0
jl LABEL_NeedToEvaluate
LABEL_False:
.. logic where value evaluated to false ..

LABEL_NeedToEvaluate:
call Evaluate
cmp byte ptr [ptr_to_field], 0
ja LABEL_True
jmp LABEL_False

This means that in the common case of the value being already-evaluated and equal to true, no jumps at all are taken.
In the less common case of the value being already-evaluated and equal to false, only a single jcc is taken.
And in the uncommon case of the value not yet being evaluated, multiple jumps are taken.
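
At the source level, a standalone "Boolean that might not be initialized" type could look something like the sketch below. `CachedFlag` is a hypothetical name, the enum values are an assumption, and nothing here guarantees that the JIT emits the branch layout shown in the assembly above. It only shows the check order (true first, then false, then evaluate).

```csharp
using System;

public sealed class CachedFlag
{
    // Assumed value assignment; NotInitialized is 0 so that a zeroed field
    // starts out in the "not yet evaluated" state.
    private enum Tristate : byte { NotInitialized = 0, True, False }

    private readonly Func<bool> _evaluate;
    private Tristate _state; // defaults to NotInitialized

    public CachedFlag(Func<bool> evaluate) => _evaluate = evaluate;

    public bool Value
    {
        get
        {
            // Common case first: already evaluated and true.
            if (_state == Tristate.True) return true;
            // Less common: already evaluated and false.
            if (_state == Tristate.False) return false;
            // Uncommon: first access, so compute and cache the answer.
            return EvaluateAndCache();
        }
    }

    private bool EvaluateAndCache()
    {
        bool result = _evaluate();
        _state = result ? Tristate.True : Tristate.False;
        return result;
    }
}
```

The slow path lives in its own method so the property getter stays small, which is the same intent as keeping the already-evaluated paths free of calls.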

}

fixed (char* pSource = &MemoryMarshal.GetReference(source))
fixed (char* pResult = &MemoryMarshal.GetReference(destination))
Member

Is the minor saving from avoiding the pinning really worth it here? The byref-arithmetic is very hard to read.

Member Author

It doesn't affect benchmark perf significantly. This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.

Is the concern the syntax? We could add an internal method similar to the following, which would make the call sites look much nicer, while minimizing the number of pinned objects.

// new internal API on Unsafe class
internal static TTo ReadUnaligned<TFrom, TTo>(ref TFrom @base, nint index, nint displacementBytes) { /* ... */ }

// old call site
// mov tmp, dword ptr [source + 2 * currIdx + 4]
tempValue = Unsafe.ReadUnaligned<uint>(ref Unsafe.As<char, byte>(ref Unsafe.AddByteOffset(ref Unsafe.Add(ref source, (nint)currIdx), 4)));

// new call site
// mov tmp, dword ptr [source + 2 * currIdx + 4]
tempValue = Unsafe.ReadUnaligned<char, uint>(ref source, (nint)currIdx, 4);

Alternatively, if we're ok with the GC impact of pinning objects more than necessary, we can just go back to raw pointers.

Member

> This is mainly to make the GC happier by keeping fewer objects pinned just in case a GC occurs.

Short-term pinning is not really a problem for the GC. We have it everywhere - the GC is tuned to deal with it well.

Member Author

As expected, the benchmark doesn't change when we go back to using pointers.

I spoke briefly with Maoni and asked for guidance on when it would make sense to use pointers vs. when it would make sense to avoid pinning. She gave a good rule of thumb: try to avoid pinning if you're going to be running your logic for an extended period of time, and try to avoid pinning if you're working with potentially large data sets.

Since this method is quick and since we expect input strings to be small, I agree with your assessment that avoiding pinning doesn't really buy us much. Will revert.

#else // BIT64
using nuint = System.UInt32;
using nint = System.Int32;
#endif // BIT64

namespace System.Globalization
{
public partial class TextInfo : ICloneable, IDeserializationCallback
{
private enum Tristate : byte
Member

Should the underlying type rather be sbyte since you are taking advantage of the negative value?

Member Author

Per your other feedback, no longer using the negative value, so this is moot.

False,
True = 1,
False = 0,
NotInitialized = 0x80,
Member

Do you really need to change NotInitialized to be 0x80? It is kind of nice for NotInitialized to be 0 for flags like these.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool DWordAllCharsAreAscii(uint value)
{
return (value & ~0x007F007Fu) == 0;
Member

There aren't any endianness issues here?

Member Author

Correct - there are no endianness issues here. The reason is that the underlying char is already stored as machine-endian.

Consider a string with the UTF-16 code units [ U+AABB U+CCDD ]. On a big-endian machine, this will be stored in memory as the bytes 0xAA 0xBB 0xCC 0xDD, and reading this as a uint returns 0xAABBCCDD. On a little-endian machine, this will be stored in memory as the bytes 0xBB 0xAA 0xDD 0xCC, and reading this as a uint returns 0xCCDDAABB. Both of these work with the mask provided in this method.
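
The argument can be checked with a small sketch. The mask expression is the one from the method under review; `AsciiMaskDemo` and the two packing helpers are hypothetical names that simulate what a 32-bit read of two adjacent chars would produce on each kind of machine.

```csharp
using System;

public static class AsciiMaskDemo
{
    // Same mask as the method under review: any bit set outside the low
    // 7 bits of either 16-bit half means at least one char is non-ASCII.
    public static bool AllCharsAreAscii(uint value) => (value & ~0x007F007Fu) == 0;

    // What a little-endian machine sees: the first char lands in the low half.
    public static uint PackLittleEndian(char first, char second) =>
        (uint)first | ((uint)second << 16);

    // What a big-endian machine sees: the first char lands in the high half.
    public static uint PackBigEndian(char first, char second) =>
        ((uint)first << 16) | (uint)second;
}
```

Because the mask is symmetric across the two 16-bit halves, swapping which half holds which char cannot change the result, so both packings give the same answer for any pair of chars.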

@@ -909,5 +1009,11 @@ private static bool IsLetterCategory(UnicodeCategory uc)
|| uc == UnicodeCategory.ModifierLetter
|| uc == UnicodeCategory.OtherLetter);
}

// A dummy struct that is used for 'ToUpper' in generic parameters
private readonly struct ToUpperConversion { }
Member

You can also do this by using aliases for existing types:

using ToUpperConversion = System.UInt32;
using ToLowerConversion = System.Int32;

It is a micro-optimization. Feel free to ignore this if you think explicit types are better.

A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
4 participants