Optimize percent-encoded UTF8 processing in Uri #32552

MihaZupan · 2020-02-19T19:36:42Z

When unescaping percent-encoded non-ASCII input, we currently:

allocate a byte[] buffer
decode the entire hex encoded uri into bytes
allocate a char[] buffer
convert the bytes into chars via Utf8Encoding
analyze both buffers to see whether any characters or bytes were skipped, by converting the chars to Runes, re-encoding those as UTF-8 bytes, and comparing

This PR replaces that with a single pass that writes directly to the ValueStringBuilder, without temporary buffers.
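As a rough illustration of the single-pass idea described above, here is a minimal sketch. It is not the PR's PercentEncodingHelper: the class and method names are hypothetical, it uses StringBuilder in place of the internal ValueStringBuilder, and it assumes that only escapes decoding to bytes >= 0x80 are candidates for unescaping.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical sketch, not the PR's implementation.
public static class PercentDecodeSketch
{
    public static string UnescapeNonAscii(string input)
    {
        var dest = new StringBuilder(input.Length);
        Span<byte> buffer = stackalloc byte[4]; // a UTF-8 sequence is at most 4 bytes
        int i = 0;

        while (i < input.Length)
        {
            // Anything that is not a %XX escape with the high bit set is copied as-is.
            if (!TryReadEscapedByte(input, i, out byte first) || first < 0x80)
            {
                dest.Append(input[i++]);
                continue;
            }

            // Gather up to four consecutive non-ASCII escapes.
            buffer[0] = first;
            int count = 1;
            while (count < 4
                && TryReadEscapedByte(input, i + count * 3, out byte next)
                && next >= 0x80)
            {
                buffer[count++] = next;
            }

            OperationStatus status =
                Rune.DecodeFromUtf8(buffer.Slice(0, count), out Rune rune, out int bytesConsumed);
            int charsConsumed = Math.Max(bytesConsumed, 1) * 3; // each byte was written as "%XX"

            if (status == OperationStatus.Done)
                dest.Append(rune.ToString());           // valid sequence: emit the decoded text
            else
                dest.Append(input, i, charsConsumed);   // invalid: keep the escapes verbatim

            i += charsConsumed;
        }

        return dest.ToString();
    }

    private static bool TryReadEscapedByte(string s, int index, out byte value)
    {
        value = 0;
        if (index + 2 >= s.Length || s[index] != '%') return false;
        int high = HexValue(s[index + 1]);
        int low = HexValue(s[index + 2]);
        if (high < 0 || low < 0) return false;
        value = (byte)((high << 4) | low);
        return true;
    }

    private static int HexValue(char c) =>
        c >= '0' && c <= '9' ? c - '0' :
        c >= 'a' && c <= 'f' ? c - 'a' + 10 :
        c >= 'A' && c <= 'F' ? c - 'A' + 10 : -1;
}
```

In this sketch, escapes that do not form valid UTF-8 are copied from the input unchanged, so their original hex casing survives.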

This currently introduces a behavioral change: previously all hex characters were upper-cased, whereas now their input casing is preserved. Keeping the old behavior is a trivial change with a small perf penalty.

I should note that the current upper-casing of hex is only applied to non-ASCII characters. If the input is ASCII-only, the input casing is preserved, so after this change the behavior is the same for ASCII and non-ASCII.

Perf improves significantly whenever this unescaping path is hit.
(The allocation win applies whenever the input contains at least one non-ASCII char.)

Method	Toolchain	Mean	Ratio	Gen 0	Allocated
NewUri_Chinese	\clean\CoreRun.exe	11,644.9 ns	1.57	1.2817	5384 B
NewUri_Chinese	\new\CoreRun.exe	7,422.8 ns	1.00	0.2136	920 B

UnescapeDataString_Chinese	\clean\CoreRun.exe	9,514.7 ns	2.24	1.0986	4664 B
UnescapeDataString_Chinese	\new\CoreRun.exe	4,245.9 ns	1.00	0.0763	344 B

UnescapeDataString_Chinese_Short	\clean\CoreRun.exe	1,402.5 ns	3.03	0.1545	656 B
UnescapeDataString_Chinese_Short	\new\CoreRun.exe	462.7 ns	1.00	0.0148	64 B

UnescapeDataString_Emoji	\clean\CoreRun.exe	53,014.0 ns	2.62	9.5215	40072 B
UnescapeDataString_Emoji	\new\CoreRun.exe	20,259.3 ns	1.00	0.9460	4024 B

Updated benchmarks: #32552 (comment)

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

MihaZupan · 2020-02-19T19:41:13Z

src/libraries/System.Private.Uri/tests/FunctionalTests/IriTest.cs

@@ -110,7 +110,8 @@ public void Iri_804110_TryCreateUri_ShouldNotThrowIndexOutOfRange()

            Assert.Equal(
                "http://con.tosoco.ntosoc.com/abcdefghi/jk" + "%C8%F3%B7%A2%B7%BF%B2%FA",
-                resultUri.ToString());
+                resultUri.ToString(),
+                StringComparer.OrdinalIgnoreCase);


This is an example of where the output would differ: the casing from the input would be preserved instead of the hex chars being upper-cased.
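The comparison used in the test change above can be sketched outside of xUnit as follows. The helper name is hypothetical. Note that StringComparer.OrdinalIgnoreCase ignores casing across the whole string, not just inside the %XX escapes, so it is looser than a hex-only comparison.

```csharp
using System;

// Hypothetical helper illustrating the comparer passed to Assert.Equal in the test.
public static class EscapedComparisonSketch
{
    public static bool SameIgnoringCase(string expected, string actual) =>
        StringComparer.OrdinalIgnoreCase.Equals(expected, actual);
}
```

With this comparer, "%C8%F3" and "%c8%f3" compare equal, which is what lets the test pass whether or not the hex digits are upper-cased.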

src/libraries/System.Private.Uri/tests/UnitTests/IriEscapeUnescapeTest.cs

src/libraries/System.Private.Uri/src/System/IriHelper.cs

scalablecory · 2020-02-20T00:12:01Z

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

+            {
+                value = (value | 0x20) - 'a' + 10;
+            }
+            else if ((value - '8') <= ('9' - '8'))


You could use your helpers in the other file for this.

I manually inlined it here, as we are only interested in non-ascii ones

scalablecory · 2020-02-20T00:12:34Z

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

+            {
+                value = ((value << 4) + second) - '0';
+            }
+            else if ((uint)((second - 'A') & ~0x20) <= ('F' - 'A'))


And other places in here.
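For readers following the two diff fragments above, here is a minimal sketch of the hex-digit tricks they rely on. The names are hypothetical and this is not the PR's code. It assumes, as the inlined check on '8' and '9' suggests, that only bytes with the high bit set (first hex digit 8 through F) are of interest.

```csharp
using System;

// Hypothetical sketch of the bit tricks visible in the diff fragments.
public static class HexNibbleSketch
{
    public static bool TryDecodeHexDigit(char c, out int value)
    {
        // A single unsigned comparison covers '0'..'9'.
        if ((uint)(c - '0') <= 9)
        {
            value = c - '0';
            return true;
        }

        // Clearing bit 0x20 of the offset folds 'a'..'f' onto 'A'..'F'.
        // Characters below 'A' give a negative offset, which is huge as uint.
        int offset = (c - 'A') & ~0x20;
        if ((uint)offset <= 'F' - 'A')
        {
            value = offset + 10;
            return true;
        }

        value = 0;
        return false;
    }

    // Accepts only escapes that decode to a non-ASCII byte (0x80..0xFF),
    // which means the first hex digit must be 8 or above.
    public static bool TryDecodeNonAsciiByte(char first, char second, out byte value)
    {
        value = 0;
        if (!TryDecodeHexDigit(first, out int high) || high < 8) return false;
        if (!TryDecodeHexDigit(second, out int low)) return false;
        value = (byte)((high << 4) | low);
        return true;
    }
}
```

The general-purpose helper is kept separate from the non-ASCII check here only for readability; the review thread notes the PR inlines the check because only non-ASCII bytes matter on that path.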

src/libraries/System.Private.Uri/src/System/IriHelper.cs

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

GrabYourPitchforks · 2020-02-20T06:18:01Z

Removed these UTF8SequenceTests as the method doesn't exist anymore. Will add similar edge-case utf8 tests targeting PercentEncodingHelper to this PR later

PR generally looks good, thanks! :)

Waiting for the re-introduction of the removed tests in the meantime.

MihaZupan · 2020-02-20T15:58:35Z

I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.

src/libraries/System.Private.Uri/src/System/IriHelper.cs

lpereira · 2020-02-20T18:57:23Z

src/libraries/System.Private.Uri/src/System/IriHelper.cs

+                            dest.Append(pInput[i++]);
+                            dest.Append(pInput[i++]);
+                            dest.Append(pInput[i]);


Each Append will perform a bounds check. Can you use the Append(char *, int) overload here?

That overload will do a Span slice as well. I was thinking of adding Append(char, char) and Append(char, char, char) overloads as they can have a measurable perf impact (that would be part of a separate PR addressing #22903).

Append(char, char) and Append(char, char, char) seem quite awkward API choices, IMHO. Maybe make the Append(char *, int) overload not create a Span slice?

src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs

MihaZupan · 2020-03-13T18:38:55Z

@stephentoub @GrabYourPitchforks @dotnet/ncl
Any objections against merging this as-is? Otherwise please approve.

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

GrabYourPitchforks · 2020-04-29T00:06:09Z

src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs

+
+            if (Rune.DecodeFromUtf8(fourByteSpan.Slice(0, bytesLeftInBuffer), out Rune rune, out bytesConsumed) == OperationStatus.Done)
+            {
+                Debug.Assert(bytesConsumed >= 2);


It is theoretically possible to violate this assertion. I'm having trouble following the logic of this method so I don't know if an assertion violation will end up possibly buffer overrunning or otherwise having a negative effect.

Entry to method:
  char* input = "%FA%FB%00"
  fourByteBuffer = 0
  bytesLeftInBuffer = 0
  totalCharsConsumed = 0
  charsToCopy = 0
  bytesConsumed = 0

<after a few rounds of ReadByteFromInput>
  fourByteBuffer = 0xFBFA0000 (buffer = [ 00 00 FA FB ])
  bytesLeftInBuffer = 2

<go to NoMoreOrInvalidInput>
  fourByteBuffer = 0x0000FBFA (buffer = [ FA FB 00 00 ])
  bytesLeftInBuffer = 2

<go to DecodeRune>
  DecodeFromUtf8 = InvalidData
  bytesConsumed = 1
  charsToCopy = 3

<go to AfterDecodeRune>
  bytesLeftInBuffer = 1
  totalCharsConsumed = 3

## meanwhile, another thread changes 'input' to read "%FA%FB%FC" ##

<go to RefillBuffer>
  i = 3 + (1 * 3) = 6

<go to ReadByteFromInput>
  fourByteBuffer = 0xFC0000FB (buffer = [ FB 00 00 FC ])
  bytesLeftInBuffer = 2

<go to NoMoreOrInvalidInput>
  ## recall: bytesConsumed is still 1 from earlier DecodeRune call
  fourByteBuffer = 0x00FC0000 (buffer = [ 00 00 FC 00 ])
  bytesLeftInBuffer = 2

<go to DecodeRune>
  DecodeFromUtf8 = Done
  bytesConsumed = 1
  Debug.Assert(bytesConsumed >= 2); // assertion would fail but not present in release branch
  dest.Append(input + 3 - 3, 3); // copy 3 chars
  charsToCopy = 0
  charsToCopy = 3

<go to AfterDecodeRune>
  ...

I believe there would be no negative side-effects in this case. While the assert would fail if present, we rely on Rune.DecodeFromUtf8 to always consume at least 1 byte in each iteration, so the loop will eventually exit. bytesConsumed being 1 instead of >= 2 will not impact memory accesses into the input.
I added a comment to this assert indicating the scenario under which it may fail.

In general, is the scenario of a string instance being modified after creation something that we should be/are considering? I would be surprised if there aren't other APIs across runtime that make the assumption of string immutability with worse side-effects.
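The two Rune.DecodeFromUtf8 outcomes that the trace above depends on can be observed in isolation with a small sketch. The wrapper class is hypothetical; only Rune.DecodeFromUtf8 and OperationStatus come from the framework.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical wrapper that reports the status and the number of bytes consumed.
public static class RuneDecodeSketch
{
    public static (OperationStatus Status, int BytesConsumed) Decode(params byte[] bytes)
    {
        OperationStatus status = Rune.DecodeFromUtf8(bytes, out _, out int bytesConsumed);
        return (status, bytesConsumed);
    }
}
```

Calling Decode(0xFA, 0xFB) reports InvalidData with one byte consumed, matching the first DecodeRune step in the trace. Calling Decode(0x00, 0x00) reports Done with one byte consumed, which is the case where bytesConsumed >= 2 does not hold. In both cases at least one byte is consumed, which is the property the loop relies on to make progress.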

MihaZupan · 2020-05-12T14:21:57Z

The recent changes improve the throughput by another 10-20% depending on input.

Method	Toolchain	Mean	Ratio	Gen 0	Allocated
UnescapeDataString_Chineese	\base\CoreRun.exe	9.373 us	2.62	1.1139	4664 B
UnescapeDataString_Chineese	\new\CoreRun.exe	3.586 us	1.00	0.0801	344 B

UnescapeDataString_Emoji	\base\CoreRun.exe	55.043 us	2.94	9.5215	40074 B
UnescapeDataString_Emoji	\new\CoreRun.exe	18.711 us	1.00	0.9460	4025 B

stephentoub · 2020-06-20T19:31:04Z

@MihaZupan, what's the next step here? Are you waiting for reviews? Is it ready to merge?

stephentoub · 2020-09-14T21:26:04Z

@GrabYourPitchforks, I assume no response means you're good with this now? Thanks.

stephentoub · 2020-10-05T13:08:50Z

@MihaZupan, can you rebase this to resolve the conflicts? Thanks.

danmoseley · 2020-10-12T23:43:03Z

Noticed this is the oldest Microsoft PR in the repo now.

@GrabYourPitchforks any remaining feedback or is this ready to go?

danmoseley · 2020-10-12T23:43:48Z

@scalablecory is your feedback addressed? Trying to get our 90th percentile PR age down..

MihaZupan · 2020-12-07T16:10:24Z

Test failures are unrelated.

Sent an email to @GrabYourPitchforks about finalizing the review here, otherwise I will be closing the PR until we can get someone to take a look at it.


MihaZupan added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Net labels Feb 19, 2020

MihaZupan added this to the 5.0 milestone Feb 19, 2020

MihaZupan requested review from GrabYourPitchforks, stephentoub and a team February 19, 2020 19:36


MihaZupan removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 22, 2020

GrabYourPitchforks reviewed Feb 24, 2020


MihaZupan closed this Mar 10, 2020

MihaZupan reopened this Mar 10, 2020

stephentoub reviewed Apr 8, 2020


This was referenced Apr 12, 2020

Remove excess allocations in Uri.ReCreateParts #34864

Merged

[Uri] A lightweight alternative for System.Uri #34873

Closed

MihaZupan requested a review from GrabYourPitchforks April 28, 2020 13:42

GrabYourPitchforks reviewed Apr 29, 2020


ericstj mentioned this pull request Apr 29, 2020

Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed

karelz modified the milestones: 5.0.0, 6.0.0 Aug 31, 2020

MihaZupan added 15 commits October 6, 2020 12:17

Optimize percent-encoded UTF8 processing in Uri

10d7f9c

Rename charsConsumed to bytesConsumed

4e17863

Use ValueStringBuilder Append(char*, int) instead of Append(ROS<char>)

676e30b

Add tests for PercentEncodingHelper

1870ee8

Use string literals instead of char.ConvertFromUtf32

1a4a57c

Use sizeof(uint) instead of 4

356eb36

Add missing license headers

8b1acdc

Improve codegen by using temporary local copy

a67cbf8

Correct Debug asserts

cf7d21f

Add ValueStringBuilder.Append(Rune)

398d4db

Improve hex decoding throughput

8770686

Move VSB.Append(Rune) to a Uri-specific partial VSB file

e2b5621

Add missing csproj link

15b746e

Add more comments documenting PercentEncodingHelper's logic

58332e4

Fix rebase conflicts

2a78dc8

MihaZupan force-pushed the uri-cleanup-percent-encoded-utf8 branch from 70606d0 to 2a78dc8 Compare October 6, 2020 10:46

GrabYourPitchforks reviewed Dec 7, 2020


GrabYourPitchforks approved these changes Dec 7, 2020


Merge branch 'master' into uri-cleanup-percent-encoded-utf8

ce9aafc

karelz assigned MihaZupan Dec 8, 2020

Address PR feedback

7efd732

MihaZupan merged commit e53e543 into dotnet:master Dec 14, 2020

ghost locked as resolved and limited conversation to collaborators Jan 13, 2021
