Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize percent-encoded UTF8 processing in Uri #32552

Merged
merged 17 commits into from
Dec 14, 2020

Conversation

MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Feb 19, 2020

When unescaping percent-encoded non-ascii we currently:

  1. allocate a byte[] buffer
  2. decode the entire hex encoded uri into bytes
  3. allocate a char[] buffer
  4. convert the bytes into chars via Utf8Encoding
  5. analyze both buffers to see if any characters/bytes were skipped by converting chars to Runes to Utf8 bytes and comparing

This PR changes it into performing a single pass, writing to the ValueStringBuilder without temporary buffers.

Currently there is a behavioral change where before all hex characters would be upper-cased, now their input-casing is preserved. Keeping the old behavior is a trivial change with a bit of a perf penalty.

I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.

Perf goes up significantly whenever this unescaping path is hit
(The allocation win is hit whenever there is a single non-ascii char in the input)

Method Toolchain Mean Ratio Gen 0 Allocated
NewUri_Chinese \clean\CoreRun.exe 11,644.9 ns 1.57 1.2817 5384 B
NewUri_Chinese \new\CoreRun.exe 7,422.8 ns 1.00 0.2136 920 B
UnescapeDataString_Chinese \clean\CoreRun.exe 9,514.7 ns 2.24 1.0986 4664 B
UnescapeDataString_Chinese \new\CoreRun.exe 4,245.9 ns 1.00 0.0763 344 B
UnescapeDataString_Chinese_Short \clean\CoreRun.exe 1,402.5 ns 3.03 0.1545 656 B
UnescapeDataString_Chinese_Short \new\CoreRun.exe 462.7 ns 1.00 0.0148 64 B
UnescapeDataString_Emoji \clean\CoreRun.exe 53,014.0 ns 2.62 9.5215 40072 B
UnescapeDataString_Emoji \new\CoreRun.exe 20,259.3 ns 1.00 0.9460 4024 B

Updated benchmarks: #32552 (comment)

@MihaZupan MihaZupan added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Net labels Feb 19, 2020
@MihaZupan MihaZupan added this to the 5.0 milestone Feb 19, 2020
@@ -110,7 +110,8 @@ public void Iri_804110_TryCreateUri_ShouldNotThrowIndexOutOfRange()

Assert.Equal(
"http://con.tosoco.ntosoc.com/abcdefghi/jk" + "%C8%F3%B7%A2%B7%BF%B2%FA",
resultUri.ToString());
resultUri.ToString(),
StringComparer.OrdinalIgnoreCase);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of where the output would be different - the casing from the input would be preserved instead of upper-casing the hex chars

{
value = (value | 0x20) - 'a' + 10;
}
else if ((value - '8') <= ('9' - '8'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use your helpers in the other file for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually inlined it here, as we are only interested in non-ascii ones

{
value = ((value << 4) + second) - '0';
}
else if ((uint)((second - 'A') & ~0x20) <= ('F' - 'A'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And other places in here.

@GrabYourPitchforks
Copy link
Member

Removed these UTF8SequenceTests as the method doesn't exist anymore. Will add similar edge-case utf8 tests targeting PercentEncodingHelper to this PR later

PR generally looks good, thanks! :)

Waiting for the re-introduction of the removed tests in the meantime.

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 20, 2020

I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.

Comment on lines +116 to +117
dest.Append(pInput[i++]);
dest.Append(pInput[i++]);
dest.Append(pInput[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each Append will perform a bounds check. Can you use the Append(char *, int) overload here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That overload will do a Span slice as well. I was thinking of adding Append(char, char) and Append(char, char, char) overloads as they can have a measurable perf impact (that would be part of a separate PR adressing #22903).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Append(char, char) and Append(char, char, char) seem quite awkward API choices, IMHO. Maybe make the Append(char *, int) overload not create a Span slice?

@MihaZupan MihaZupan removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 22, 2020
@MihaZupan MihaZupan closed this Mar 10, 2020
@MihaZupan MihaZupan reopened this Mar 10, 2020
@MihaZupan
Copy link
Member Author

@stephentoub @GrabYourPitchforks @dotnet/ncl
Any objections against merging this as-is? Otherwise please approve.


if (Rune.DecodeFromUtf8(fourByteSpan.Slice(0, bytesLeftInBuffer), out Rune rune, out bytesConsumed) == OperationStatus.Done)
{
Debug.Assert(bytesConsumed >= 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is theoretically possible to violate this assertion. I'm having trouble following the logic of this method so I don't know if an assertion violation will end up possibly buffer overrunning or otherwise having a negative effect.

Entry to method:

char* input = "%FA%FB%00"
fourByteBuffer = 0
bytesLeftInBuffer = 0
totalCharsConsumed = 0
charsToCopy = 0
bytesConsumed = 0

<after a few rounds of ReadByteFromInput>

fourByteBuffer = 0xFBFA0000 (buffer = [ 00 00 FA FB ])
bytesLeftInBuffer = 2

<go to NoMoreOrInvalidInput>

fourByteBuffer = 0x0000FBFA (buffer = [ FA FB 00 00 ])
bytesLeftInBuffer = 2

<go to DecodeRune>

DecodeFromUtf8 = InvalidData
bytesConsumed = 1
charsToCopy = 3

<go to AfterDecodeRune>

bytesLeftInBuffer = 1
totalCharsConsumed = 3

## meanwhile, another thread changes 'input' to read "%FA%FB%FC" ##

<go to RefillBuffer>

i = 3 + (1 * 3) = 6

<go to ReadByteFromInput>

fourByteBuffer = 0xFC0000FB (buffer = [ FB 00 00 FC ])
bytesLeftInBuffer = 2

<go to NoMoreOrInvalidInput>

## recall: bytesConsumed is still 1 from earlier DecodeRune call

fourByteBuffer = 0x00FC0000 (buffer = [ 00 00 FC 00 ])
bytesLeftInBuffer = 2

<go to DecodeRune>

DecodeFromUtf8 = Done
bytesConsumed = 1
Debug.Assert(bytesConsumed >= 2); // assertion would fail but not present in release branch

dest.Append(input + 3 - 3, 3); // copy 3 chars
charsToCopy = 0
charsToCopy = 3

<go to AfterDecodeRune>

...

Copy link
Member Author

@MihaZupan MihaZupan May 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there would be no negative side-effects in this case. While the assert would fail if present, we rely on Rune.DecodeFromUtf8 to always consume at least 1 byte in each iteration, so the loop will eventually exit. bytesConsumed being 1 instead of >= 2 will not impact memory accesses into the input.
I added a comment to this assert indicating the scenario under which it may fail.

In general, is the scenario of a string instance being modified after creation something that we should be/are considering? I would be surprised if there aren't other APIs across runtime that make the assumption of string immutability with worse side-effects.

@MihaZupan
Copy link
Member Author

The recent changes improve the throughput by another 10-20% depending on input.

Method Toolchain Mean Ratio Gen 0 Allocated
UnescapeDataString_Chineese \base\CoreRun.exe 9.373 us 2.62 1.1139 4664 B
UnescapeDataString_Chineese \new\CoreRun.exe 3.586 us 1.00 0.0801 344 B
UnescapeDataString_Emoji \base\CoreRun.exe 55.043 us 2.94 9.5215 40074 B
UnescapeDataString_Emoji \new\CoreRun.exe 18.711 us 1.00 0.9460 4025 B

@stephentoub
Copy link
Member

@MihaZupan, what's the next step here? Are you waiting for reviews? Is it ready to merge?

@karelz karelz modified the milestones: 5.0.0, 6.0.0 Aug 31, 2020
@stephentoub
Copy link
Member

@GrabYourPitchforks, I assume no response means you're good with this now? Thanks.

@stephentoub
Copy link
Member

@MihaZupan, can you rebase this to resolve the conflicts? Thanks.

@MihaZupan MihaZupan force-pushed the uri-cleanup-percent-encoded-utf8 branch from 70606d0 to 2a78dc8 Compare October 6, 2020 10:46
@danmoseley
Copy link
Member

Noticed this is the oldest Microsoft PR in the repo now.

@GrabYourPitchforks any remaining feedback or is this ready to go?

@danmoseley
Copy link
Member

@scalablecory is your feedback addressed? Trying to get our 90th percentile PR age down..

@MihaZupan
Copy link
Member Author

Test failures are unrelated.

Sent an email to @GrabYourPitchforks about finalizing the review here, otherwise I will be closing the PR until we can get someone to take a look at it.

@MihaZupan MihaZupan merged commit e53e543 into dotnet:master Dec 14, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 13, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants