-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize percent-encoded UTF8 processing in Uri #32552
Optimize percent-encoded UTF8 processing in Uri #32552
Conversation
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs
Outdated
Show resolved
Hide resolved
@@ -110,7 +110,8 @@ public void Iri_804110_TryCreateUri_ShouldNotThrowIndexOutOfRange() | |||
|
|||
Assert.Equal( | |||
"http://con.tosoco.ntosoc.com/abcdefghi/jk" + "%C8%F3%B7%A2%B7%BF%B2%FA", | |||
resultUri.ToString()); | |||
resultUri.ToString(), | |||
StringComparer.OrdinalIgnoreCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of where the output would be different - the casing from the input would be preserved instead of upper-casing the hex chars
src/libraries/System.Private.Uri/tests/UnitTests/IriEscapeUnescapeTest.cs
Show resolved
Hide resolved
{ | ||
value = (value | 0x20) - 'a' + 10; | ||
} | ||
else if ((value - '8') <= ('9' - '8')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use your helpers in the other file for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually inlined it here, as we are only interested in non-ascii ones
{ | ||
value = ((value << 4) + second) - '0'; | ||
} | ||
else if ((uint)((second - 'A') & ~0x20) <= ('F' - 'A')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And other places in here.
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs
Outdated
Show resolved
Hide resolved
PR generally looks good, thanks! :) Waiting for the re-introduction of the removed tests in the meantime. |
I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change. |
dest.Append(pInput[i++]); | ||
dest.Append(pInput[i++]); | ||
dest.Append(pInput[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each Append
will perform a bounds check. Can you use the Append(char *, int)
overload here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That overload will do a Span slice as well. I was thinking of adding Append(char, char)
and Append(char, char, char)
overloads as they can have a measurable perf impact (that would be part of a separate PR adressing #22903).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Append(char, char)
and Append(char, char, char)
seem quite awkward API choices, IMHO. Maybe make the Append(char *, int)
overload not create a Span slice?
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs
Outdated
Show resolved
Hide resolved
@stephentoub @GrabYourPitchforks @dotnet/ncl |
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs
Outdated
Show resolved
Hide resolved
|
||
if (Rune.DecodeFromUtf8(fourByteSpan.Slice(0, bytesLeftInBuffer), out Rune rune, out bytesConsumed) == OperationStatus.Done) | ||
{ | ||
Debug.Assert(bytesConsumed >= 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is theoretically possible to violate this assertion. I'm having trouble following the logic of this method so I don't know if an assertion violation will end up possibly buffer overrunning or otherwise having a negative effect.
Entry to method:
char* input = "%FA%FB%00"
fourByteBuffer = 0
bytesLeftInBuffer = 0
totalCharsConsumed = 0
charsToCopy = 0
bytesConsumed = 0
<after a few rounds of ReadByteFromInput>
fourByteBuffer = 0xFBFA0000 (buffer = [ 00 00 FA FB ])
bytesLeftInBuffer = 2
<go to NoMoreOrInvalidInput>
fourByteBuffer = 0x0000FBFA (buffer = [ FA FB 00 00 ])
bytesLeftInBuffer = 2
<go to DecodeRune>
DecodeFromUtf8 = InvalidData
bytesConsumed = 1
charsToCopy = 3
<go to AfterDecodeRune>
bytesLeftInBuffer = 1
totalCharsConsumed = 3
## meanwhile, another thread changes 'input' to read "%FA%FB%FC" ##
<go to RefillBuffer>
i = 3 + (1 * 3) = 6
<go to ReadByteFromInput>
fourByteBuffer = 0xFC0000FB (buffer = [ FB 00 00 FC ])
bytesLeftInBuffer = 2
<go to NoMoreOrInvalidInput>
## recall: bytesConsumed is still 1 from earlier DecodeRune call
fourByteBuffer = 0x00FC0000 (buffer = [ 00 00 FC 00 ])
bytesLeftInBuffer = 2
<go to DecodeRune>
DecodeFromUtf8 = Done
bytesConsumed = 1
Debug.Assert(bytesConsumed >= 2); // assertion would fail but not present in release branch
dest.Append(input + 3 - 3, 3); // copy 3 chars
charsToCopy = 0
charsToCopy = 3
<go to AfterDecodeRune>
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there would be no negative side-effects in this case. While the assert would fail if present, we rely on Rune.DecodeFromUtf8
to always consume at least 1 byte in each iteration, so the loop will eventually exit. bytesConsumed
being 1 instead of >= 2
will not impact memory accesses into the input.
I added a comment to this assert indicating the scenario under which it may fail.
In general, is the scenario of a string instance being modified after creation something that we should be/are considering? I would be surprised if there aren't other APIs across runtime that make the assumption of string immutability with worse side-effects.
The recent changes improve the throughput by another 10-20% depending on input.
|
@MihaZupan, what's the next step here? Are you waiting for reviews? Is it ready to merge? |
@GrabYourPitchforks, I assume no response means you're good with this now? Thanks. |
@MihaZupan, can you rebase this to resolve the conflicts? Thanks. |
70606d0
to
2a78dc8
Compare
Noticed this is the oldest Microsoft PR in the repo now. @GrabYourPitchforks any remaining feedback or is this ready to go? |
@scalablecory is your feedback addressed? Trying to get our 90th percentile PR age down.. |
Test failures are unrelated. Sent an email to @GrabYourPitchforks about finalizing the review here, otherwise I will be closing the PR until we can get someone to take a look at it. |
When unescaping percent-encoded non-ascii we currently:
This PR changes it into performing a single pass, writing to the ValueStringBuilder without temporary buffers.
Currently there is a behavioral change where before all hex characters would be upper-cased, now their input-casing is preserved. Keeping the old behavior is a trivial change with a bit of a perf penalty.
I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.
Perf goes up significantly whenever this unescaping path is hit
(The allocation win is hit whenever there is a single non-ascii char in the input)
Updated benchmarks: #32552 (comment)