Fix incorrect handling of character range and capitalization in regex #42282
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley |
Do we have a microbenchmark around this that we could use to validate that the perf remains about the same? |
I'm actually waiting for the CI perf leg to tell me if it found regressions :) IIRC, @stephentoub added the regex-redux benchmarks to the performance repo?
Cool. I'd prefer not to sign off on this until we ensure there are benchmarks that hits both the |
There are a fair number of perf benchmarks now for Regex, including regex redux, but relatively few of them are going to perf test this code path (and I don't know if any will hit the else block). This is part of parsing the expression and will only be done once per expression, so benchmarks that run the same regex over and over aren't going to show a degradation in this code path.
Ah, that's interesting; thanks, @stephentoub. Is it worth trying to get a measurement that truly exercises it, or are you satisfied based on the review? Calling |
I would like to understand the worst-case impact here, e.g. measuring |
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
@stephentoub do you remember the details of the issue with the Kelvin symbol that the https://github.com/AutomataDotNet/srm folks mentioned? I wonder whether this fixes it as I vaguely remember it was related to character ranges and capitalization. |
I opened the issue this is fixing based on that discussion. |
Alright I spent a bit of time investigating this more thoroughly and it turns out that this fix is incomplete. If we consider the following RegEx:
var pattern = new Regex(@"^(?i:[\xD2-\xDC])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var match247 = pattern.IsMatch(((char)247).ToString());
For correct behavior, |
@pgovind are you special casing this character? Does that approach scale given how large the Unicode BMP is (and I assume it can change)? We already special case Turkish I, including in some cases where we shouldn't (eg., shoehorning it into the ECMAScript definition of a word when it isn't listed in the ECMAScript definition for |
Fortunately, we only need to test the ranges in |
Should that code run in a test, to protect your fixes? |
I thought about it. |
I think reflection in the tests is fine if you get good benefit and don't have another option. Nobody will break it except us - they'll discover it in CI if they do - then they can fix it or worst case delete the test, which leaves us no worse off. |
@@ -300,7 +300,8 @@ internal sealed class RegexCharClass
 private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]
 {
     new LowerCaseMapping('\u0041', '\u005A', LowercaseAdd, 32),
-    new LowerCaseMapping('\u00C0', '\u00DE', LowercaseAdd, 32),
+    new LowerCaseMapping('\u00C0', '\u00D6', LowercaseAdd, 32),
Not you, but this file in general could use some better naming/commenting. E.g., a one-line comment here explaining what each of the 4 entries is would be helpful. I look in the struct, and they're Chmin, Chmax, LcOp, Data -- which are all pretty bad names that could be improved.
Elsewhere the naming could be improved too. E.g., LowercaseBad we would normally name something like LowerCaseBitwiseAndThenAddOne or something, and we'd probably use an enum, not consts.
Not suggesting you need to fix this in this PR; in fact I think it's probably better that one of us does it separately later.
There's some spiel above this spot (L267-L293), but I agree with the comment. It wasn't obvious at all what LowercaseSet/Bor/Bad did from the name :)
How long does the validation take? If not that long, I'd go this route, putting it in a debug-only static cctor. |
}
else
{
    // Bug fix: Unicode `Symbol`s sometimes exist in the middle of character ranges. char.ToLower(Symbol) returns Symbol. In these cases, we cannot use an offset to find the lowercase chars. For ex: https://github.com/dotnet/runtime/issues/36149
Nit about the "bug fix" part and the URL --
I once read someone's description of the ideal state of a mature codebase as one where the code appeared to have been written by a single highly productive perfect person in one sitting. The idea was that the code is fully self-consistent and reads easily and intuitively. In that philosophy, a bug fix is an incremental modification designed to bring it closer to that ideal state. History matters, but that is what git blame and GitHub search exist for. Historical commentary in a codebase often doesn't age well -- it can be irrelevant, become outdated (e.g., dead links), and look scary to refactor.
I do think it's great to have comments like this where it's not obvious why the code is written a certain way, since that hypothetical perfect person would have written comments of that kind.
Tests to me are different: they're inevitably an idiosyncratic collection of independent chunks of code for particular bug fixes, etc., and links and bug IDs there make more sense.
I know this is partly a matter of taste and @stephentoub has a slightly different preference, but clearly we're just picking places on a dial, since we make many bug fixes without including GitHub URLs.
I agree about the bug fix part; I feel the same way about tests as I do about comments in code, e.g. tests labeled "regression" really bug me :-)
Well it's good that the whole else block (containing the comment and URL) is going away then :D
tests labeled "regression" really bug me :-)
That's interesting, and I shall try to avoid them. Let's make our tests look like they were written by perfect people too!
For completeness, another option some tests use is to compile selected product files into themselves. That works well for an internal utility class with a well defined API ..maybe not here. |
Alright, I think this is pretty clean now. |
LowerCaseMapping loc = s_lcTable[k];
if (loc.LcOp == 1)
{
    // Validate only the LowercaseAdd cases
Why not validate all the table while you are at it?
Good idea. Done.
Hmm, as part of validating the table I ran into this case. Results from the immediate window below:
uppercase
304 'İ'
culture.TextInfo.ToLower(uppercase)
304 'İ'
char.ToLower(uppercase) // This is what is stored in s_lcTable
105 'i'
culture here (and in the validation code in this PR) is set to CultureInfo.InvariantCulture. char.ToLower uses CurrentCulture instead. This behavior is related to #36147. Essentially, s_lcTable seems to have its values populated for "en-US".
Force-pushed from 7bde5c5 to 4d3e58d.
Thanks for the review, @stephentoub. The test failures were because .NET Framework would fail the new unit tests, so I added a commit to skip them only on .NET Framework.
Seeing this error in CI now:
The |
It's probably going through the standard console codepage. Better to have it print out the actual hex. |
Just logging what I found here:
> c01c5
'Dž'
> (int)c01c5
453
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5)
'Dž'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5) == c01c6
false
However, when I run the unit test on .NET Core and place a breakpoint, I see this in the immediate window:
> uppercase
453 'Dž'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(uppercase)
454 'dž'
Questions
Tagging @GrabYourPitchforks @tarekgh for some input? |
The behavior you are seeing with .NET Core is correct, as 01C5 maps to 01C6. CI properly didn't fail because we are using ICU for .NET Core even when running on Windows, which means ICU maps the character correctly while the full framework doesn't.
Skip it on non-ICU environments
In addition to Tarek's great response, I recommend the two tools below if you're investigating casing issues:
These tools give you a whole slew of info, including any special characteristics that these code points have and what their various-case representations are. |
Alright, so Tarek helped me figure out that the CI errors I was seeing were because CI was using a Windows 8 machine that wasn't using ICU. The easiest way to limit the validation to ICU-enabled machines was using attributes defined in TestUtilities, so I moved the validation code to a unit test and included I went ahead and merged the PR. We can always make changes if we want to later.
Fixes #36149
Description of the bug is in the issue. Fix:
AddLowercaseRange assumes that once an appropriate range of uppercase letters is found, the lowercase letters can be found by applying an offset. However, as this bug shows, this is not always the case. For example, the char '\xD7' is a Symbol whose lowercase value is still '\xD7'. I expect that such inputs are rare, so adding a simple for loop here sounds like a good fix.
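To make the difference concrete, here is a minimal sketch contrasting the offset assumption with a per-character loop. It is not the actual AddLowercaseRange implementation: the class `LowercaseRangeSketch` and the methods `ByOffset` and `PerChar` are hypothetical names used only for illustration, and `char.ToLowerInvariant` is assumed as the lowercase source.

```csharp
using System.Collections.Generic;

public static class LowercaseRangeSketch
{
    // Offset assumption: every char in [first, last] is treated as if its
    // lowercase form were (char + offset), with no per-char check.
    public static List<char> ByOffset(char first, char last, int offset)
    {
        var result = new List<char>();
        for (int c = first; c <= last; c++)
        {
            result.Add((char)(c + offset));
        }
        return result;
    }

    // Per-character loop: ask the runtime for each char's lowercase form and
    // keep it only when it differs from the original char.
    public static List<char> PerChar(char first, char last)
    {
        var result = new List<char>();
        for (int c = first; c <= last; c++)
        {
            char lower = char.ToLowerInvariant((char)c);
            if (lower != c)
            {
                result.Add(lower);
            }
        }
        return result;
    }
}
```

For the range '\xD2' to '\xDC' with offset 32, `ByOffset` includes '\xF7' (the division sign, char 247), because it shifts the multiplication sign '\xD7' along with the letters. `PerChar` skips '\xD7' and therefore never adds '\xF7'.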