Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect handling of character range and capitalization in regex #42282

Merged
merged 12 commits into from
Nov 2, 2020

Conversation

pgovind
Copy link

@pgovind pgovind commented Sep 15, 2020

Fixes #36149

Description of the bug is in the issue. Fix: AddLowercaseRange assumes that once an appropriate uppercase letters range is found, the lowercase letters can be found by using an offset. However, as this bug shows, this is not always the case. For ex: The char '\xD7' is a Symbol whose lowercase value is still '\xD7'. I expect that such inputs are rare, so adding a simple for loop here sounds like a good fix.

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Sep 15, 2020

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@pgovind pgovind added this to the Future milestone Sep 15, 2020
@jeffhandley
Copy link
Member

Do we have a microbenchmark around this that we could use to validate that the perf remains about the same?

@pgovind
Copy link
Author

pgovind commented Sep 15, 2020

Do we have a microbenchmark around this that we could use to validate that the perf remains about the same?

I'm actually waiting for the CI perf leg to tell me if it found regressions :) If IIRC, @stephentoub added the regex-redex benchmarks to the performance repo?

@jeffhandley
Copy link
Member

I'm actually waiting for the CI perf leg to tell me if it found regressions

Cool. I'd prefer not to sign off on this until we ensure there are benchmarks that hits both the if and the else blocks.

@stephentoub
Copy link
Member

There are a fair number of perf benchmarks now for Regex, including regex redux, but relatively few of them are going to perf test this code path (and I don't know if any will hit the else block). This is part of parsing the expression and will only be done once per expression, so benchmarks that test running the same regex over and over aren't going to see a degradation to this exercised.

@jeffhandley
Copy link
Member

Ah, that's interesting; thanks, @stephentoub. Is it worth trying to get a measurement that truly exercises it, or are you satisfied based on the review? Calling AddRange(lowerInRange, lowerInRange); inside a loop made me curious.

@stephentoub
Copy link
Member

I would like to understand the worst-case impact here, e.g. measuring new Regex(...).IsMatch(...) with a pattern that exhibits the problematic case.

@danmoseley danmoseley changed the title Bug fix and unit test Fix incorrect handling of character range and capitalization in regex Sep 16, 2020
@danmoseley
Copy link
Member

@stephentoub do you remember the details of the issue with the Kelvin symbol that the https://github.com/AutomataDotNet/srm folks mentioned? I wonder whether this fixes it as I vaguely remember it was related to character ranges and capitalization.

@stephentoub
Copy link
Member

do you remember the details of the issue with the Kelvin symbol that the https://github.com/AutomataDotNet/srm folks mentioned?

I opened the issue this is fixing based on that discussion.

@pgovind
Copy link
Author

pgovind commented Sep 16, 2020

Alright I spent a bit of time investigating this more thoroughly and it turns out that this fix is incomplete. If we consider the following RegEx:

            var pattern = new Regex(@"^(?i:[\xD2-\xDC])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
            var match247 = pattern.IsMatch(((char)247).ToString());

For correct behavior, match247 should be false. However, we return true today (even with this PR). The reason this PR is incomplete is because the problematic char is the symbol ÷ (decimal value 215). \xD2 - \xDC is the range 210 - 220. The corresponding lowercase range is 242 - 252. However, it is NOT a 1o1 mapping. The uppercase chars 210 - 214 match the lowercase chars 242 - 246. The uppercase chars 216-220 match the lowercase chars 248-252. Uppercase char 215's lowercase char is still 215. I miss this case with the simple if (range.Last - range.First == upper - lower) check. However, an elegant fix does exist here. s_lcTable itself can be modified to split the range \u00C0 - \u00DE into \u00C0 - \u00D6 and \u00D8 - \u00DE. I'll push the proper fix and updated unit tests shortly.

@danmoseley
Copy link
Member

danmoseley commented Sep 16, 2020

@pgovind are you special casing this character? Does that approach scale given how large the Unicode BMP is (and I assume it can change)? We already special case Turkish I, including in some cases where we shouldn't (eg., shoehorning it into the ECMAScript definition of a word when it isn't listed in the ECMAScript definition for \w).

@pgovind
Copy link
Author

pgovind commented Sep 17, 2020

Does that approach scale given how large the Unicode BMP is

Fortunately, we only need to test the ranges in s_lcTable that are tagged as LowercaseAdd. I wrote some simple code to find cases where char.ToLower(AChar) != UpperCaseAChar + offset from the ranges in s_lcTable and fixed all the cases I found were wrong.

@danmoseley
Copy link
Member

I wrote some simple code to find cases where char.ToLower(AChar) != UpperCaseAChar + offset from the ranges in s_lcTable and fixed all the cases I found were wrong.

Should that code run in a test, to protect your fixes?

@pgovind
Copy link
Author

pgovind commented Sep 18, 2020

Should that code run in a test, to protect your fixes?

I thought about it. RegexCharClass.s_lcTable is private static readonly though. We usually avoid using reflection in the unit tests right? Think this qualifies for an exception? The alternative is to have a #if DEBUG {validation code} #endif.

@danmoseley
Copy link
Member

I think reflection in the tests is fine if you get good benefit and don't have another option. Nobody will break it except us - they'll discover it in CI if they do - then they can fix it or worst case delete the test, which leaves us no worse off.

@@ -300,7 +300,8 @@ internal sealed class RegexCharClass
private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]
{
new LowerCaseMapping('\u0041', '\u005A', LowercaseAdd, 32),
new LowerCaseMapping('\u00C0', '\u00DE', LowercaseAdd, 32),
new LowerCaseMapping('\u00C0', '\u00D6', LowercaseAdd, 32),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not you, but this file in general could use some better naming/commenting. Eg., a one line comment here explaining what each of the 4 entries are would be helpful. I look in the struct, and they're Chmin, Chmax, LcOp, Data -- which are all pretty bad names that could be improved.

Elsewhere the naming could be improved too. Eg., LowercaseBad we would normally name something like LowerCaseBitwiseAnThenAddOne or something, and we'd probably use an enum not consts.

Not suggesting you need to fix in this PR, in fact I think it's probably better one of us do it separately later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some spiel above this spot (L267-L293), but I agree with the comment. It wasn't obvious at all what LowercaseSet/Bor/Bad did from the name :)

@stephentoub
Copy link
Member

stephentoub commented Sep 18, 2020

The alternative is to have a #if DEBUG {validation code} #endif

How long does the validation take? If not that long, I'd go this route, putting it in a debug-only static cctor.

}
else
{
// Bug fix: Unicode `Symbol`s sometimes exist in the middle of character ranges. char.ToLower(Symbol) returns Symbol. In these cases, we cannot use an offset to find the lowercase chars. For ex: https://github.com/dotnet/runtime/issues/36149
Copy link
Member

@danmoseley danmoseley Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit about the "bug fix" part and the URL --

I once read someone's description of the ideal state of a mature codebase as one where the code appeared to have been written by a single highly productive perfect person in one sitting. The idea was the code is fully self consistent and it reads easily and intuitively. In that philosophy, a bug fix is an incremental modification designed to bring it closer to that ideal state. History matters, but that is what git blame and github search exists for. Historical commentary in a codebase often doesn't age well -- it can be irrelevant, become outdated (eg dead links), and looks scary to refactor.

I do think it's great to have comments like this where it's not obvious why the code is written a certain way - since that hypothetical perfect person would have written comments of that kind.

Tests to me are different - they're inevitably an idiosyncratic collection of independent chunks of code for particular bug fixes, etc and links and bug ID's there make more sense.

I know this is partly a matter of taste and @stephentoub has a slightly different preference, but clearly we're just picking places on a dial, since we make many bug fixes without including Github URL's

Copy link
Member

@stephentoub stephentoub Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about the bug fix part; I feel the same way about tests as I do about comments in code, e.g. tests labeled "regression" really bug me :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's good that the whole else block (containing the comment and URL) is going away then :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests labeled "regression" really bug me :-)

That's interesting, and I shall try to avoid them. Let's make our tests look like they were written by perfect people too!

@danmoseley
Copy link
Member

The alternative is to have a #if DEBUG {validation code} #endif.

For completeness, another option some tests use is to compile selected product files into themselves. That works well for an internal utility class with a well defined API ..maybe not here.

@pgovind
Copy link
Author

pgovind commented Sep 18, 2020

Alright, I think this is pretty clean now.

LowerCaseMapping loc = s_lcTable[k];
if (loc.LcOp == 1)
{
// Validate only the LowercaseAdd cases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not validate all the table while you are at it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, as part of validating the table I ran into this case. Results from the immediate window below:

uppercase
304 'İ'
culture.TextInfo.ToLower(uppercase)
304 'İ'
char.ToLower(uppercase) // This is what is stored in s_lcTable
105 'i'

culture here (and in the validation code in this PR) is set to CultureInfo.InvariantCulture. char.ToLower uses CurrentCulture instead. This behavior is related to #36147. Essentially, s_lcTable seems to have its values populated for "en-US".

@pgovind
Copy link
Author

pgovind commented Oct 26, 2020

Thanks for the review @stephentoub . The test failures were because NetFramework would fail the new unit tests, so I added a commit to skip them only on netframework.

@pgovind
Copy link
Author

pgovind commented Oct 26, 2020

Seeing this error in CI now:

Process terminated. Assertion failed.
The Unicode character range at index 38 in s_lcTable contains the character ? (decimal value: 453). Its lowercase value ? is not the stored value ?.
   at System.Text.RegularExpressions.RegexCharClass..cctor() in /_/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs:line 437
   at System.Text.RegularExpressions.RegexCharClass.IsWordChar(Char ch) in /_/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs:line 1077

The ? chars don't make much sense, they should be printing out the character 'Dž'. Maybe the editor used there doesn't support full UTF? In any case, that is not important. The validation succeeds locally on my machine but does not in CI, even though we're using CultureInfo.InvariantCulture (which should mean the validation is independent of the CI machine's local culture). Investigating ...

@danmoseley
Copy link
Member

The ? chars don't make much sense, they should be printing out the character 'Dž'. Maybe the editor used there doesn't support full UTF?

It's probably going through the standard console codepage. Better to have it print out the actual hex.

@pgovind
Copy link
Author

pgovind commented Oct 28, 2020

Just logging what I found here:
The character failing validation is 'Dž' (decimal value: 453, hex: 1C5). In my C# Interactive Window (.NET Framework), I see this:

> c01c5
'Dž'
> (int)c01c5
453
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5)
'Dž'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5) == c01c6
false

However, when I run the unit test on .NET Core and place a break point, I see this in the immediate window:

> uppercase
453 'Dž'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(uppercase)
454 'dž'

Questions

  1. Maybe Framework and Core have different behavior for System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower? I'm not sure.
  2. Even if we ignore Framework, this run on CI is on .NET Core, so we should be seeing the same behavior as .NET Core on my machine and the validation should pass...Still don't understand what is different on CI :/

Tagging @GrabYourPitchforks @tarekgh for some input?

@tarekgh
Copy link
Member

tarekgh commented Oct 28, 2020

@pgovind

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5

The behavior you are seeing with net core is correct as 01C5 maps to 01C6.
When running on C# Interactive Window, does it depend on the Ful Framework installed on your machine or it run on some host has a different version of the framework? you can try running the case mapping code on the Full Framework in your machine and look if you see the same result.

CI properly didn't fail because we are using ICU for net core even running on Windows. which mean ICU mapping the character correctly while the Full Framework doesn't.

Skip it on non-ICU environments
@GrabYourPitchforks
Copy link
Member

In addition to Tarek's great response, I recommend the two tools below if you're investigating casing issues:

These tools give you a whole slew of info, including any special characteristics that these code points have and what their various-case representations are.

@pgovind
Copy link
Author

pgovind commented Oct 30, 2020

Alright, so Tarek helped me figure out that the CI errors I was seeing was because CI was using a WIndows 8 machine that wasn't using ICU. The easiest way to limit the validation to ICU enabled machines was using attributes defined in TestUtilities, so I moved the validation code to a unit test and included RegexCharClass.MappingTable.cs in the test csproj. Since some code changed from when you approved the PR, tagging @stephentoub here. I don't expect any comments, but just in case :)

I went ahead and merged the PR. We can always make changes if we want to later.

@pgovind pgovind merged commit 7c9c347 into dotnet:master Nov 2, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex is incorrectly handling casing of some ranges
7 participants