Fix incorrect handling of character range and capitalization in regex #42282

pgovind · 2020-09-15T22:52:23Z

Description of the bug is in the issue. Fix: AddLowercaseRange assumes that once an appropriate uppercase letters range is found, the lowercase letters can be found by using an offset. However, as this bug shows, this is not always the case. For ex: The char '\xD7' is a Symbol whose lowercase value is still '\xD7'. I expect that such inputs are rare, so adding a simple for loop here sounds like a good fix.

Dotnet-GitSync-Bot · 2020-09-15T22:52:26Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2020-09-15T22:52:59Z

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

jeffhandley · 2020-09-15T23:25:05Z

Do we have a microbenchmark around this that we could use to validate that the perf remains about the same?

pgovind · 2020-09-15T23:33:39Z

Do we have a microbenchmark around this that we could use to validate that the perf remains about the same?

I'm actually waiting for the CI perf leg to tell me if it found regressions :) If IIRC, @stephentoub added the regex-redex benchmarks to the performance repo?

jeffhandley · 2020-09-15T23:53:36Z

I'm actually waiting for the CI perf leg to tell me if it found regressions

Cool. I'd prefer not to sign off on this until we ensure there are benchmarks that hits both the if and the else blocks.

stephentoub · 2020-09-16T00:34:37Z

There are a fair number of perf benchmarks now for Regex, including regex redux, but relatively few of them are going to perf test this code path (and I don't know if any will hit the else block). This is part of parsing the expression and will only be done once per expression, so benchmarks that test running the same regex over and over aren't going to see a degradation to this exercised.

jeffhandley · 2020-09-16T00:40:21Z

Ah, that's interesting; thanks, @stephentoub. Is it worth trying to get a measurement that truly exercises it, or are you satisfied based on the review? Calling AddRange(lowerInRange, lowerInRange); inside a loop made me curious.

stephentoub · 2020-09-16T02:46:44Z

I would like to understand the worst-case impact here, e.g. measuring new Regex(...).IsMatch(...) with a pattern that exhibits the problematic case.

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

danmoseley · 2020-09-16T16:36:24Z

@stephentoub do you remember the details of the issue with the Kelvin symbol that the https://github.com/AutomataDotNet/srm folks mentioned? I wonder whether this fixes it as I vaguely remember it was related to character ranges and capitalization.

stephentoub · 2020-09-16T21:55:44Z

do you remember the details of the issue with the Kelvin symbol that the https://github.com/AutomataDotNet/srm folks mentioned?

I opened the issue this is fixing based on that discussion.

pgovind · 2020-09-16T22:51:43Z

Alright I spent a bit of time investigating this more thoroughly and it turns out that this fix is incomplete. If we consider the following RegEx:

            var pattern = new Regex(@"^(?i:[\xD2-\xDC])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
            var match247 = pattern.IsMatch(((char)247).ToString());

For correct behavior, match247 should be false. However, we return true today (even with this PR). The reason this PR is incomplete is because the problematic char is the symbol ÷ (decimal value 215). \xD2 - \xDC is the range 210 - 220. The corresponding lowercase range is 242 - 252. However, it is NOT a 1o1 mapping. The uppercase chars 210 - 214 match the lowercase chars 242 - 246. The uppercase chars 216-220 match the lowercase chars 248-252. Uppercase char 215's lowercase char is still 215. I miss this case with the simple if (range.Last - range.First == upper - lower) check. However, an elegant fix does exist here. s_lcTable itself can be modified to split the range \u00C0 - \u00DE into \u00C0 - \u00D6 and \u00D8 - \u00DE. I'll push the proper fix and updated unit tests shortly.

danmoseley · 2020-09-16T23:32:47Z

@pgovind are you special casing this character? Does that approach scale given how large the Unicode BMP is (and I assume it can change)? We already special case Turkish I, including in some cases where we shouldn't (eg., shoehorning it into the ECMAScript definition of a word when it isn't listed in the ECMAScript definition for \w).

pgovind · 2020-09-17T22:35:24Z

Does that approach scale given how large the Unicode BMP is

Fortunately, we only need to test the ranges in s_lcTable that are tagged as LowercaseAdd. I wrote some simple code to find cases where char.ToLower(AChar) != UpperCaseAChar + offset from the ranges in s_lcTable and fixed all the cases I found were wrong.

danmoseley · 2020-09-17T23:01:49Z

I wrote some simple code to find cases where char.ToLower(AChar) != UpperCaseAChar + offset from the ranges in s_lcTable and fixed all the cases I found were wrong.

Should that code run in a test, to protect your fixes?

pgovind · 2020-09-18T00:06:22Z

Should that code run in a test, to protect your fixes?

I thought about it. RegexCharClass.s_lcTable is private static readonly though. We usually avoid using reflection in the unit tests right? Think this qualifies for an exception? The alternative is to have a #if DEBUG {validation code} #endif.

danmoseley · 2020-09-18T00:08:18Z

I think reflection in the tests is fine if you get good benefit and don't have another option. Nobody will break it except us - they'll discover it in CI if they do - then they can fix it or worst case delete the test, which leaves us no worse off.

danmoseley · 2020-09-18T00:10:42Z

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

@@ -300,7 +300,8 @@ internal sealed class RegexCharClass
        private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]
        {
            new LowerCaseMapping('\u0041', '\u005A', LowercaseAdd, 32),
-            new LowerCaseMapping('\u00C0', '\u00DE', LowercaseAdd, 32),
+            new LowerCaseMapping('\u00C0', '\u00D6', LowercaseAdd, 32),


Not you, but this file in general could use some better naming/commenting. Eg., a one line comment here explaining what each of the 4 entries are would be helpful. I look in the struct, and they're Chmin, Chmax, LcOp, Data -- which are all pretty bad names that could be improved.

Elsewhere the naming could be improved too. Eg., LowercaseBad we would normally name something like LowerCaseBitwiseAnThenAddOne or something, and we'd probably use an enum not consts.

Not suggesting you need to fix in this PR, in fact I think it's probably better one of us do it separately later.

There's some spiel above this spot (L267-L293), but I agree with the comment. It wasn't obvious at all what LowercaseSet/Bor/Bad did from the name :)

stephentoub · 2020-09-18T00:10:55Z

The alternative is to have a #if DEBUG {validation code} #endif

How long does the validation take? If not that long, I'd go this route, putting it in a debug-only static cctor.

danmoseley · 2020-09-18T00:26:53Z

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

+                        }
+                        else
+                        {
+                            // Bug fix: Unicode `Symbol`s sometimes exist in the middle of character ranges. char.ToLower(Symbol) returns Symbol. In these cases, we cannot use an offset to find the lowercase chars. For ex: https://github.com/dotnet/runtime/issues/36149


Nit about the "bug fix" part and the URL --

I once read someone's description of the ideal state of a mature codebase as one where the code appeared to have been written by a single highly productive perfect person in one sitting. The idea was the code is fully self consistent and it reads easily and intuitively. In that philosophy, a bug fix is an incremental modification designed to bring it closer to that ideal state. History matters, but that is what git blame and github search exists for. Historical commentary in a codebase often doesn't age well -- it can be irrelevant, become outdated (eg dead links), and looks scary to refactor.

I do think it's great to have comments like this where it's not obvious why the code is written a certain way - since that hypothetical perfect person would have written comments of that kind.

Tests to me are different - they're inevitably an idiosyncratic collection of independent chunks of code for particular bug fixes, etc and links and bug ID's there make more sense.

I know this is partly a matter of taste and @stephentoub has a slightly different preference, but clearly we're just picking places on a dial, since we make many bug fixes without including Github URL's

I agree about the bug fix part; I feel the same way about tests as I do about comments in code, e.g. tests labeled "regression" really bug me :-)

Well it's good that the whole else block (containing the comment and URL) is going away then :D

tests labeled "regression" really bug me :-)

That's interesting, and I shall try to avoid them. Let's make our tests look like they were written by perfect people too!

danmoseley · 2020-09-18T00:47:35Z

The alternative is to have a #if DEBUG {validation code} #endif.

For completeness, another option some tests use is to compile selected product files into themselves. That works well for an internal utility class with a well defined API ..maybe not here.

pgovind · 2020-09-18T22:33:18Z

Alright, I think this is pretty clean now.

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

danmoseley · 2020-09-19T19:32:55Z

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

+                LowerCaseMapping loc = s_lcTable[k];
+                if (loc.LcOp == 1)
+                {
+                    // Validate only the LowercaseAdd cases


Why not validate all the table while you are at it?

Good idea. Done.

Hmm, as part of validating the table I ran into this case. Results from the immediate window below:

uppercase 304 'İ' culture.TextInfo.ToLower(uppercase) 304 'İ' char.ToLower(uppercase) // This is what is stored in s_lcTable 105 'i'

culture here (and in the validation code in this PR) is set to CultureInfo.InvariantCulture. char.ToLower uses CurrentCulture instead. This behavior is related to #36147. Essentially, s_lcTable seems to have its values populated for "en-US".

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs

pgovind · 2020-10-26T18:54:12Z

Thanks for the review @stephentoub . The test failures were because NetFramework would fail the new unit tests, so I added a commit to skip them only on netframework.

pgovind · 2020-10-26T21:50:56Z

Seeing this error in CI now:

Process terminated. Assertion failed.
The Unicode character range at index 38 in s_lcTable contains the character ? (decimal value: 453). Its lowercase value ? is not the stored value ?.
   at System.Text.RegularExpressions.RegexCharClass..cctor() in /_/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs:line 437
   at System.Text.RegularExpressions.RegexCharClass.IsWordChar(Char ch) in /_/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs:line 1077

The ? chars don't make much sense, they should be printing out the character 'ǅ'. Maybe the editor used there doesn't support full UTF? In any case, that is not important. The validation succeeds locally on my machine but does not in CI, even though we're using CultureInfo.InvariantCulture (which should mean the validation is independent of the CI machine's local culture). Investigating ...

danmoseley · 2020-10-26T22:04:36Z

The ? chars don't make much sense, they should be printing out the character 'ǅ'. Maybe the editor used there doesn't support full UTF?

It's probably going through the standard console codepage. Better to have it print out the actual hex.

pgovind · 2020-10-28T21:45:01Z

Just logging what I found here:
The character failing validation is 'ǅ' (decimal value: 453, hex: 1C5). In my C# Interactive Window (.NET Framework), I see this:

> c01c5
'ǅ'
> (int)c01c5
453
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5)
'ǅ'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(c01c5) == c01c6
false

However, when I run the unit test on .NET Core and place a break point, I see this in the immediate window:

> uppercase
453 'ǅ'
> System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower(uppercase)
454 'ǆ'

Questions

Maybe Framework and Core have different behavior for System.Globalization.CultureInfo.InvariantCulture.TextInfo.ToLower? I'm not sure.
Even if we ignore Framework, this run on CI is on .NET Core, so we should be seeing the same behavior as .NET Core on my machine and the validation should pass...Still don't understand what is different on CI :/

Tagging @GrabYourPitchforks @tarekgh for some input?

tarekgh · 2020-10-28T22:03:44Z

@pgovind

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5

The behavior you are seeing with net core is correct as 01C5 maps to 01C6.
When running on C# Interactive Window, does it depend on the Ful Framework installed on your machine or it run on some host has a different version of the framework? you can try running the case mapping code on the Full Framework in your machine and look if you see the same result.

CI properly didn't fail because we are using ICU for net core even running on Windows. which mean ICU mapping the character correctly while the Full Framework doesn't.

Skip it on non-ICU environments

GrabYourPitchforks · 2020-10-29T22:44:33Z

In addition to Tarek's great response, I recommend the two tools below if you're investigating casing issues:

https://util.unicode.org/UnicodeJsps/character.jsp?a=01C5+&B1=Show (put the char or the hex representation into the textbox)
https://www.fileformat.info/info/unicode/char/01c5/index.htm (put the hex representation into the URL)

These tools give you a whole slew of info, including any special characteristics that these code points have and what their various-case representations are.

pgovind · 2020-10-30T18:22:02Z

Alright, so Tarek helped me figure out that the CI errors I was seeing was because CI was using a WIndows 8 machine that wasn't using ICU. The easiest way to limit the validation to ICU enabled machines was using attributes defined in TestUtilities, so I moved the validation code to a unit test and included RegexCharClass.MappingTable.cs in the test csproj. Since some code changed from when you approved the PR, tagging @stephentoub here. I don't expect any comments, but just in case :)

I went ahead and merged the PR. We can always make changes if we want to later.

pgovind added the area-System.Text.RegularExpressions label Sep 15, 2020

pgovind requested review from stephentoub, eerhardt and danmoseley September 15, 2020 22:53

pgovind added this to the Future milestone Sep 15, 2020

stephentoub reviewed Sep 16, 2020

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 16, 2020

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Outdated Show resolved Hide resolved

danmoseley changed the title ~~Bug fix and unit test~~ Fix incorrect handling of character range and capitalization in regex Sep 16, 2020

danmoseley reviewed Sep 18, 2020

View reviewed changes

danmoseley reviewed Sep 19, 2020

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Outdated Show resolved Hide resolved

danmoseley reviewed Sep 19, 2020

View reviewed changes

stephentoub reviewed Sep 21, 2020

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Show resolved Hide resolved

stephentoub reviewed Sep 21, 2020

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 21, 2020

View reviewed changes

src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs Show resolved Hide resolved

stephentoub approved these changes Oct 23, 2020

View reviewed changes

Prashanth Govindarajan added 9 commits October 26, 2020 09:56

Bug fix and unit test

eadf110

Address comments

e1c3219

An elegant fix and unit tests

a7c8f9e

Remove old code

db6b697

Fix the unit tests

6895ea7

sq

48fbce4

sq

0a43316

Address comments and add test case

165cbab

Skip the unit tests on net framework

4d3e58d

pgovind force-pushed the Fix_regex_range branch from 7bde5c5 to 4d3e58d Compare October 26, 2020 18:51

runfoapp bot mentioned this pull request Oct 26, 2020

Mono SIGABRT in System.Drawing.Common affecting some test runs #37838

Closed

Debug commit for CI

1aa54b0

Try to get more error info

04da0a3

Run validation as a unit test.

f42e2a0

Skip it on non-ICU environments

runfoapp bot mentioned this pull request Oct 30, 2020

System.Buffers.ArrayPool.Tests.CollectionTests.PollingEventFires failing in CI #44037

Open

pgovind merged commit 7c9c347 into dotnet:master Nov 2, 2020

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Fix incorrect handling of character range and capitalization in regex #42282

Fix incorrect handling of character range and capitalization in regex #42282

Conversation

pgovind commented Sep 15, 2020

Dotnet-GitSync-Bot commented Sep 15, 2020

ghost commented Sep 15, 2020

jeffhandley commented Sep 15, 2020

pgovind commented Sep 15, 2020

jeffhandley commented Sep 15, 2020

stephentoub commented Sep 16, 2020

jeffhandley commented Sep 16, 2020

stephentoub commented Sep 16, 2020

danmoseley commented Sep 16, 2020

stephentoub commented Sep 16, 2020

pgovind commented Sep 16, 2020

danmoseley commented Sep 16, 2020 • edited Loading

pgovind commented Sep 17, 2020

danmoseley commented Sep 17, 2020

pgovind commented Sep 18, 2020

danmoseley commented Sep 18, 2020

danmoseley Sep 18, 2020

Choose a reason for hiding this comment

pgovind Sep 18, 2020

Choose a reason for hiding this comment

stephentoub commented Sep 18, 2020 • edited Loading

danmoseley Sep 18, 2020 • edited Loading

Choose a reason for hiding this comment

stephentoub Sep 18, 2020 • edited Loading

Choose a reason for hiding this comment

pgovind Sep 18, 2020

Choose a reason for hiding this comment

danmoseley Sep 18, 2020

Choose a reason for hiding this comment

danmoseley commented Sep 18, 2020

pgovind commented Sep 18, 2020

danmoseley Sep 19, 2020

Choose a reason for hiding this comment

pgovind Sep 21, 2020

Choose a reason for hiding this comment

pgovind Sep 21, 2020

Choose a reason for hiding this comment

pgovind commented Oct 26, 2020 • edited Loading

pgovind commented Oct 26, 2020 • edited Loading

danmoseley commented Oct 26, 2020

pgovind commented Oct 28, 2020 • edited Loading

tarekgh commented Oct 28, 2020

GrabYourPitchforks commented Oct 29, 2020

pgovind commented Oct 30, 2020 • edited Loading

danmoseley commented Sep 16, 2020 •

edited

Loading

stephentoub commented Sep 18, 2020 •

edited

Loading

danmoseley Sep 18, 2020 •

edited

Loading

stephentoub Sep 18, 2020 •

edited

Loading

pgovind commented Oct 26, 2020 •

edited

Loading

pgovind commented Oct 26, 2020 •

edited

Loading

pgovind commented Oct 28, 2020 •

edited

Loading

pgovind commented Oct 30, 2020 •

edited

Loading