Changing the logic for how we deal with RegexOptions.IgnoreCase matching. (#67184)

* Changing the logic for how we deal with RegexOptions.IgnoreCase
matching.

* Addressing first round of feedback

* Addressing more feedback.

* - Ensure that Backreferences use the same case behavior that the casing table does when using IgnoreCase.
- Addressing more feedback.

* Apply suggestions from code review

Co-authored-by: Stephen Toub <stoub@microsoft.com>

* Address more feedback

* Fix allocation regression for patterns with a lot of ascii letters

* Skip few tests in Browser and .NET Framework

* Skip one more test that shouldn't be ran on wasm

* Address more PR Feedback

* More feedback

* Skip tests that are failing in NLS-globalization queues

Co-authored-by: Stephen Toub <stoub@microsoft.com>
joperezr and stephentoub authored Apr 6, 2022
1 parent b4c76da commit 90908d5
Showing 44 changed files with 2,281 additions and 1,800 deletions.
@@ -1,6 +1,7 @@
# Instructions for updating Unicode version in dotnet/runtime

## Table of Contents

- [Instructions for updating Unicode version in dotnet/runtime](#instructions-for-updating-unicode-version-in-dotnetruntime)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
@@ -24,8 +25,7 @@ This repository has several places that need to be updated when we are ingesting
- extracted/DerivedBidiClass.txt
- extracted/DerivedName.txt

-2. Once you have downloaded all those files, create a fork of the repo https://github.com/dotnet/runtime-assets and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: https://github.com/dotnet/runtime-assets/pull/179
-
+2. Once you have downloaded all those files, create a fork of the repo <https://github.com/dotnet/runtime-assets> and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: <https://github.com/dotnet/runtime-assets/pull/179>

## Ingest the created package into dotnet/runtime repo

@@ -42,6 +42,6 @@ This should be done automatically by dependency-flow, so in theory there shouldn
- System.Globalization.Nls.Tests.csproj
- System.Text.Encodings.Web.Tests.csproj
4. If the new Unicode data contains casing changes/updates, then we will also need to update the `src/coreclr/pal/src/locale/unicodedata.cpp` file. This file is used by most of the reflection stack whenever you specify `BindingFlags.IgnoreCase`. In order to regenerate the contents of the `unicodedata.cpp` file, you need to run the program located at `src/coreclr/pal/src/locale/unicodedata.cs` and give a full path to the new UnicodeData.txt as a parameter.
-5. If the new Unicode data made changes on what character class a specific character belongs to, or added new characters, you may need to update the serialized Unicode character classes data in `System.Text.RegularExpressions` for the `NonBacktracking` engine. The telling sign that will show you if you need to do this, is if any tests are failing in the `System.Text.RegularExpressions.Tests` test project. In case some tests do fail (which means you need to update the serialized mappings), you will need to edit the file `src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexExperiment.cs` and set the `Enabled` bool to `true`, and re-run the RegexTests. This will generate a couple of files in your `%temp%` directory: `IgnoreCaseRelation.cs` and `UnicodeCategoryRanges.cs`. These files will need to be copied (and overwrite the existing ones) to the folder `src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Unicode/`
+5. Update the Regex casing equivalence table using the UnicodeData.txt file from the new Unicode version. You can find the instructions on how to do this [here](../../../System.Text.RegularExpressions/tools/Readme.md).
6. Finally, last step is to update the license for the Unicode data into our [Third party notices](../../../../../THIRD-PARTY-NOTICES.TXT) by copying the contents located in `https://www.unicode.org/license.html` to the section that has the Unicode license in our notices.
7. That's it, now commit all of the changed files, and send a PR into dotnet/runtime with the updates. If there were any special things you had to do that are not noted on this document, PLEASE UPDATE THESE INSTRUCTIONS to facilitate future updates.

Large diffs are not rendered by default.

@@ -33,8 +33,10 @@
     <Compile Include="$(CoreLibSharedDir)System\Collections\Generic\ValueListBuilder.cs" Link="Production\ValueListBuilder.cs" />
     <Compile Include="..\src\System\Collections\Generic\ValueListBuilder.Pop.cs" Link="Production\ValueListBuilder.Pop.cs" />
     <Compile Include="..\src\System\Threading\StackHelper.cs" Link="Production\StackHelper.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" Link="Production\RegexCaseEquivalences.Data.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.cs" Link="Production\RegexCaseEquivalences.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseBehavior.cs" Link="Production\RegexCaseBehavior.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.cs" Link="Production\RegexCharClass.cs" />
-    <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" Link="Production\RegexCharClass.MappingTable.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexFindOptimizations.cs" Link="Production\RegexFindOptimizations.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexNode.cs" Link="Production\RegexNode.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexNodeKind.cs" Link="Production\RegexNodeKind.cs" />
@@ -24,8 +24,10 @@
     <Compile Include="System\Text\RegularExpressions\Regex.Replace.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Split.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Timeout.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseBehavior.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCharClass.cs" />
-    <Compile Include="System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexFindOptimizations.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexGeneratorAttribute.cs" />
@@ -83,10 +85,6 @@
     <Compile Include="System\Text\RegularExpressions\Symbolic\SymbolicRegexSet.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegex.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegexKind.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\GeneratorHelper.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelation.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelationGenerator.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseTransformer.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRanges.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRangesGenerator.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryTheory.cs" />
@@ -1,17 +1,22 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.

+using System.Globalization;
+
 namespace System.Text.RegularExpressions
 {
     internal sealed class CompiledRegexRunner : RegexRunner
     {
         private readonly ScanDelegate _scanMethod;
+        /// <summary>This field will only be set if the pattern contains backreferences and has RegexOptions.IgnoreCase</summary>
+        private readonly TextInfo? _textInfo;

         internal delegate void ScanDelegate(RegexRunner runner, ReadOnlySpan<char> text);

-        public CompiledRegexRunner(ScanDelegate scan)
+        public CompiledRegexRunner(ScanDelegate scan, CultureInfo? culture)
         {
             _scanMethod = scan;
+            _textInfo = culture?.TextInfo;
         }

         protected internal override void Scan(ReadOnlySpan<char> text)
@@ -1,24 +1,28 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.

+using System.Globalization;
 using System.Reflection.Emit;

 namespace System.Text.RegularExpressions
 {
     internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory
     {
         private readonly DynamicMethod _scanMethod;
+        /// <summary>This field will only be set if the pattern has backreferences and uses RegexOptions.IgnoreCase</summary>
+        private readonly CultureInfo? _culture;

         // Delegate is lazily created to avoid forcing JIT'ing until the regex is actually executed.
         private CompiledRegexRunner.ScanDelegate? _scan;

-        public CompiledRegexRunnerFactory(DynamicMethod scanMethod)
+        public CompiledRegexRunnerFactory(DynamicMethod scanMethod, CultureInfo? culture)
         {
             _scanMethod = scanMethod;
+            _culture = culture;
         }

         protected internal override RegexRunner CreateInstance() =>
             new CompiledRegexRunner(
-                _scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>());
+                _scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>(), _culture);
     }
 }
@@ -44,13 +44,12 @@ internal void SaveDGML(TextWriter writer, bool nfa, bool addDotStar, bool revers
         }

         /// <summary>
-        /// Generates two files IgnoreCaseRelation.cs and UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
+        /// Generates UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
         /// in the given directory path. Only available in DEBUG mode.
         /// </summary>
         [ExcludeFromCodeCoverage(Justification = "Debug only")]
         internal static void GenerateUnicodeTables(string path)
         {
-            IgnoreCaseRelationGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "IgnoreCaseRelation", path);
             UnicodeCategoryRangesGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "UnicodeCategoryRanges", path);
         }

@@ -67,7 +67,7 @@ internal Regex(string pattern, CultureInfo? culture)
             RegexTree tree = Init(pattern, RegexOptions.None, s_defaultMatchTimeout, ref culture);

             // Create the interpreter factory.
-            factory = new RegexInterpreterFactory(tree, culture);
+            factory = new RegexInterpreterFactory(tree);

             // NOTE: This overload _does not_ delegate to the one that takes options, in order
             // to avoid unnecessarily rooting the support for RegexOptions.NonBacktracking/Compiler
@@ -101,7 +101,7 @@ internal Regex(string pattern, RegexOptions options, TimeSpan matchTimeout, Cult
                 }

                 // If no factory was created, fall back to creating one for the interpreter.
-                factory ??= new RegexInterpreterFactory(tree, culture);
+                factory ??= new RegexInterpreterFactory(tree);
             }
         }

@@ -0,0 +1,39 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Globalization;

namespace System.Text.RegularExpressions
{
    /// <summary>
    /// When a regular expression specifies the option <see cref="RegexOptions.IgnoreCase"/>, comparisons between the input and the
    /// pattern will be made case-insensitively. In order to support this, we need to define which case mappings shall be used for the comparisons.
    /// A case mapping exists whenever you have two characters 'A' and 'B', where either 'A' is the ToLower() representation of 'B' or both 'A' and 'B' lowercase to the
    /// same character. Note that we don't consider a mapping when the only relationship between 'A' and 'B' is that one is the ToUpper() representation of the other. This
    /// is for backwards compatibility since, in Regex, we have only considered ToLower() for case-insensitive comparisons. Given that the case mappings vary depending on the culture,
    /// Regex supports 3 main behaviors or mappings: Invariant, NonTurkish, and Turkish. This is in order to match the current ToLower() behavior of all
    /// cultures supported by .NET. As a side note, there should be no cases where 'A'.ToLower() == 'B' but 'A'.ToLower() != 'B'.ToLower(). This aspect is important since
    /// for backreferences we use a.ToLower() == b.ToLower() for comparisons, so if there were such a case it would lead to inconsistencies between how we handle
    /// backreferences and how we handle other case-insensitive comparisons.
    /// </summary>
    internal enum RegexCaseBehavior
    {
        /// <summary>
        /// Invariant case mappings are used. This includes all of the common mappings across cultures. This behavior is used when either the user
        /// specified <see cref="RegexOptions.CultureInvariant"/> or when the CurrentCulture is <see cref="CultureInfo.InvariantCulture"/>.
        /// </summary>
        Invariant,

        /// <summary>
        /// These are all the same mappings used by the Invariant behavior, with an additional one: \u0130 => \u0069.
        /// This mode will be used when CurrentCulture is neither Invariant nor any of the tr/az cultures.
        /// </summary>
        NonTurkish,

        /// <summary>
        /// These are all the same mappings used by the NonTurkish behavior, with the exception of \u0049 => \u0069, which doesn't exist
        /// in this behavior, and with the additional mapping \u0049 => \u0131. This mode will be used when CurrentCulture is any of the tr/az cultures.
        /// </summary>
        Turkish
    }
}
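
The doc comments above describe how a behavior is chosen and how backreferences compare characters. A minimal sketch of those two rules follows. It is not the code from this commit: `CaseBehavior`, `CaseBehaviorSketch`, `Select`, `IsTurkishOrAzeri`, and `AreEquivalent` are hypothetical names introduced for illustration, and detecting tr/az cultures by the culture-name prefix is an assumption.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical mirror of the internal RegexCaseBehavior enum, for illustration only.
public enum CaseBehavior { Invariant, NonTurkish, Turkish }

public static class CaseBehaviorSketch
{
    // Picks a behavior following the rules stated in the enum's doc comments:
    // CultureInvariant or the invariant culture => Invariant; tr/az => Turkish; otherwise NonTurkish.
    public static CaseBehavior Select(RegexOptions options, CultureInfo culture)
    {
        if ((options & RegexOptions.CultureInvariant) != 0 || culture.Name.Length == 0)
        {
            return CaseBehavior.Invariant;
        }

        return IsTurkishOrAzeri(culture.Name) ? CaseBehavior.Turkish : CaseBehavior.NonTurkish;
    }

    // Assumption: a "tr/az culture" is one whose name is "tr"/"az" or starts with "tr-"/"az-".
    private static bool IsTurkishOrAzeri(string name) =>
        (name.StartsWith("tr", StringComparison.Ordinal) || name.StartsWith("az", StringComparison.Ordinal)) &&
        (name.Length == 2 || name[2] == '-');

    // The backreference-style comparison described above: two characters are treated as
    // equivalent when they lowercase to the same character. ToUpper() is never consulted.
    public static bool AreEquivalent(char a, char b, CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;
        return textInfo.ToLower(a) == textInfo.ToLower(b);
    }
}
```

For example, `CaseBehaviorSketch.Select(RegexOptions.IgnoreCase, CultureInfo.GetCultureInfo("tr-TR"))` yields `CaseBehavior.Turkish`, while passing `RegexOptions.CultureInvariant` yields `CaseBehavior.Invariant` regardless of culture. The results of `AreEquivalent` for characters such as \u0130 depend on the platform's casing data, so the sketch makes no claim about them.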