Changing the logic for how we deal with RegexOptions.IgnoreCase matching. #67184

Merged on Apr 6, 2022 (12 commits)
@@ -1,6 +1,7 @@
# Instructions for updating Unicode version in dotnet/runtime

## Table of Contents

- [Instructions for updating Unicode version in dotnet/runtime](#instructions-for-updating-unicode-version-in-dotnetruntime)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
@@ -24,8 +25,7 @@ This repository has several places that need to be updated when we are ingesting
- extracted/DerivedBidiClass.txt
- extracted/DerivedName.txt

2. Once you have downloaded all those files, create a fork of the repo https://github.com/dotnet/runtime-assets and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: https://github.com/dotnet/runtime-assets/pull/179

2. Once you have downloaded all those files, create a fork of the repo <https://github.com/dotnet/runtime-assets> and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: <https://github.com/dotnet/runtime-assets/pull/179>

## Ingest the created package into dotnet/runtime repo

@@ -42,6 +42,6 @@ This should be done automatically by dependency-flow, so in theory there shouldn
- System.Globalization.Nls.Tests.csproj
- System.Text.Encodings.Web.Tests.csproj
4. If the new Unicode data contains casing changes/updates, then we will also need to update the `src/coreclr/pal/src/locale/unicodedata.cpp` file. This file is used by most of the reflection stack whenever you specify `BindingFlags.IgnoreCase`. To regenerate the contents of the `unicodedata.cpp` file, run the program located at `src/coreclr/pal/src/locale/unicodedata.cs` and pass the full path to the new UnicodeData.txt as a parameter.
5. If the new Unicode data made changes on what character class a specific character belongs to, or added new characters, you may need to update the serialized Unicode character classes data in `System.Text.RegularExpressions` for the `NonBacktracking` engine. The telling sign that will show you if you need to do this, is if any tests are failing in the `System.Text.RegularExpressions.Tests` test project. In case some tests do fail (which means you need to update the serialized mappings), you will need to edit the file `src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexExperiment.cs` and set the `Enabled` bool to `true`, and re-run the RegexTests. This will generate a couple of files in your `%temp%` directory: `IgnoreCaseRelation.cs` and `UnicodeCategoryRanges.cs`. These files will need to be copied (and overwrite the existing ones) to the folder `src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Unicode/`
5. Update the Regex casing equivalence table using the UnicodeData.txt file from the new Unicode version. You can find the instructions on how to do this [here](../../../System.Text.RegularExpressions/tools/Readme.md).
6. Finally, the last step is to update the license for the Unicode data in our [Third party notices](../../../../../THIRD-PARTY-NOTICES.TXT) by copying the contents located at `https://www.unicode.org/license.html` to the section that has the Unicode license in our notices.
7. That's it, now commit all of the changed files, and send a PR into dotnet/runtime with the updates. If there were any special things you had to do that are not noted in this document, PLEASE UPDATE THESE INSTRUCTIONS to facilitate future updates.
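
Step 4 only applies when the new Unicode data actually changes casing. The following is a minimal sketch, not a tool from the repository, of one way to check that by comparing the simple lowercase mapping (field 13 of each semicolon-separated record) between the old and new `UnicodeData.txt`. The type and method names (`UnicodeCasingDiff`, `ParseLowercaseMappings`, `ChangedCodePoints`) are hypothetical, and the sketch ignores the uppercase (field 12) and titlecase (field 14) mappings.

```csharp
using System;
using System.Collections.Generic;

// Illustrative helper only; this type does not exist in dotnet/runtime.
static class UnicodeCasingDiff
{
    // Maps each code point to its simple lowercase mapping, read from
    // field 13 of a UnicodeData.txt record. Records without one are skipped.
    public static Dictionary<int, int> ParseLowercaseMappings(IEnumerable<string> lines)
    {
        var map = new Dictionary<int, int>();
        foreach (string line in lines)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 15 || fields[13].Length == 0)
                continue; // malformed record, or no lowercase mapping

            int codePoint = Convert.ToInt32(fields[0], 16);
            map[codePoint] = Convert.ToInt32(fields[13], 16);
        }
        return map;
    }

    // Returns, in ascending order, the code points whose lowercase mapping
    // was added, removed, or changed between the two versions.
    public static List<int> ChangedCodePoints(Dictionary<int, int> oldMap, Dictionary<int, int> newMap)
    {
        var changed = new SortedSet<int>();
        foreach (KeyValuePair<int, int> pair in newMap)
        {
            if (!oldMap.TryGetValue(pair.Key, out int previous) || previous != pair.Value)
                changed.Add(pair.Key);
        }
        foreach (int key in oldMap.Keys)
        {
            if (!newMap.ContainsKey(key))
                changed.Add(key);
        }
        return new List<int>(changed);
    }
}
```

Each map could be built from `File.ReadLines(path)` for the corresponding file. An empty result from `ChangedCodePoints` would suggest that no lowercase mappings changed, though uppercase-only changes would need the same comparison on field 12.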

@@ -33,8 +33,10 @@
<Compile Include="$(CoreLibSharedDir)System\Collections\Generic\ValueListBuilder.cs" Link="Production\ValueListBuilder.cs" />
<Compile Include="..\src\System\Collections\Generic\ValueListBuilder.Pop.cs" Link="Production\ValueListBuilder.Pop.cs" />
<Compile Include="..\src\System\Threading\StackHelper.cs" Link="Production\StackHelper.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" Link="Production\RegexCaseEquivalences.Data.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.cs" Link="Production\RegexCaseEquivalences.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexCaseBehavior.cs" Link="Production\RegexCaseBehavior.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.cs" Link="Production\RegexCharClass.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" Link="Production\RegexCharClass.MappingTable.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexFindOptimizations.cs" Link="Production\RegexFindOptimizations.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexNode.cs" Link="Production\RegexNode.cs" />
<Compile Include="..\src\System\Text\RegularExpressions\RegexNodeKind.cs" Link="Production\RegexNodeKind.cs" />
@@ -24,8 +24,10 @@
<Compile Include="System\Text\RegularExpressions\Regex.Replace.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Split.cs" />
<Compile Include="System\Text\RegularExpressions\Regex.Timeout.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCaseBehavior.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCharClass.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" />
<Compile Include="System\Text\RegularExpressions\RegexFindOptimizations.cs" />
<Compile Include="System\Text\RegularExpressions\RegexGeneratorAttribute.cs" />
@@ -83,10 +85,6 @@
<Compile Include="System\Text\RegularExpressions\Symbolic\SymbolicRegexSet.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegex.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegexKind.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\GeneratorHelper.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelation.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelationGenerator.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseTransformer.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRanges.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRangesGenerator.cs" />
<Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryTheory.cs" />
@@ -1,17 +1,22 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Globalization;

namespace System.Text.RegularExpressions
{
internal sealed class CompiledRegexRunner : RegexRunner
{
private readonly ScanDelegate _scanMethod;
/// <summary>This field will only be set if the pattern contains backreferences and has RegexOptions.IgnoreCase</summary>
private readonly TextInfo? _textInfo;

internal delegate void ScanDelegate(RegexRunner runner, ReadOnlySpan<char> text);

public CompiledRegexRunner(ScanDelegate scan)
public CompiledRegexRunner(ScanDelegate scan, CultureInfo? culture)
{
_scanMethod = scan;
_textInfo = culture?.TextInfo;
}

protected internal override void Scan(ReadOnlySpan<char> text)
@@ -1,24 +1,28 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Globalization;
using System.Reflection.Emit;

namespace System.Text.RegularExpressions
{
internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory
{
private readonly DynamicMethod _scanMethod;
/// <summary>This field will only be set if the pattern has backreferences and uses RegexOptions.IgnoreCase</summary>
private readonly CultureInfo? _culture;

// Delegate is lazily created to avoid forcing JIT'ing until the regex is actually executed.
private CompiledRegexRunner.ScanDelegate? _scan;

public CompiledRegexRunnerFactory(DynamicMethod scanMethod)
public CompiledRegexRunnerFactory(DynamicMethod scanMethod, CultureInfo? culture)
{
_scanMethod = scanMethod;
_culture = culture;
}

protected internal override RegexRunner CreateInstance() =>
new CompiledRegexRunner(
_scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>());
_scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>(), _culture);
}
}
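
The factory above defers creating the delegate until the first runner is requested, so a regex that is constructed but never executed does not pay for JIT compilation. As a rough sketch of that lazy `??=` pattern in isolation, the hypothetical `LazyScanFactory` below counts how often the expensive step runs. It stands in for the idea only and does not use `DynamicMethod` or any of the types in the diff.

```csharp
#nullable enable
using System;

// Hypothetical factory used only to illustrate lazy delegate creation;
// it is not the CompiledRegexRunnerFactory shown in the diff.
sealed class LazyScanFactory
{
    // Stands in for the expensive step (creating a delegate forces JIT'ing).
    private readonly Func<Action<string>> _createScan;

    // Stays null until the first call to GetScan.
    private Action<string>? _scan;

    // Number of times the expensive step has actually run.
    public int CreateCount { get; private set; }

    public LazyScanFactory(Func<Action<string>> createScan) => _createScan = createScan;

    // Creates the delegate on first use and returns the cached one afterwards.
    public Action<string> GetScan() => _scan ??= Create();

    private Action<string> Create()
    {
        CreateCount++;
        return _createScan();
    }
}
```

Note that `??=` on a plain field is not synchronized, so in this sketch two threads racing on the first call could each run the creation step; the sketch assumes that is acceptable because either result is usable.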
@@ -44,13 +44,12 @@ internal void SaveDGML(TextWriter writer, bool nfa, bool addDotStar, bool revers
}

/// <summary>
/// Generates two files IgnoreCaseRelation.cs and UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
/// Generates UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
/// in the given directory path. Only available in DEBUG mode.
/// </summary>
[ExcludeFromCodeCoverage(Justification = "Debug only")]
internal static void GenerateUnicodeTables(string path)
{
IgnoreCaseRelationGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "IgnoreCaseRelation", path);
UnicodeCategoryRangesGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "UnicodeCategoryRanges", path);
}

@@ -67,7 +67,7 @@ internal Regex(string pattern, CultureInfo? culture)
RegexTree tree = Init(pattern, RegexOptions.None, s_defaultMatchTimeout, ref culture);

// Create the interpreter factory.
factory = new RegexInterpreterFactory(tree, culture);
factory = new RegexInterpreterFactory(tree);

// NOTE: This overload _does not_ delegate to the one that takes options, in order
// to avoid unnecessarily rooting the support for RegexOptions.NonBacktracking/Compiler
@@ -101,7 +101,7 @@ internal Regex(string pattern, RegexOptions options, TimeSpan matchTimeout, Cult
}

// If no factory was created, fall back to creating one for the interpreter.
factory ??= new RegexInterpreterFactory(tree, culture);
factory ??= new RegexInterpreterFactory(tree);
}
}

@@ -0,0 +1,39 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Globalization;

namespace System.Text.RegularExpressions
{
/// <summary>
/// When a regular expression specifies the option <see cref="RegexOptions.IgnoreCase"/> then comparisons between the input and the
/// pattern will be made case-insensitively. In order to support this, we need to define which case mappings shall be used for the comparisons.
/// A case mapping exists whenever you have two characters 'A' and 'B', where either 'A' is the ToLower() representation of 'B' or both 'A' and 'B' lowercase to the
/// same character. Note that we don't consider a mapping when the only relationship between 'A' and 'B' is that one is the ToUpper() representation of the other. This
/// is for backwards compatibility since, in Regex, we have only ever considered ToLower() for case-insensitive comparisons. Given that the case mappings vary depending on the culture,
/// Regex supports 3 main behaviors or mappings: Invariant, NonTurkish, and Turkish. This is in order to match the current ToLower() behavior of all .NET supported
/// cultures. As a side note, there should be no cases where 'A'.ToLower() == 'B' but 'A'.ToLower() != 'B'.ToLower(). This aspect is important since
/// for backreferences we use a.ToLower() == b.ToLower() for comparisons, so if there were such a case it would lead to inconsistencies between how we handle
/// backreferences vs. how we handle other case-insensitive comparisons.
/// </summary>
internal enum RegexCaseBehavior
{
/// <summary>
/// Invariant case-mappings are used. This includes all of the common mappings across cultures. This behavior is used when either the user
/// specified <see cref="RegexOptions.CultureInvariant"/> or when the CurrentCulture is <see cref="CultureInfo.InvariantCulture"/>.
/// </summary>
Invariant,

/// <summary>
/// These are all the same mappings used by Invariant behavior, with an additional one: \u0130 => \u0069
/// This mode will be used when CurrentCulture is neither Invariant nor any of the tr/az cultures.
/// </summary>
NonTurkish,

/// <summary>
/// These are all the same mappings used by non-Turkish behavior, except that the mapping \u0049 => \u0069 doesn't exist
/// in this behavior and the mapping \u0049 => \u0131 is added. This mode will be used when CurrentCulture is any of the tr/az cultures.
/// </summary>
Turkish
}
}
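
The enum above distinguishes three behaviors based on `RegexOptions.CultureInvariant` and the current culture. The following is a minimal sketch, not the PR's implementation, of how a culture name might be mapped to one of the three behaviors, together with the `a.ToLower() == b.ToLower()` comparison that the summary describes for backreferences. The names `CaseBehaviorSketch`, `CaseBehaviorSelector`, `FromCultureName`, and `LowercaseEquivalent` are hypothetical, and the rule used to recognize "tr/az cultures" (a language subtag of `tr` or `az`) is an assumption.

```csharp
using System;
using System.Globalization;

// Illustrative mirror of the internal RegexCaseBehavior enum.
enum CaseBehaviorSketch { Invariant, NonTurkish, Turkish }

static class CaseBehaviorSelector
{
    // Assumption: a culture counts as "tr/az" when the language subtag of its
    // name (the part before the first '-') is "tr" or "az".
    public static CaseBehaviorSketch FromCultureName(string cultureName, bool cultureInvariant = false)
    {
        // CultureInfo.InvariantCulture.Name is the empty string.
        if (cultureInvariant || cultureName.Length == 0)
            return CaseBehaviorSketch.Invariant;

        string language = cultureName.Split('-')[0];
        bool isTurkish = language.Equals("tr", StringComparison.OrdinalIgnoreCase)
                      || language.Equals("az", StringComparison.OrdinalIgnoreCase);

        return isTurkish ? CaseBehaviorSketch.Turkish : CaseBehaviorSketch.NonTurkish;
    }

    // Backreference-style comparison from the summary: two characters are
    // equivalent when they lowercase to the same character in the culture.
    public static bool LowercaseEquivalent(char a, char b, CultureInfo culture) =>
        char.ToLower(a, culture) == char.ToLower(b, culture);
}
```

For example, `CaseBehaviorSelector.FromCultureName(CultureInfo.CurrentCulture.Name)` would select a behavior for the current culture under these assumptions. What `LowercaseEquivalent` returns for the Turkish-specific characters depends on the globalization data available at run time.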