Factor out and improve the vectorization of RegexInterpreter.FindFirs…

…tChar (#61490) * Factor out and improve the vectorization of RegexInterpreter.FindFirstChar This change started with the "simple" goal of factoring out the FindFirstChar logic from RegexInterpreter and consuming it in SymbolicRegexMatcher. The existing engines use FindFirstChar to quickly skip ahead to the next location that might possibly match, at which point they fall back to analyzing the whole pattern at that location. SymbolicRegexMatcher (used by RegexOptions.NonBacktracking) had its own implementation for this, which it used any time it entered a start state. This required non-trivial additional code to maintain, and there's no good reason it should be separate from the other engines. However, what started out as a simple change grew due to regressions that resulted from differences in the implementations. In particular, SymbolicRegexMatcher already works off of precomputed equivalence tables for casing, which gives it very different characteristics in this regard from the existing engines. For example, SymbolicRegexMatcher's existing "skip ahead to the next possible match start location" logic already evaluated all the characters that could possibly start a match, which included variations of the same character when using IgnoreCase, but the existing RegexInterpreter logic didn't. That discrepancy then results in a significant IgnoreCase regression for NonBacktracking due to losing the ability to use a vectorized search for the next starting location. We already plan to shift the existing engines over to a plan where all of these equivalences are computed at construction time rather than using ToLower at both construction time and match time, so this PR takes some steps in that direction, doing so for most of ASCII. This has added some temporary cruft, which we'll be able to delete once we fully shift the implementations over (which we should do in the near future). Another difference was SymbolicRegexMatcher was enabling use of IndexOfAny for up to 5 characters, whereas RegexOptions.Compiled was only doing up to 3 characters, and RegexInterpreter wasn't doing for any number. The PR now uses 5 everywhere. However, the more characters involved, the more overhead there is to IndexOfAny, and for some inputs, the higher the chances are that IndexOfAny will find a match sooner, which means its overhead compounds more. To help with that, we now not only compute the possible characters that might match at the beginning of the pattern, but also characters that might match at a fixed offset from the beginning of the pattern (e.g. in \d{3}-\d{2}-\d{4}, it will find the '-' at offset 3 and be able to vectorize a search for that and then back off by the relevant distance. That then also means we might end up with multiple sets to choose to search for, and this PR borrows an idea from Rust, which is to use some rough frequency analysis to determine which set should be targeted. It's not perfect, and we can update the texts use to seed the analysis (right now I based it primarily on *.cs files in dotnet/runtime and some Project Gutenberg texts), but it's good enough for these purposes for now. We'd previously switched to using IndexOf for a case-sensitive prefix string, but still were using Boyer-Moore for case-insensitive. Now that we're able to also vectorize a search for case-insensitive values (right now just ASCII letter, but that'll be fixed soon), we can just get rid of Boyer-Moore entirely. This saves all the costs to do with constructing the Boyer-Moore tables and also avoids having to generate the Boyer-Moore implementations in RegexOptions.Compiled and the source generator. The casing change also defeated some other optimizations already present. For example, in .NET 5 we added an optimization whereby an alternation like `abcef|abcgh` would be transformed into `abc(?:ef|gh)`, and that would apply whether case-sensitive or case-insensitive. But by transforming the expression at construction now for case-insensitive into `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]`, that optimization was defeated. I've added a new optimization pass for alternations that will detect common prefixes even if they're sets. The casing change also revealed some cosmetic issues. As part of the change, when we encounter a "multi" (a multi-character string in the pattern), we convert that single case-insensitive RegexNode to instead be one case-sensitive RegexNode per character, with a set for all the equivalent characters that can match. This then defeats some of the nice formatting we had for multis in the source generator, so as part of this change, the source generator has been augmented to output nicer code for concatenations. And because sets like [Ee] are now way more common (since e.g. a case-insensitive 'e' will be transformed into such a set), we also special-case that in both the source generator and RegexOptions.Compiled, to spit out the equivalent of `(c | 0x20) == 'e'` rather than `(c == 'E'| c == 'e')`. Along the way, I cleaned up a few things as well, such as passing around a CultureInfo more rather than repeatedly calling CultureInfo.CurrentCulture, using CollectionsMarshal.GetValueRefOrAddDefault on a hot path to do with interning strings in a lookup table, tweaking SymbolicRegexRunnerFactory's Runner to itself be generic to avoid an extra layer of virtual dispatch per operation, and cleaning up code / comments in SymbolicRegexMatcher along the way. For the most part the purpose of the change wasn't to improve perf, and in fact I was willing to accept some regressions in the name of consolidation. There are a few regressions here, mostly small, and mostly for cases where we're simply paying an overhead for vectorization, e.g. where the current location is fine to match, or where the target character being searched for is very frequent. Overall, though, there are some substantial improvements. * Fix missing condition in RegexCompiler * Try to fix mono failures and address comment feedback * Delete more now dead code * Fix dead code emitting after refactoring Previously return statements while emitting anchors were short-circuiting the rest of the emitting code, but when I moved that code into a helper, the returns stopped having that impact, such that we'd end up emitting a return statement and then emit dead code after it. Fix it. * Remove some now dead code
dotnet · Nov 17, 2021 · 25237fa · 25237fa
1 parent a508fb5
commit 25237fa
Show file tree

Hide file tree

Showing 35 changed files with 3,595 additions and 3,487 deletions.
diff --git a/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs b/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
diff --git a/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs b/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs
@@ -178,7 +178,7 @@ private static bool IsSyntaxTargetForGeneration(SyntaxNode node) =>
             RegexCode code;
             try
             {
-                code = RegexWriter.Write(RegexParser.Parse(pattern, regexOptions, culture));
+                code = RegexWriter.Write(RegexParser.Parse(pattern, regexOptions, culture), culture);
             }
             catch (Exception e)
             {

diff --git a/src/libraries/System.Text.RegularExpressions/gen/Stubs.cs b/src/libraries/System.Text.RegularExpressions/gen/Stubs.cs
@@ -54,13 +54,13 @@ namespace System.Threading
 {
     internal static class InterlockedExtensions
     {
-        public static int Or(ref int location1, int value)
+        public static uint Or(ref uint location1, uint value)
         {
-            int current = location1;
+            uint current = location1;
             while (true)
             {
-                int newValue = current | value;
-                int oldValue = Interlocked.CompareExchange(ref location1, newValue, current);
+                uint newValue = current | value;
+                uint oldValue = (uint)Interlocked.CompareExchange(ref Unsafe.As<uint, int>(ref location1), (int)newValue, (int)current);
                 if (oldValue == current)
                 {
                     return oldValue;

diff --git a/...raries/System.Text.RegularExpressions/gen/System.Text.RegularExpressions.Generator.csproj b/...raries/System.Text.RegularExpressions/gen/System.Text.RegularExpressions.Generator.csproj
@@ -30,10 +30,10 @@
     <Compile Include="$(CoreLibSharedDir)System\Collections\Generic\ValueListBuilder.cs" Link="Production\ValueListBuilder.cs" />
     <Compile Include="..\src\System\Collections\Generic\ValueListBuilder.Pop.cs" Link="Production\ValueListBuilder.Pop.cs" />
     <Compile Include="..\src\System\Threading\StackHelper.cs" Link="Production\StackHelper.cs" />
-    <Compile Include="..\src\System\Text\RegularExpressions\RegexBoyerMoore.cs" Link="Production\RegexBoyerMoore.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.cs" Link="Production\RegexCharClass.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" Link="Production\RegexCharClass.MappingTable.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexCode.cs" Link="Production\RegexCode.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexFindOptimizations.cs" Link="Production\RegexFindOptimizations.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexNode.cs" Link="Production\RegexNode.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexOptions.cs" Link="Production\RegexOptions.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexParseError.cs" Link="Production\RegexParseError.cs" />

diff --git a/src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj b/src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj
@@ -23,11 +23,11 @@
     <Compile Include="System\Text\RegularExpressions\Regex.Replace.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Split.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Timeout.cs" />
-    <Compile Include="System\Text\RegularExpressions\RegexBoyerMoore.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCharClass.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCode.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexFindOptimizations.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexGeneratorAttribute.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexInterpreter.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexMatchTimeoutException.cs" />
@@ -100,6 +100,7 @@
     <Reference Include="System.Memory" />
     <Reference Include="System.Runtime" />
     <Reference Include="System.Runtime.Extensions" />
+    <Reference Include="System.Runtime.InteropServices" />
     <Reference Include="System.Threading" />
     <!-- References required for RegexOptions.Compiled -->
     <Reference Include="System.Reflection.Emit" />

diff --git a/...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Cache.cs b/...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Cache.cs
@@ -128,7 +128,7 @@ public static Regex GetOrAdd(string pattern, RegexOptions options, TimeSpan matc
             Regex.ValidateOptions(options);
             Regex.ValidateMatchTimeout(matchTimeout);
 
-            CultureInfo culture = (options & RegexOptions.CultureInvariant) != 0 ? CultureInfo.InvariantCulture : CultureInfo.CurrentCulture;
+            CultureInfo culture = RegexParser.GetTargetCulture(options);
             Key key = new Key(pattern, culture.ToString(), options, matchTimeout);
 
             Regex? regex = Get(key);

diff --git a/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs b/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
@@ -65,12 +65,12 @@ internal Regex(string pattern, CultureInfo? culture)
             // Call Init directly rather than delegating to a Regex ctor that takes
             // options to enable linking / tree shaking to remove the Regex compiler
             // and NonBacktracking implementation if it's not used.
-            Init(pattern, RegexOptions.None, s_defaultMatchTimeout, culture);
+            Init(pattern, RegexOptions.None, s_defaultMatchTimeout, culture ?? CultureInfo.CurrentCulture);
         }
 
         internal Regex(string pattern, RegexOptions options, TimeSpan matchTimeout, CultureInfo? culture)
         {
-            culture ??= GetTargetCulture(options);
+            culture ??= RegexParser.GetTargetCulture(options);
             Init(pattern, options, matchTimeout, culture);
 
             if ((options & RegexOptions.NonBacktracking) != 0)
@@ -87,18 +87,14 @@ internal Regex(string pattern, RegexOptions options, TimeSpan matchTimeout, Cult
             }
         }
 
-        /// <summary>Gets the culture to use based on the specified options.</summary>
-        private static CultureInfo GetTargetCulture(RegexOptions options) =>
-            (options & RegexOptions.CultureInvariant) != 0 ? CultureInfo.InvariantCulture : CultureInfo.CurrentCulture;
-
         /// <summary>Initializes the instance.</summary>
         /// <remarks>
         /// This is separated out of the constructor so that an app only using 'new Regex(pattern)'
         /// rather than 'new Regex(pattern, options)' can avoid statically referencing the Regex
         /// compiler, such that a tree shaker / linker can trim it away if it's not otherwise used.
         /// </remarks>
         [MemberNotNull(nameof(_code))]
-        private void Init(string pattern, RegexOptions options, TimeSpan matchTimeout, CultureInfo? culture)
+        private void Init(string pattern, RegexOptions options, TimeSpan matchTimeout, CultureInfo culture)
         {
             ValidatePattern(pattern);
             ValidateOptions(options);
@@ -107,7 +103,6 @@ private void Init(string pattern, RegexOptions options, TimeSpan matchTimeout, C
             this.pattern = pattern;
             internalMatchTimeout = matchTimeout;
             roptions = options;
-            culture ??= GetTargetCulture(options);
 
 #if DEBUG
             if (IsDebug)
@@ -121,7 +116,7 @@ private void Init(string pattern, RegexOptions options, TimeSpan matchTimeout, C
 
             // Generate the RegexCode from the node tree.  This is required for interpreting,
             // and is used as input into RegexOptions.Compiled and RegexOptions.NonBacktracking.
-            _code = RegexWriter.Write(tree);
+            _code = RegexWriter.Write(tree, culture);
 
             if ((options & RegexOptions.NonBacktracking) != 0)
             {
@@ -434,7 +429,7 @@ internal void Run<TState>(string input, int startat, ref TState state, MatchCall
         /// <summary>Creates a new runner instance.</summary>
         private RegexRunner CreateRunner() =>
             factory?.CreateInstance() ??
-            new RegexInterpreter(_code!, GetTargetCulture(roptions));
+            new RegexInterpreter(_code!, RegexParser.GetTargetCulture(roptions));
 
         /// <summary>True if the <see cref="RegexOptions.Compiled"/> option was set.</summary>
         protected bool UseOptionC() => (roptions & RegexOptions.Compiled) != 0;