
Factor out and improve the vectorization of RegexInterpreter.FindFirstChar #61490

Merged · 6 commits · Nov 17, 2021

Commits on Nov 17, 2021

  1. Factor out and improve the vectorization of RegexInterpreter.FindFirstChar
    
    This change started with the "simple" goal of factoring out the FindFirstChar logic from RegexInterpreter and consuming it in SymbolicRegexMatcher.  The existing engines use FindFirstChar to quickly skip ahead to the next location that might possibly match, at which point they fall back to analyzing the whole pattern at that location.  SymbolicRegexMatcher (used by RegexOptions.NonBacktracking) had its own implementation for this, which it used any time it entered a start state.  This required non-trivial additional code to maintain, and there's no good reason it should be separate from the other engines.
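
    (Purely for illustration, here's a minimal sketch of that skip-ahead idea, with hypothetical names and a trivial stand-in "pattern"; the real engines are far more involved.)

    ```csharp
    using System;

    static class SkipAheadSketch
    {
        // Hypothetical stand-in for running the full engine at a candidate position.
        static bool TryMatchAt(ReadOnlySpan<char> input, int pos) =>
            input.Slice(pos).StartsWith("abc".AsSpan());

        // Skip ahead (vectorized) to the next position that could possibly start a
        // match, and only run the full matcher there.
        public static int FindNextMatch(ReadOnlySpan<char> input, int pos)
        {
            while (pos < input.Length)
            {
                int i = input.Slice(pos).IndexOf('a'); // every match must start with 'a'
                if (i < 0)
                {
                    return -1; // no possible starting position remains
                }

                pos += i;
                if (TryMatchAt(input, pos))
                {
                    return pos; // full match attempt succeeded here
                }

                pos++; // candidate failed; keep scanning
            }

            return -1;
        }
    }
    ```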
    
    However, what started out as a simple change grew due to regressions that resulted from differences in the implementations.  In particular, SymbolicRegexMatcher already works off of precomputed equivalence tables for casing, which gives it very different characteristics in this regard from the existing engines.  For example, SymbolicRegexMatcher's existing "skip ahead to the next possible match start location" logic already evaluated all the characters that could possibly start a match, which included variations of the same character when using IgnoreCase, but the existing RegexInterpreter logic didn't.  That discrepancy then results in a significant IgnoreCase regression for NonBacktracking due to losing the ability to use a vectorized search for the next starting location.  We already plan to shift the existing engines over to a plan where all of these equivalences are computed at construction time rather than using ToLower at both construction time and match time, so this PR takes some steps in that direction, doing so for most of ASCII.  This has added some temporary cruft, which we'll be able to delete once we fully shift the implementations over (which we should do in the near future).
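
    (As a rough sketch of the construction-time direction described here, limited to ASCII letters; this is my simplification, not the actual equivalence tables.)

    ```csharp
    // Expand a case-insensitive pattern character into the set of characters it can
    // match, once at construction time, instead of lowercasing input at match time.
    static char[] GetAsciiCaseEquivalences(char c)
    {
        if (c >= 'a' && c <= 'z') return new[] { c, (char)(c - 0x20) }; // 'e' -> { 'e', 'E' }
        if (c >= 'A' && c <= 'Z') return new[] { c, (char)(c + 0x20) }; // 'E' -> { 'E', 'e' }
        return new[] { c }; // everything else (in this sketch) maps to itself
    }
    ```

    The resulting sets can then be handed directly to a vectorized search for "any character that could start a match".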
    
    Another difference was that SymbolicRegexMatcher enabled the use of IndexOfAny for up to 5 characters, whereas RegexOptions.Compiled only did so for up to 3 characters, and RegexInterpreter didn't use it at all.  The PR now uses 5 everywhere.
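
    (The vectorized primitive in question is MemoryExtensions.IndexOfAny; a trivial usage example, unrelated to any particular pattern:)

    ```csharp
    ReadOnlySpan<char> input = "no digits here until 12345".AsSpan();

    // Single vectorized pass over the input for any of up to five candidate characters.
    int index = input.IndexOfAny("12345".AsSpan());
    Console.WriteLine(index); // prints 21
    ```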
    
    However, the more characters involved, the more overhead there is to IndexOfAny, and for some inputs, the higher the chances are that IndexOfAny will find a match sooner, which means its overhead compounds more.  To help with that, we now compute not only the possible characters that might match at the beginning of the pattern, but also the characters that might match at a fixed offset from the beginning (e.g. in \d{3}-\d{2}-\d{4}, it will find the '-' at offset 3, vectorize a search for that, and then back off by the relevant distance).  That also means we might end up with multiple sets to choose from when searching, and this PR borrows an idea from Rust, which is to use some rough frequency analysis to determine which set should be targeted.  It's not perfect, and we can update the texts used to seed the analysis (right now I based it primarily on *.cs files in dotnet/runtime and some Project Gutenberg texts), but it's good enough for these purposes for now.
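
    (A sketch of that fixed-offset trick for \d{3}-\d{2}-\d{4}, using a hypothetical helper and ignoring the frequency analysis:)

    ```csharp
    // A match of \d{3}-\d{2}-\d{4} must have '-' at offset 3 from its start, so
    // vectorize a search for '-' and back off by 3 to get the candidate start.
    static int FindCandidateStart(ReadOnlySpan<char> input, int pos)
    {
        const int Offset = 3; // distance of '-' from the start of any match

        if (pos + Offset > input.Length)
        {
            return -1;
        }

        int i = input.Slice(pos + Offset).IndexOf('-');
        if (i < 0)
        {
            return -1;
        }

        // '-' was found at absolute index (pos + Offset + i); back off by Offset.
        return pos + i;
    }
    ```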
    
    We'd previously switched to using IndexOf for a case-sensitive prefix string, but were still using Boyer-Moore for case-insensitive ones.  Now that we're able to also vectorize a search for case-insensitive values (right now just ASCII letters, but that'll be fixed soon), we can get rid of Boyer-Moore entirely.  This saves all the costs associated with constructing the Boyer-Moore tables and also avoids having to generate the Boyer-Moore implementations in RegexOptions.Compiled and the source generator.
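
    (As an illustration of why Boyer-Moore is no longer needed, not the code the engines actually emit: a case-insensitive prefix can be found by vectorizing a search for the case variants of its first character and then verifying the rest with an ordinal ignore-case comparison.)

    ```csharp
    // Sketch: case-insensitive search for the prefix "abc" without Boyer-Moore.
    static int IndexOfPrefixIgnoreCase(ReadOnlySpan<char> input)
    {
        int pos = 0;
        while (pos <= input.Length - 3)
        {
            int i = input.Slice(pos).IndexOfAny("Aa".AsSpan()); // vectorized first-char scan
            if (i < 0)
            {
                return -1;
            }

            pos += i;
            if (pos > input.Length - 3)
            {
                return -1;
            }

            if (input.Slice(pos, 3).Equals("abc".AsSpan(), StringComparison.OrdinalIgnoreCase))
            {
                return pos;
            }

            pos++;
        }

        return -1;
    }
    ```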
    
    The casing change also defeated some other optimizations already present.  For example, in .NET 5 we added an optimization whereby an alternation like `abcef|abcgh` would be transformed into `abc(?:ef|gh)`, and that applied whether the pattern was case-sensitive or case-insensitive.  But now that a case-insensitive expression is transformed at construction into `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]`, that optimization was defeated.  I've added a new optimization pass for alternations that will detect common prefixes even if they're sets.
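
    (So, for example, the new pass can take `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]` and pull the shared prefix back out into something equivalent to `[Aa][Bb][Cc](?:[Ee][Ff]|[Gg][Hh])`.)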
    
    The casing change also revealed some cosmetic issues.  As part of the change, when we encounter a "multi" (a multi-character string in the pattern), we convert that single case-insensitive RegexNode to instead be one case-sensitive RegexNode per character, with a set for all the equivalent characters that can match.  This then defeats some of the nice formatting we had for multis in the source generator, so as part of this change, the source generator has been augmented to output nicer code for concatenations.  And because sets like [Ee] are now way more common (since e.g. a case-insensitive 'e' will be transformed into such a set), we also special-case that in both the source generator and RegexOptions.Compiled, to spit out the equivalent of `(c | 0x20) == 'e'` rather than `(c == 'E' | c == 'e')`.
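
    (The trick relies on ASCII uppercase and lowercase letters differing only in the 0x20 bit; a tiny stand-alone sketch:)

    ```csharp
    // ASCII 'E' is 0x45 and 'e' is 0x65; they differ only in the 0x20 bit, so OR-ing
    // that bit in folds an ASCII letter to its lowercase form in one operation.
    static bool IsEIgnoreCase(char c) => (c | 0x20) == 'e';

    // Equivalent to, but cheaper than, the two-comparison form:
    static bool IsEIgnoreCaseVerbose(char c) => c == 'E' || c == 'e';
    ```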
    
    Along the way, I cleaned up a few things as well, such as passing around a CultureInfo more rather than repeatedly calling CultureInfo.CurrentCulture, using CollectionsMarshal.GetValueRefOrAddDefault on a hot path to do with interning strings in a lookup table, tweaking SymbolicRegexRunnerFactory's Runner to itself be generic to avoid an extra layer of virtual dispatch per operation, and cleaning up code / comments in SymbolicRegexMatcher along the way.
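
    (For reference, the hot-path dictionary pattern mentioned here looks roughly like this; a contrived counting example rather than the actual interning code:)

    ```csharp
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    var counts = new Dictionary<string, int>();

    // One hash/lookup per key: get a ref to the (possibly just-added) value and
    // update it in place, instead of separate TryGetValue and Add calls.
    ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, "abc", out bool existed);
    count++;
    ```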
    
    For the most part the purpose of the change wasn't to improve perf, and in fact I was willing to accept some regressions in the name of consolidation.  There are a few regressions here, mostly small, and mostly for cases where we're simply paying the overhead of vectorization, e.g. where the current starting position would have matched anyway, or where the character being searched for is so frequent that the vectorized scan doesn't get to skip much.  Overall, though, there are some substantial improvements.
    stephentoub committed Nov 17, 2021 · fdbffd2
  2. 61e3cef
  3. 6679536
  4. Delete more now dead code

    stephentoub committed Nov 17, 2021 · 502b505
  5. Fix dead code emitting after refactoring

    Previously, return statements while emitting anchors short-circuited the rest of the emitting code, but when I moved that code into a helper, the returns stopped having that effect, such that we'd emit a return statement and then emit dead code after it.  Fix it.
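
    (A hypothetical illustration of the pitfall, not the actual generator code: once the anchor logic lives in a helper, its `return` only exits the helper, so the caller has to be told whether to stop emitting.)

    ```csharp
    using System.Text;

    static class EmitterSketch
    {
        // Returns true if it emitted code that ends the generated method, so the
        // caller knows not to emit anything (dead) after it.
        static bool TryEmitAnchorShortCircuit(StringBuilder code)
        {
            code.AppendLine("        return false; // anchor can never match here");
            return true;
        }

        public static void EmitFindFirstChar(StringBuilder code)
        {
            code.AppendLine("    bool FindFirstChar()");
            code.AppendLine("    {");
            if (TryEmitAnchorShortCircuit(code))
            {
                code.AppendLine("    }");
                return; // previously an inline `return` at this point did this implicitly
            }
            code.AppendLine("        // ... emit the actual scanning logic ...");
            code.AppendLine("    }");
        }
    }
    ```
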
    stephentoub committed Nov 17, 2021 · 51fcda3
  6. Remove some now dead code

    stephentoub committed Nov 17, 2021 · 6ce524f