
Factor out and improve the vectorization of RegexInterpreter.FindFirstChar #61490

Merged · 6 commits · Nov 17, 2021

Commits on Nov 17, 2021

  1. Factor out and improve the vectorization of RegexInterpreter.FindFirstChar
    
    This change started with the "simple" goal of factoring out the FindFirstChar logic from RegexInterpreter and consuming it in SymbolicRegexMatcher.  The existing engines use FindFirstChar to quickly skip ahead to the next location that might possibly match, at which point they fall back to analyzing the whole pattern at that location.  SymbolicRegexMatcher (used by RegexOptions.NonBacktracking) had its own implementation for this, which it used any time it entered a start state.  This required non-trivial additional code to maintain, and there's no good reason it should be separate from the other engines.
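
    (Purely for illustration, here's a minimal sketch of that skip-ahead idea, with hypothetical names and a trivial stand-in "pattern"; the real engines are far more involved.)

    ```csharp
    using System;

    static class SkipAheadSketch
    {
        // Hypothetical stand-in for running the full engine at a candidate position.
        static bool TryMatchAt(ReadOnlySpan<char> input, int pos) =>
            input.Slice(pos).StartsWith("abc".AsSpan());

        // Skip ahead (vectorized) to the next position that could possibly start a
        // match, and only run the full matcher there.
        public static int FindNextMatch(ReadOnlySpan<char> input, int pos)
        {
            while (pos < input.Length)
            {
                int i = input.Slice(pos).IndexOf('a'); // every match must start with 'a'
                if (i < 0)
                {
                    return -1; // no possible starting position remains
                }

                pos += i;
                if (TryMatchAt(input, pos))
                {
                    return pos; // full match attempt succeeded here
                }

                pos++; // candidate failed; keep scanning
            }

            return -1;
        }
    }
    ```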
    
    However, what started out as a simple change grew due to regressions that resulted from differences in the implementations.  In particular, SymbolicRegexMatcher already works off of precomputed equivalence tables for casing, which gives it very different characteristics in this regard from the existing engines.  For example, SymbolicRegexMatcher's existing "skip ahead to the next possible match start location" logic already evaluated all the characters that could possibly start a match, which included variations of the same character when using IgnoreCase, but the existing RegexInterpreter logic didn't.  That discrepancy then results in a significant IgnoreCase regression for NonBacktracking due to losing the ability to use a vectorized search for the next starting location.  We already plan to shift the existing engines over to a plan where all of these equivalences are computed at construction time rather than using ToLower at both construction time and match time, so this PR takes some steps in that direction, doing so for most of ASCII.  This has added some temporary cruft, which we'll be able to delete once we fully shift the implementations over (which we should do in the near future).
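
    (As a rough sketch of the construction-time direction described here, limited to ASCII letters; this is my simplification, not the actual equivalence tables.)

    ```csharp
    // Expand a case-insensitive pattern character into the set of characters it can
    // match, once at construction time, instead of lowercasing input at match time.
    static char[] GetAsciiCaseEquivalences(char c)
    {
        if (c >= 'a' && c <= 'z') return new[] { c, (char)(c - 0x20) }; // 'e' -> { 'e', 'E' }
        if (c >= 'A' && c <= 'Z') return new[] { c, (char)(c + 0x20) }; // 'E' -> { 'E', 'e' }
        return new[] { c }; // everything else (in this sketch) maps to itself
    }
    ```

    The resulting sets can then be handed directly to a vectorized search for "any character that could start a match".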
    
    Another difference was that SymbolicRegexMatcher enabled the use of IndexOfAny for up to 5 characters, whereas RegexOptions.Compiled only did so for up to 3 characters, and RegexInterpreter didn't use it at all.  The PR now uses 5 everywhere.
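
    (The vectorized primitive in question is MemoryExtensions.IndexOfAny; a trivial usage example, unrelated to any particular pattern:)

    ```csharp
    ReadOnlySpan<char> input = "no digits here until 12345".AsSpan();

    // Single vectorized pass over the input for any of up to five candidate characters.
    int index = input.IndexOfAny("12345".AsSpan());
    Console.WriteLine(index); // prints 21
    ```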
    
    However, the more characters involved, the more overhead there is to IndexOfAny, and for some inputs, the higher the chances are that IndexOfAny will find a match sooner, which means its overhead compounds more.  To help with that, we now compute not only the possible characters that might match at the beginning of the pattern, but also the characters that might match at a fixed offset from the beginning (e.g. in \d{3}-\d{2}-\d{4}, it will find the '-' at offset 3, vectorize a search for that, and then back off by the relevant distance).  That also means we might end up with multiple sets to choose from when searching, and this PR borrows an idea from Rust, which is to use some rough frequency analysis to determine which set should be targeted.  It's not perfect, and we can update the texts used to seed the analysis (right now I based it primarily on *.cs files in dotnet/runtime and some Project Gutenberg texts), but it's good enough for these purposes for now.
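
    (A sketch of that fixed-offset trick for \d{3}-\d{2}-\d{4}, using a hypothetical helper and ignoring the frequency analysis:)

    ```csharp
    // A match of \d{3}-\d{2}-\d{4} must have '-' at offset 3 from its start, so
    // vectorize a search for '-' and back off by 3 to get the candidate start.
    static int FindCandidateStart(ReadOnlySpan<char> input, int pos)
    {
        const int Offset = 3; // distance of '-' from the start of any match

        if (pos + Offset > input.Length)
        {
            return -1;
        }

        int i = input.Slice(pos + Offset).IndexOf('-');
        if (i < 0)
        {
            return -1;
        }

        // '-' was found at absolute index (pos + Offset + i); back off by Offset.
        return pos + i;
    }
    ```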
    
    We'd previously switched to using IndexOf for a case-sensitive prefix string, but were still using Boyer-Moore for case-insensitive ones.  Now that we're able to also vectorize a search for case-insensitive values (right now just ASCII letters, but that'll be fixed soon), we can get rid of Boyer-Moore entirely.  This saves all the costs associated with constructing the Boyer-Moore tables and also avoids having to generate the Boyer-Moore implementations in RegexOptions.Compiled and the source generator.
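
    (As an illustration of why Boyer-Moore is no longer needed, not the code the engines actually emit: a case-insensitive prefix can be found by vectorizing a search for the case variants of its first character and then verifying the rest with an ordinal ignore-case comparison.)

    ```csharp
    // Sketch: case-insensitive search for the prefix "abc" without Boyer-Moore.
    static int IndexOfPrefixIgnoreCase(ReadOnlySpan<char> input)
    {
        int pos = 0;
        while (pos <= input.Length - 3)
        {
            int i = input.Slice(pos).IndexOfAny("Aa".AsSpan()); // vectorized first-char scan
            if (i < 0)
            {
                return -1;
            }

            pos += i;
            if (pos > input.Length - 3)
            {
                return -1;
            }

            if (input.Slice(pos, 3).Equals("abc".AsSpan(), StringComparison.OrdinalIgnoreCase))
            {
                return pos;
            }

            pos++;
        }

        return -1;
    }
    ```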
    
    The casing change also defeated some other optimizations already present.  For example, in .NET 5 we added an optimization whereby an alternation like `abcef|abcgh` would be transformed into `abc(?:ef|gh)`, and that applied whether the pattern was case-sensitive or case-insensitive.  But now that a case-insensitive expression is transformed at construction into `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]`, that optimization was defeated.  I've added a new optimization pass for alternations that will detect common prefixes even if they're sets.
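
    (So, for example, the new pass can take `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]` and pull the shared prefix back out into something equivalent to `[Aa][Bb][Cc](?:[Ee][Ff]|[Gg][Hh])`.)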
    
    The casing change also revealed some cosmetic issues.  As part of the change, when we encounter a "multi" (a multi-character string in the pattern), we convert that single case-insensitive RegexNode to instead be one case-sensitive RegexNode per character, with a set for all the equivalent characters that can match.  This then defeats some of the nice formatting we had for multis in the source generator, so as part of this change, the source generator has been augmented to output nicer code for concatenations.  And because sets like [Ee] are now way more common (since e.g. a case-insensitive 'e' will be transformed into such a set), we also special-case that in both the source generator and RegexOptions.Compiled, to spit out the equivalent of `(c | 0x20) == 'e'` rather than `(c == 'E' | c == 'e')`.
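
    (The trick relies on ASCII uppercase and lowercase letters differing only in the 0x20 bit; a tiny stand-alone sketch:)

    ```csharp
    // ASCII 'E' is 0x45 and 'e' is 0x65; they differ only in the 0x20 bit, so OR-ing
    // that bit in folds an ASCII letter to its lowercase form in one operation.
    static bool IsEIgnoreCase(char c) => (c | 0x20) == 'e';

    // Equivalent to, but cheaper than, the two-comparison form:
    static bool IsEIgnoreCaseVerbose(char c) => c == 'E' || c == 'e';
    ```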
    
    Along the way, I cleaned up a few things as well, such as passing around a CultureInfo more rather than repeatedly calling CultureInfo.CurrentCulture, using CollectionsMarshal.GetValueRefOrAddDefault on a hot path to do with interning strings in a lookup table, tweaking SymbolicRegexRunnerFactory's Runner to itself be generic to avoid an extra layer of virtual dispatch per operation, and cleaning up code / comments in SymbolicRegexMatcher along the way.
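
    (For reference, the hot-path dictionary pattern mentioned here looks roughly like this; a contrived counting example rather than the actual interning code:)

    ```csharp
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    var counts = new Dictionary<string, int>();

    // One hash/lookup per key: get a ref to the (possibly just-added) value and
    // update it in place, instead of separate TryGetValue and Add calls.
    ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, "abc", out bool existed);
    count++;
    ```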
    
    For the most part the purpose of the change wasn't to improve perf, and in fact I was willing to accept some regressions in the name of consolidation.  There are a few regressions here, mostly small, and mostly for cases where we're simply paying the overhead of vectorization, e.g. where the current starting position would have matched anyway, or where the character being searched for is so frequent that the vectorized scan doesn't get to skip much.  Overall, though, there are some substantial improvements.
    stephentoub committed Nov 17, 2021 · fdbffd2
  2. 61e3cef
  3. 6679536
  4. Delete more now dead code

    stephentoub committed Nov 17, 2021 · 502b505
  5. Fix dead code emitting after refactoring

    Previously, return statements while emitting anchors short-circuited the rest of the emitting code, but when I moved that code into a helper, the returns stopped having that effect, such that we'd emit a return statement and then emit dead code after it.  Fix it.
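
    (A hypothetical illustration of the pitfall, not the actual generator code: once the anchor logic lives in a helper, its `return` only exits the helper, so the caller has to be told whether to stop emitting.)

    ```csharp
    using System.Text;

    static class EmitterSketch
    {
        // Returns true if it emitted code that ends the generated method, so the
        // caller knows not to emit anything (dead) after it.
        static bool TryEmitAnchorShortCircuit(StringBuilder code)
        {
            code.AppendLine("        return false; // anchor can never match here");
            return true;
        }

        public static void EmitFindFirstChar(StringBuilder code)
        {
            code.AppendLine("    bool FindFirstChar()");
            code.AppendLine("    {");
            if (TryEmitAnchorShortCircuit(code))
            {
                code.AppendLine("    }");
                return; // previously an inline `return` at this point did this implicitly
            }
            code.AppendLine("        // ... emit the actual scanning logic ...");
            code.AppendLine("    }");
        }
    }
    ```
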
    stephentoub committed Nov 17, 2021 · 51fcda3
  6. Remove some now dead code

    stephentoub committed Nov 17, 2021 · 6ce524f