Extend RegexCharClass.Canonicalize range inversion optimization (#61562)
* Extend RegexCharClass.Canonicalize range inversion optimization

There's a simple optimization in RegexCharClass.Canonicalize, added in .NET 5, that takes a set made up of exactly two ranges and checks whether those ranges leave out exactly one character.  If they do, the set can instead be rewritten as that character negated, which is a normalized form that downstream code recognizes and optimizes.  We can extend this normalization slightly so that it applies to two ranges separated not just by a single character but by a longer run of characters as well.

* Update TODO comment

* Add some more reduction tests
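
The extended normalization described above can be sketched in isolation. This is a minimal illustration, not the runtime's code: `RangeInversionSketch`, `TryInvert`, and the tuple-based range list are hypothetical stand-ins for the internal `RegexCharClass` state, and the sketch assumes the ranges are already sorted and merged, as they are by the time Canonicalize reaches this step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the internal range list; not the real RegexCharClass.
public static class RangeInversionSketch
{
    private const char LastChar = '\uFFFF';

    // If `ranges` is exactly [\0-x][y-\uFFFF] with at least one character missing
    // between x and y, rewrite it in place to the single missing range and report
    // that the set must now be read as negated.
    public static bool TryInvert(List<(char First, char Last)> ranges, out bool negate)
    {
        negate = false;

        if (ranges.Count == 2 &&
            ranges[0].First == 0 &&
            ranges[1].Last == LastChar &&
            ranges[0].Last < ranges[1].First - 1)
        {
            // The gap runs from just after the first range to just before the second.
            var gap = ((char)(ranges[0].Last + 1), (char)(ranges[1].First - 1));
            ranges.RemoveAt(1);
            ranges[0] = gap;
            negate = true;
            return true;
        }

        return false;
    }
}
```

For example, the ranges `\0`-`A` and `F`-`\uFFFF` collapse to the single range `B`-`E` with the negate flag set, which corresponds to `[^B-E]`. The single-character case from .NET 5 is the special case where the gap has length one.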
stephentoub authored Nov 17, 2021
1 parent d1b3816 commit 44d28bf
Showing 2 changed files with 21 additions and 8 deletions.
@@ -1390,23 +1390,29 @@ private void Canonicalize(bool isNonBacktracking)
                 rangelist.RemoveRange(j, rangelist.Count - j);
             }

-            // If the class now represents a single negated character, but does so by including every
-            // other character, invert it to produce a normalized form recognized by IsSingletonInverse.
-            if (!isNonBacktracking && // do not produce the IsSingletonInverse transformation in NonBacktracking mode
+            // If the class now represents a single negated range, but does so by including every
+            // other character, invert it to produce a normalized form with a single range. This
+            // is valuable for subsequent optimizations in most of the engines.
+            // TODO: https://github.com/dotnet/runtime/issues/61048. The special-casing for NonBacktracking
+            // can be deleted once this issue is addressed. The special-casing exists because NonBacktracking
+            // is on a different casing plan than the other engines and doesn't use ToLower on each input
+            // character at match time; this in turn can highlight differences between sets and their inverted
+            // versions of themselves, e.g. a difference between [0-AC-\uFFFF] and [^B].
+            if (!isNonBacktracking &&
                 !_negate &&
                 _subtractor is null &&
                 (_categories is null || _categories.Length == 0))
             {
                 if (rangelist.Count == 2)
                 {
-                    // There are two ranges in the list. See if there's one missing element between them.
+                    // There are two ranges in the list. See if there's one missing range between them.
+                    // Such a range might be as small as a single character.
                     if (rangelist[0].First == 0 &&
-                        rangelist[0].Last == (char)(rangelist[1].First - 2) &&
-                        rangelist[1].Last == LastChar)
+                        rangelist[1].Last == LastChar &&
+                        rangelist[0].Last < rangelist[1].First - 1)
                     {
-                        char ch = (char)(rangelist[0].Last + 1);
+                        rangelist[0] = new SingleRange((char)(rangelist[0].Last + 1), (char)(rangelist[1].First - 1));
                         rangelist.RemoveAt(1);
-                        rangelist[0] = new SingleRange(ch, ch);
                         _negate = true;
                     }
                 }
@@ -324,6 +324,13 @@ private static int GetMinRequiredLength(Regex r)
         [InlineData("[^\n]*", ".*")]
         [InlineData("(?>[^\n]*)", "(?>.*)")]
         [InlineData("[^\n]*?", ".*?")]
+        // Set reduction
+        [InlineData("[\u0001-\uFFFF]", "[^\u0000]")]
+        [InlineData("[\u0000-\uFFFE]", "[^\uFFFF]")]
+        [InlineData("[\u0000-AB-\uFFFF]", "[\u0000-\uFFFF]")]
+        [InlineData("[ABC-EG-J]", "[A-EG-J]")]
+        [InlineData("[\u0000-AC-\uFFFF]", "[^B]")]
+        [InlineData("[\u0000-AF-\uFFFF]", "[^B-E]")]
         // Large loop patterns
         [InlineData("a*a*a*a*a*a*a*b*b*?a+a*", "a*b*b*?a+")]
         [InlineData("a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "a{0,30}aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")]
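
The set reduction pairs above are compared by the test as parsed trees, which needs access to internals. As a rough external sanity check, the same pairs can be compared by behavior through the public `Regex` API. The sketch below is an assumption about how one might do that, not part of the commit: `SetReductionCheck` and `SameSingleCharBehavior` are hypothetical names, and the check only confirms that both sets accept the same single UTF-16 code units, not that the internal representations are identical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; compares two set patterns by behavior rather than by parsed tree.
public static class SetReductionCheck
{
    // Returns true when both patterns agree on every one-character input,
    // covering all 65,536 UTF-16 code units.
    public static bool SameSingleCharBehavior(string pattern, string expected)
    {
        // Anchor with \A and \z so the set must match the entire one-character input.
        var a = new Regex(@"\A(?:" + pattern + @")\z");
        var b = new Regex(@"\A(?:" + expected + @")\z");

        for (int i = 0; i <= char.MaxValue; i++)
        {
            string input = ((char)i).ToString();
            if (a.IsMatch(input) != b.IsMatch(input))
            {
                return false;
            }
        }

        return true;
    }
}
```

Calling `SetReductionCheck.SameSingleCharBehavior("[\u0000-AF-\uFFFF]", "[^B-E]")` should report that the two sets agree, while pairing the same input with `"[^B]"` should not, since `C`, `D`, and `E` would differ.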