forked from dotnet/runtime
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Improve alternation switch optimization in regex source generator (dotnet#98723)

The regex source generator has an optimization that tries to emit a switch statement to handle an alternation. If it can prove that an alternation is atomic, either because of a surrounding construct (like an atomic group) or because nothing in the alternation itself might backtrack (as a loop in one of the branches could), and if it can prove that none of the branches overlap on the first character they must match (because every branch always begins with a different character from the others), then it can emit a switch over the first character required by each branch.

Today, the analysis that leads to this optimization being used only considers branches that start with a specific character (RegexNodeKind.One), a set (RegexNodeKind.Set), a string (RegexNodeKind.Multi), or a concatenation that begins with one of those. Anything else gets knocked off the optimized switch path.

With this PR, the analysis instead allows those One/Set/Multi constructs to be the first non-zero-width construct matched in the branch, but not necessarily the first node, e.g. the branch could be a capture around one of these nodes, or a loop of one of these with a minimum iteration count of at least 1.

This PR also adds support for not just individual chars or sets, but loops of them (normal, lazy, or atomic), again as long as they have a minimum iteration count of 1. This in particular helps with duplicate characters in a row, as earlier optimizations will likely have condensed them into repeaters represented as loops with equal min and max counts.

This PR also makes one more tweak, which is that the sets supported may now be larger. Previously the code allowed a set to expand to at most 5 characters, an arbitrary limit set primarily to support ignore-case (which would typically result in sets of 2 or 3 characters). But this ignores the fact that previous optimizations may combine sets for a variety of reasons, e.g. an alternation where one branch contains 's' and the next contains 't' would be combined into a single branch for [st]. This limit has now been increased significantly, with little downside; the main limitation is stack consumption, and the new limit is well within typical stackallocs we use ourselves.
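To make the first-character analysis described above concrete, here is a minimal toy model in Python. It is a sketch, not the source generator's code: the node classes, `first_chars`, `can_use_switch`, and the `max_set_chars` parameter are hypothetical names invented for illustration, sets are modeled as explicit character collections, and the sketch neither checks atomicity nor skips zero-width nodes at the start of a concatenation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class One:        # a single specific character (cf. RegexNodeKind.One)
    ch: str


@dataclass(frozen=True)
class Set:        # a character class, modeled as its explicit members
    chars: frozenset


@dataclass(frozen=True)
class Multi:      # a literal string (cf. RegexNodeKind.Multi)
    text: str


@dataclass(frozen=True)
class Capture:    # a capture group wrapping another node
    child: object


@dataclass(frozen=True)
class Loop:       # greedy, lazy, and atomic loops are treated alike here
    child: object
    min: int


@dataclass(frozen=True)
class Concat:     # a sequence of nodes
    children: tuple


def first_chars(node, max_set_chars):
    """Return the characters a branch can start with, or None if unknown."""
    if isinstance(node, One):
        return frozenset({node.ch})
    if isinstance(node, Multi):
        return frozenset({node.text[0]}) if node.text else None
    if isinstance(node, Set):
        # Sets that expand past the limit are rejected.
        return node.chars if 0 < len(node.chars) <= max_set_chars else None
    if isinstance(node, Capture):
        return first_chars(node.child, max_set_chars)
    if isinstance(node, Loop):
        # A loop that may iterate zero times guarantees no first character.
        return first_chars(node.child, max_set_chars) if node.min >= 1 else None
    if isinstance(node, Concat):
        if not node.children:
            return None
        return first_chars(node.children[0], max_set_chars)
    return None   # any other construct falls off the switch path


def can_use_switch(branches, max_set_chars):
    """True if every branch has a known first-char set and none overlap."""
    seen = set()
    for branch in branches:
        chars = first_chars(branch, max_set_chars)
        if chars is None or not seen.isdisjoint(chars):
            return False
        seen |= chars
    return True


# Roughly a{3}|(b)c|[st]x : a repeater, a capture, and a combined set.
branches = [
    Loop(One("a"), min=3),
    Concat((Capture(One("b")), One("c"))),
    Concat((Set(frozenset("st")), One("x"))),
]
print(can_use_switch(branches, max_set_chars=5))
```

In this model, the three branches start with 'a', 'b', and one of 's'/'t' respectively, so a switch over the first character is possible. Changing the first branch to `Loop(One("a"), min=0)` makes the check fail, because an optional loop does not guarantee a first character. The `max_set_chars` argument stands in for the set-size limit the commit message mentions; 5 is the old limit stated there, and any larger value passed in is purely illustrative.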
1 parent ab88861, commit 061d4df
Showing 2 changed files with 87 additions and 54 deletions.