WIP: Explore using Aho-Corasick for specific patterns in Regex #45697
Tagging subscribers to this area: @eerhardt, @pgovind
@@ -684,6 +684,7 @@ public static bool TryGetSingleRange(string set, out char lowInclusive, out char
    return false;
}

// Possible place that can be optimized with an IndexSet. Or the return can be modified to use RegexCharClass
Ignore the changes in this file. They are just notes for me
if (!_caseInsensitive)
{
    while (c != 0)
    if (!str.AsSpan().SequenceEqual(runtext.AsSpan(pos - c, c)))
We should see some benefit here from the vectorization work that went into string
there's another one on line 302.
is there a utility to compare spans ignoring case? we have several of those loops too.
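On comparing spans while ignoring case: a minimal sketch, assuming ordinal case-insensitivity is acceptable. `MemoryExtensions.Equals` accepts a `StringComparison`, so one helper can cover both paths. The helper name `EqualsAt` is hypothetical and not part of this PR. Regex's own case-insensitive matching depends on culture-aware case folding, so this may not be a drop-in replacement for the existing loops.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of the PR.
internal static class SpanCompareSketch
{
    // Returns true when `word` occurs in `text` starting exactly at `start`.
    public static bool EqualsAt(ReadOnlySpan<char> text, int start, ReadOnlySpan<char> word, bool ignoreCase)
    {
        // Reject windows that would run past either end of the text.
        if (start < 0 || word.Length > text.Length - start)
        {
            return false;
        }

        ReadOnlySpan<char> window = text.Slice(start, word.Length);

        // Assumption: ordinal, culture-independent case folding is good enough here.
        return ignoreCase
            ? window.Equals(word, StringComparison.OrdinalIgnoreCase)
            : window.SequenceEqual(word);
    }
}
```

Both branches work on spans, so neither one allocates a substring.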
@@ -136,6 +136,32 @@ public abstract class RegexRunner
runtextbeg = textbeg;
runtextend = textend;

if (!regex.RightToLeft && regex.aho != null)
This is the crux of the prototype. Returns on the first match for now, but can easily be extended to work for multiple matches.
If `regex.aho` is populated, we avoid the whole `while (true)` loop below.
Why for Boyer Moore do we implement it explicitly in the emitted code (RegexCompiler) but for AC we can get away with doing all the job in this utility function? Is it efficiency vs. code complexity? (I haven't thought this all through yet)
Actually I've only prototyped the interpreted paths. I have no idea how the compiled paths will need to turn out
public void Explore()
{
    //var regex2 = new Regex("abc(efg|hij)xyz");
    //var regex1 = new Regex("((abc|def)mno|(xyz|abc)ghi)rst"); // AC word list still has a bug
This prototype has a bug while populating the word list for this pattern. When walking the tree in `RegexWriter`, there is an error in expanding the word list. I'll fix it later :)
{
    public TrieNode()
    {
        Children = new Dictionary<char, int>();
I am guessing a Dictionary is pretty heavyweight in most cases. There are various implementations floating around of dictionaries that start off as lists.
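One way to read the "dictionary that starts off as a list" suggestion is sketched below. The type name `SmallCharMap` and the promotion threshold of 8 are assumptions made for illustration. Neither comes from the PR, and the cutoff would need to be measured.

```csharp
using System.Collections.Generic;

// Hypothetical type for illustration; the name and the threshold are not from the PR.
internal sealed class SmallCharMap
{
    private const int PromoteThreshold = 8; // assumed cutoff, would need benchmarking

    private readonly List<KeyValuePair<char, int>> _pairs = new List<KeyValuePair<char, int>>();
    private Dictionary<char, int> _dictionary; // stays null until the map grows past the threshold

    public int Count => _dictionary != null ? _dictionary.Count : _pairs.Count;

    public bool TryGetValue(char key, out int value)
    {
        if (_dictionary != null)
        {
            return _dictionary.TryGetValue(key, out value);
        }

        // Linear scan: cheap while a trie node has only a handful of children.
        foreach (KeyValuePair<char, int> pair in _pairs)
        {
            if (pair.Key == key)
            {
                value = pair.Value;
                return true;
            }
        }

        value = 0;
        return false;
    }

    public void Set(char key, int value)
    {
        if (_dictionary != null)
        {
            _dictionary[key] = value;
            return;
        }

        // Overwrite an existing entry if the key is already present.
        for (int i = 0; i < _pairs.Count; i++)
        {
            if (_pairs[i].Key == key)
            {
                _pairs[i] = new KeyValuePair<char, int>(key, value);
                return;
            }
        }

        _pairs.Add(new KeyValuePair<char, int>(key, value));

        // Promote to a real dictionary once the list grows past the threshold.
        if (_pairs.Count > PromoteThreshold)
        {
            _dictionary = new Dictionary<char, int>(_pairs.Count);
            foreach (KeyValuePair<char, int> pair in _pairs)
            {
                _dictionary.Add(pair.Key, pair.Value);
            }
            _pairs.Clear();
        }
    }
}
```

Most trie nodes have very few children, so they would never pay for a hash table. Only wide nodes, typically near the root, would be promoted.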

namespace System.Text.RegularExpressions
{
    internal class TrieNode
struct?
Yup, at some point. Right now it'd just force me to use annoying code patterns
Thanks for working on this. I've not yet looked at your changes (and probably won't be able to until January), but you might try running the regex redux benchmark, as it has a bunch of alternations.
@@ -262,7 +267,11 @@ private bool MatchString(string str)

if (!_rightToLeft)
{
-    pos += str.Length;
+    //pos += str.Length;
?
To compensate for the change above. Previously we were decreasing `pos` by `str.Length` while looking for a match => the next position to process is at `pos + str.Length`. Now that I'm using `SequenceEqual`, `pos` doesn't need to change.
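The bookkeeping described here can be sketched as two equivalent left-to-right helpers. This is a simplified illustration and not the actual `MatchString` body. The method names are hypothetical, and `text`, `start`, and `word` stand in for `runtext`, `runtextpos`, and `str`.

```csharp
using System;

// Simplified, hypothetical stand-ins for the two variants of the left-to-right path.
internal static class MatchStringSketch
{
    // Backward char-by-char loop: pos is walked back by word.Length and must be restored.
    public static int AdvanceWithLoop(string text, int start, string word)
    {
        int c = word.Length;
        if (text.Length - start < c)
        {
            return -1; // not enough input left
        }

        int pos = start + c;
        while (c != 0)
        {
            if (word[--c] != text[--pos])
            {
                return -1;
            }
        }

        pos += word.Length; // undo the decrements made by the loop
        return pos;
    }

    // Span comparison: pos is never decremented, so no compensation is needed.
    public static int AdvanceWithSequenceEqual(string text, int start, string word)
    {
        int c = word.Length;
        if (text.Length - start < c)
        {
            return -1;
        }

        int pos = start + c;
        if (!word.AsSpan().SequenceEqual(text.AsSpan(pos - c, c)))
        {
            return -1;
        }

        return pos;
    }
}
```

Both helpers return the position just past the matched word, or -1 when there is no match.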
}
else
{
    words = null!;
no need for !
Actually here we do need the `!`, I think. I've declared `words` as non-nullable, because it's annoying to type `!.` everywhere.
I'm interested in regexredux too, I guess you will need to convert
if (!regex.RightToLeft && regex.aho != null)
{
    bool initialized1 = false;
    foreach ((int indexOfMatch, int lengthOfMatch) result in regex.aho.ProcessString(runtext))
Use `System.Range`?
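For reference, a small sketch of what returning a `System.Range` instead of an (index, length) tuple could look like. The `RangeSketch` helper and its method names are hypothetical; only `Range`, `Range.GetOffsetAndLength`, and string range indexing are real APIs.

```csharp
using System;

// Hypothetical helper showing how an (index, length) pair maps onto System.Range.
internal static class RangeSketch
{
    // A Range stores start and end (exclusive), so the length becomes an end offset.
    public static Range ToRange(int indexOfMatch, int lengthOfMatch) =>
        new Range(indexOfMatch, indexOfMatch + lengthOfMatch);

    // Strings can be sliced directly with a Range.
    public static string Slice(string text, Range match) => text[match];

    // Callers that still need offset and length can recover them from the Range.
    public static (int Offset, int Length) ToOffsetAndLength(string text, Range match) =>
        match.GetOffsetAndLength(text.Length);
}
```

With the input from the PR description, `RangeSketch.Slice("hhhhhhhhhhhhijabcd", RangeSketch.ToRange(11, 7))` yields `hijabcd`.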
Yup, I do plan on running the regex redux benchmark once this prototype is fully ready. Will post the results here then. Thanks for the reviews here meanwhile!
Draft Pull Request was automatically closed for inactivity. It can be manually reopened in the next 30 days if the work resumes.
This is draft PR, strictly for prototyping and exploring using Aho-Corasick (AC) for specific patterns in Regex and to gather initial feedback.
cc @danmosemsft @jeffhandley @eerhardt @stephentoub
Prelim perf results:
Over the past couple weeks, I spent some time prototyping AC in Regex. As #1349 points out, we should expect to see perf gains when alternations are present in a Regex.
Current code and Prototype:
We have a `FindFirstChar` and `Go` loop to find matches in a given input. `FindFirstChar` looks for the first matching character in the input. Once we find a matching character, `Go` runs a loop to see if the following characters match a word in the Regex. This logic, however, is not efficient for multiple words. For example, with the pattern `(efg|xyz|hij)abcd`, `FindFirstChar` looks for `e`, `x`, or `h`. Then `Go` checks for `efgabcd`, `xyzabcd`, and `hijabcd`, in that order. So, for the input `hhhhhhhhhhhhijabcd` (12 `h` characters followed by `ijabcd`), at each `h`, `Go` performs 3 checks => 11 * 3 = 33 checks before the 12th `h` is matched and `hijabcd` is found. In contrast, the AC algorithm scans linearly, so we perform only 11 checks before the 12th `h` is matched and we find `hijabcd`. This is where the speedup comes from!

Note: The benchmarks lie in the sense that the setup phase of AC is not accounted for. The allocations to prepare `_ahoCorasickWords` are not measured in the benchmark. Also, the pattern and text used are specifically chosen to show the speedup, but they are valid nonetheless :)
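A minimal sketch of the linear scan described above, assuming the alternation has already been expanded into a flat word list (`efgabcd`, `xyzabcd`, `hijabcd`). The type `AhoCorasickSketch` and its internals are hypothetical and may differ from the prototype's `RegexAhoCorasick` and `TrieNode`. Only the `ProcessString` name and the (index, length) result shape are borrowed from the prototype's call site.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; the prototype's RegexAhoCorasick may be structured differently.
internal sealed class AhoCorasickSketch
{
    private sealed class Node
    {
        public readonly Dictionary<char, int> Children = new Dictionary<char, int>();
        public int Fail;            // node for the longest proper suffix that is also a trie path
        public int WordLength;      // length of the word ending exactly here, or 0
        public int OutputLink = -1; // nearest word-ending node on the failure chain, or -1
    }

    private readonly List<Node> _nodes = new List<Node> { new Node() }; // index 0 is the root

    public AhoCorasickSketch(IEnumerable<string> words)
    {
        // Phase 1: insert every word into the trie.
        foreach (string word in words)
        {
            int current = 0;
            foreach (char c in word)
            {
                if (!_nodes[current].Children.TryGetValue(c, out int next))
                {
                    next = _nodes.Count;
                    _nodes.Add(new Node());
                    _nodes[current].Children[c] = next;
                }
                current = next;
            }
            _nodes[current].WordLength = word.Length;
        }

        // Phase 2: breadth-first pass that fills in failure and output links.
        var queue = new Queue<int>();
        foreach (int child in _nodes[0].Children.Values)
        {
            queue.Enqueue(child); // depth-1 nodes fail to the root (the default)
        }

        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            foreach (KeyValuePair<char, int> edge in _nodes[node].Children)
            {
                int fail = _nodes[node].Fail;
                while (fail != 0 && !_nodes[fail].Children.ContainsKey(edge.Key))
                {
                    fail = _nodes[fail].Fail;
                }
                if (_nodes[fail].Children.TryGetValue(edge.Key, out int target))
                {
                    fail = target;
                }

                Node childNode = _nodes[edge.Value];
                childNode.Fail = fail;
                childNode.OutputLink = _nodes[fail].WordLength > 0 ? fail : _nodes[fail].OutputLink;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans the text once, left to right, yielding every (index, length) where a word ends.
    public IEnumerable<(int Index, int Length)> ProcessString(string text)
    {
        int state = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            while (state != 0 && !_nodes[state].Children.ContainsKey(c))
            {
                state = _nodes[state].Fail;
            }
            if (_nodes[state].Children.TryGetValue(c, out int next))
            {
                state = next;
            }

            int hit = _nodes[state].WordLength > 0 ? state : _nodes[state].OutputLink;
            for (; hit != -1; hit = _nodes[hit].OutputLink)
            {
                int length = _nodes[hit].WordLength;
                yield return (i - length + 1, length);
            }
        }
    }
}
```

For the three words and the input `hhhhhhhhhhhhijabcd`, the scan yields a single match at index 11 with length 7. Building the automaton is the setup cost that the note above says the benchmarks do not measure.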