Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Explore using Aho-Corasick for specific patterns in Regex #45697

Closed
wants to merge 4 commits into from

Conversation

pgovind
Copy link

@pgovind pgovind commented Dec 7, 2020

This is draft PR, strictly for prototyping and exploring using Aho-Corasick (AC) for specific patterns in Regex and to gather initial feedback.
cc @danmosemsft @jeffhandley @eerhardt @stephentoub
Prelim perf results:

Benchmark:
        _ahoCorasickMultipleWords = new Regex("(efg|xyz|hij)abcd");

        [Benchmark] public void AhoCorasickMultipleWords() => _ahoCorasickMultipleWords.IsMatch(@"hhhhhhhhhhhhijabcd");

D:\repos\performance\src\tools\ResultsComparer>dotnet run --base "D:\repos\before_aho" --diff "D:\repos\after_aho" --threshold 0.001%
summary:
better: 3, geomean: 3.047
total diff: 3

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.09 |          1800.19 |           583.00 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.07 |          1775.07 |           578.28 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      2.99 |          1762.12 |           590.22 |         |

Over the past couple weeks, I spent some time prototyping AC in Regex. As #1349 points out, we should expect to see perf gains when alternations are present in a Regex.

Current code and Prototype:
We have a FindFirstChar and Go loop to find matches in a given input. FindFirstChar looks for the first matching character in the input. Once we find a matching character, Go runs a loop to see if the following characters match a word in the Regex. This logic is however not efficient for multiple words. For ex: If we have the pattern (efg|xyz|hij)abcd, FindFirstChar looks for either e, x, h. Then Go checks for efgabcd, xyzabcd and hijabcd in that order. So, for the input hhhhhhhhhhhhijabcd (12 h followed by abcd), at each h, Go performs 3 checks => 11 * 3 = 33 checks before the 12th h is matched and hijabcd is found. In contrast, the AC algorithm scans linearly, so we perform only 11 checks before the 12th h is matched and we find hijabcd. This is where the speed up comes from!

Note: The benchmarks lie in the sense that the setup phase of AC is not accounted for. The allocations to prepare _ahoCorasickWords are not measured in the benchmark. Also, the pattern and text used are the specifically chosen to see the speedup, but they are valid nonetheless :)

@ghost
Copy link

ghost commented Dec 7, 2020

Tagging subscribers to this area: @eerhardt, @pgovind
See info in area-owners.md if you want to be subscribed.

Issue Details

This is draft PR, strictly for prototyping and exploring using Aho-Corasick (AC) for specific patterns in Regex and to gather initial feedback.
cc @danmosemsft @jeffhandley @eerhardt @stephentoub
Prelim perf results:

Benchmark:
        _ahoCorasickMultipleWords = new Regex("(efg|xyz|hij)abcd");

        [Benchmark] public void AhoCorasickMultipleWords() => _ahoCorasickMultipleWords.IsMatch(@"hhhhhhhhhhhhijabcd");

D:\repos\performance\src\tools\ResultsComparer>dotnet run --base "D:\repos\before_aho" --diff "D:\repos\after_aho" --threshold 0.001%
summary:
better: 3, geomean: 3.047
total diff: 3

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.09 |          1800.19 |           583.00 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.07 |          1775.07 |           578.28 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      2.99 |          1762.12 |           590.22 |         |

Over the past couple weeks, I spent some time prototyping AC in Regex. As #1349 points out, we should expect to see perf gains when alternations are present in a Regex.

Current code and Prototype:
We have a FindFirstChar and Go loop to find matches in a given input. FindFirstChar looks for the first matching character in the input. Once we find a matching character, Go runs a loop to see if the following characters match a word in the Regex. This logic is however not efficient for multiple words. For ex: If we have the pattern (efg|xyz|hij)abcd, FindFirstChar looks for either e, x, h. Then Go checks for efgabcd, xyzabcd and hijabcd in that order. So, for the input hhhhhhhhhhhhijabcd (12 h followed by abcd), at each h, Go performs 3 checks => 11 * 3 = 33 checks before the 12th h is matched and hijabcd is found. In contrast, the AC algorithm scans linearly, so we perform only 11 checks before the 12th h is matched and we find hijabcd. This is where the speed up comes from!

Note: The benchmarks lie in the sense that the setup phase of AC is not accounted for. The allocations to prepare _ahoCorasickWords are not measured in the benchmark. Also, the pattern and text used are the specifically chosen to see the speedup, but they are valid nonetheless :)

Author: pgovind
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@@ -684,6 +684,7 @@ public static bool TryGetSingleRange(string set, out char lowInclusive, out char
return false;
}

// Possible place that can be optimized with an IndexSet. Or the return can be modified to use RegexCharClass
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the changes in this file. They are just notes for me

if (!_caseInsensitive)
{
while (c != 0)
if (!str.AsSpan().SequenceEqual(runtext.AsSpan(pos - c, c)))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should see some benefit here from the vectorization work that went into string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's another one on line 302.

is there a utility to compare spans ignoring case? we have several of those loops too.

@@ -136,6 +136,32 @@ public abstract class RegexRunner
runtextbeg = textbeg;
runtextend = textend;

if (!regex.RightToLeft && regex.aho != null)
Copy link
Author

@pgovind pgovind Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the crux of the prototype. Returns on the first match for now, but can easily be extended to work for multiple matches.

If regex.aho is populated, we avoid the whole while (true) loop below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why for Boyer Moore do we implement it explicitly in the emitted code (RegexCompiler) but for AC we can get away with doing all the job in this utility function? Is it efficiency vs. code complexity? (I haven't thought this all through yet)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I've only prototyped the interpreted paths. I have no idea how the compiled paths will need to turn out

public void Explore()
{
//var regex2 = new Regex("abc(efg|hij)xyz");
//var regex1 = new Regex("((abc|def)mno|(xyz|abc)ghi)rst"); // AC word list still has a bug
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prototype has a bug while populating the word list for this pattern. When walking the tree in RegexWriter, there is an error in expanding the word list. I'll fix it later :)

{
public TrieNode()
{
Children = new Dictionary<char, int>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing a Dictionary is pretty heavyweight in most cases. There are various implementations floating around of dictionaries that start of as lists.


namespace System.Text.RegularExpressions
{
internal class TrieNode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, at some point. Right now it'd just force me to use annoying code patterns

@stephentoub
Copy link
Member

Thanks for working on this. I've not yet looked at your changes (and probably won't be able to until January), but you might try running the regex redux benchmark, as it has a bunch of alternations.

@@ -262,7 +267,11 @@ private bool MatchString(string str)

if (!_rightToLeft)
{
pos += str.Length;
//pos += str.Length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compensate for the change above. Previously we were decreasing pos by str.Length while looking for a match => the next position to process is at pos + str.Length. Now that I'm using SequenceEqual, pos doesn't need to change.

}
else
{
words = null!;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for !

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually here we do need the ! I think. I've declared words as non-nullable, because it's annoying to type !. everywhere

@danmoseley
Copy link
Member

I'm interested in regexredux too, I guess you will need to convert ProcessString(..) to take a span so it can do multiple matches.

if (!regex.RightToLeft && regex.aho != null)
{
bool initialized1 = false;
foreach ((int indexOfMatch, int lengthOfMatch) result in regex.aho.ProcessString(runtext))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use System.Range ?

@pgovind
Copy link
Author

pgovind commented Dec 16, 2020

interested in regexredux too

running the regex redux benchmark

Yup, I do plan on running the regex redux benchmark once this prototype is fully ready. Will post the results here then. Thanks for the reviews here meanwhile!

@jeffhandley jeffhandley marked this pull request as draft January 23, 2021 02:31
@pgovind pgovind added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 28, 2021
Base automatically changed from master to main March 1, 2021 09:07
@ghost ghost closed this Mar 31, 2021
@ghost
Copy link

ghost commented Mar 31, 2021

Draft Pull Request was automatically closed for inactivity. It can be manually reopened in the next 30 days if the work resumes.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 30, 2021
@karelz karelz added this to the 6.0.0 milestone May 20, 2021
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants