WIP: Explore using Aho-Corasick for specific patterns in Regex #45697

pgovind · 2020-12-07T19:25:12Z

This is a draft PR, strictly for prototyping and exploring the use of Aho-Corasick (AC) for specific patterns in Regex, and for gathering initial feedback.
cc @danmosemsft @jeffhandley @eerhardt @stephentoub
Prelim perf results:

Benchmark:
        _ahoCorasickMultipleWords = new Regex("(efg|xyz|hij)abcd");

        [Benchmark] public void AhoCorasickMultipleWords() => _ahoCorasickMultipleWords.IsMatch(@"hhhhhhhhhhhhijabcd");

D:\repos\performance\src\tools\ResultsComparer>dotnet run --base "D:\repos\before_aho" --diff "D:\repos\after_aho" --threshold 0.001%
summary:
better: 3, geomean: 3.047
total diff: 3

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.09 |          1800.19 |           583.00 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      3.07 |          1775.07 |           578.28 |         |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.AhoCorasickMultipleWords( |      2.99 |          1762.12 |           590.22 |         |

Over the past couple weeks, I spent some time prototyping AC in Regex. As #1349 points out, we should expect to see perf gains when alternations are present in a Regex.

Current code and Prototype:
We have a FindFirstChar and Go loop to find matches in a given input. FindFirstChar looks for the first matching character in the input. Once it finds one, Go runs a loop to see whether the following characters match a word in the Regex. This logic is not efficient for multiple words, however. For example, with the pattern (efg|xyz|hij)abcd, FindFirstChar looks for e, x, or h. Then Go checks for efgabcd, xyzabcd and hijabcd, in that order. So for the input hhhhhhhhhhhhijabcd (12 h's followed by ijabcd), Go performs 3 checks at each h => 11 * 3 = 33 checks before the 12th h is reached and hijabcd is found. In contrast, the AC algorithm scans the input linearly, so we perform only 11 checks before the 12th h is reached and we find hijabcd. This is where the speedup comes from!

Note: The benchmarks lie in the sense that the setup phase of AC is not accounted for: the allocations to prepare _ahoCorasickWords are not measured in the benchmark. Also, the pattern and text used are specifically chosen to show the speedup, but they are valid nonetheless :)
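
For readers unfamiliar with AC, here is a minimal, self-contained sketch of the linear scan described above. It is not the RegexAhoCorasick code from this PR: the type name `AhoCorasickSketch`, its members, and the choice to expand the pattern into the full words efgabcd, xyzabcd and hijabcd are all assumptions made for illustration. It reports the match that ends earliest, which is not necessarily what Regex's ordered alternation semantics would pick in every case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch for illustration; not the RegexAhoCorasick type in this PR.
public sealed class AhoCorasickSketch
{
    // Node i is described by _next[i] (outgoing edges), _fail[i] (failure link)
    // and _matchLength[i] (length of the longest word ending at i, 0 if none).
    private readonly List<Dictionary<char, int>> _next = new List<Dictionary<char, int>>();
    private readonly List<int> _fail = new List<int>();
    private readonly List<int> _matchLength = new List<int>();

    public AhoCorasickSketch(IEnumerable<string> words)
    {
        NewNode(); // node 0 is the root

        // 1. Build the trie of all words.
        foreach (string word in words)
        {
            int node = 0;
            foreach (char ch in word)
            {
                if (!_next[node].TryGetValue(ch, out int child))
                {
                    child = NewNode();
                    _next[node][ch] = child;
                }
                node = child;
            }
            _matchLength[node] = word.Length;
        }

        // 2. Breadth-first pass to compute failure links.
        //    Depth-1 nodes keep the default failure link (the root).
        var queue = new Queue<int>();
        foreach (int child in _next[0].Values) queue.Enqueue(child);
        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            foreach (KeyValuePair<char, int> edge in _next[node])
            {
                // Follow the parent's failure chain until a node with the same edge exists.
                int f = _fail[node];
                while (f != 0 && !_next[f].ContainsKey(edge.Key)) f = _fail[f];
                if (_next[f].TryGetValue(edge.Key, out int target)) f = target;
                _fail[edge.Value] = f;

                // A node with no word of its own inherits the word of its failure target.
                if (_matchLength[edge.Value] == 0) _matchLength[edge.Value] = _matchLength[f];
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Returns (index, length) of the word occurrence that ends earliest, or (-1, 0).
    // The scan position i only moves forward, so nothing is re-checked from each candidate start.
    public (int Index, int Length) FindFirst(ReadOnlySpan<char> text)
    {
        int node = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            while (node != 0 && !_next[node].ContainsKey(ch)) node = _fail[node];
            if (_next[node].TryGetValue(ch, out int child)) node = child;
            if (_matchLength[node] != 0)
            {
                return (i - _matchLength[node] + 1, _matchLength[node]);
            }
        }
        return (-1, 0);
    }

    private int NewNode()
    {
        _next.Add(new Dictionary<char, int>());
        _fail.Add(0);
        _matchLength.Add(0);
        return _next.Count - 1;
    }
}
```

With the three expanded words and an input of 12 h's followed by ijabcd, `FindFirst` returns index 11 and length 7, i.e. the match starts at the 12th h. Each repeated h costs one failed lookup plus one hop back to the root, rather than a fresh comparison against every alternative.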

ghost · 2020-12-07T19:25:17Z

Tagging subscribers to this area: @eerhardt, @pgovind
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	pgovind
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

pgovind · 2020-12-07T19:27:47Z

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

@@ -684,6 +684,7 @@ public static bool TryGetSingleRange(string set, out char lowInclusive, out char
            return false;
        }

+        // Possible place that can be optimized with an IndexSet. Or the return can be modified to use RegexCharClass


Ignore the changes in this file. They are just notes for me

pgovind · 2020-12-07T19:28:48Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

            if (!_caseInsensitive)
            {
-                while (c != 0)
+                if (!str.AsSpan().SequenceEqual(runtext.AsSpan(pos - c, c)))


We should see some benefit here from the vectorization work that went into string

there's another one on line 302.

is there a utility to compare spans ignoring case? we have several of those loops too.
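
On the ignore-case question: `MemoryExtensions` does have an `Equals` overload for `ReadOnlySpan<char>` that takes a `StringComparison`. A small sketch of both comparisons follows; the class and method names are hypothetical. Whether `OrdinalIgnoreCase` lines up with the culture-aware lowercasing the regex engine applies is an assumption that would need checking before swapping it in for the existing loops.

```csharp
using System;

// Hypothetical helper names; only the MemoryExtensions calls are real APIs.
public static class SpanCompareSketch
{
    // Case-sensitive: compares `literal` with the window of `text` that ends at `end`.
    public static bool MatchesOrdinal(string text, int end, string literal) =>
        end >= literal.Length && end <= text.Length &&
        text.AsSpan(end - literal.Length, literal.Length).SequenceEqual(literal.AsSpan());

    // Case-insensitive variant using the StringComparison overload of Equals.
    // Assumption: ordinal case folding is acceptable; culture-specific rules are NOT applied.
    public static bool MatchesIgnoreCase(string text, int end, string literal) =>
        end >= literal.Length && end <= text.Length &&
        text.AsSpan(end - literal.Length, literal.Length)
            .Equals(literal.AsSpan(), StringComparison.OrdinalIgnoreCase);
}
```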

pgovind · 2020-12-07T19:30:02Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

@@ -136,6 +136,32 @@ public abstract class RegexRunner
            runtextbeg = textbeg;
            runtextend = textend;

+            if (!regex.RightToLeft && regex.aho != null)


This is the crux of the prototype. Returns on the first match for now, but can easily be extended to work for multiple matches.

If regex.aho is populated, we avoid the whole while (true) loop below

Why do we implement Boyer-Moore explicitly in the emitted code (RegexCompiler), while for AC we can get away with doing the whole job in this utility function? Is it efficiency vs. code complexity? (I haven't thought this all through yet)

Actually I've only prototyped the interpreted paths. I have no idea how the compiled paths will need to turn out

pgovind · 2020-12-07T19:33:01Z

src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs

+        public void Explore()
+        {
+            //var regex2 = new Regex("abc(efg|hij)xyz");
+            //var regex1 = new Regex("((abc|def)mno|(xyz|abc)ghi)rst"); // AC word list still has a bug


This prototype has a bug while populating the word list for this pattern. When walking the tree in RegexWriter, there is an error in expanding the word list. I'll fix it later :)


danmoseley · 2020-12-07T19:39:07Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexAhoCorasick.cs

+    {
+        public TrieNode()
+        {
+            Children = new Dictionary<char, int>();


I am guessing a Dictionary is pretty heavyweight in most cases. There are various implementations floating around of dictionaries that start off as lists.
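
To make the "starts off as a list" idea concrete, here is one possible shape for a per-node child map. It is only a sketch: the type name `SmallCharMap`, its members, and the threshold of 8 are assumptions for illustration, and the cutoff has not been measured. Lookups are a linear scan over a small array until the node has more children than the threshold, at which point it switches to a Dictionary.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical child map for a trie node; not part of this PR.
public sealed class SmallCharMap
{
    private const int Threshold = 8; // assumed cutoff, not measured

    private char[] _keys = new char[4];
    private int[] _values = new int[4];
    private int _count;
    private Dictionary<char, int>? _overflow; // allocated only once Threshold is exceeded

    public bool TryGetValue(char key, out int value)
    {
        if (_overflow != null) return _overflow.TryGetValue(key, out value);

        // Linear scan over the keys that are in use.
        int i = Array.IndexOf(_keys, key, 0, _count);
        if (i >= 0)
        {
            value = _values[i];
            return true;
        }
        value = 0;
        return false;
    }

    public void Set(char key, int value)
    {
        if (_overflow != null)
        {
            _overflow[key] = value;
            return;
        }

        int i = Array.IndexOf(_keys, key, 0, _count);
        if (i >= 0)
        {
            _values[i] = value; // overwrite an existing entry
            return;
        }

        if (_count == Threshold)
        {
            // Too many children for a linear scan: move everything into a Dictionary.
            _overflow = new Dictionary<char, int>(_count * 2);
            for (int j = 0; j < _count; j++) _overflow[_keys[j]] = _values[j];
            _overflow[key] = value;
            return;
        }

        if (_count == _keys.Length)
        {
            Array.Resize(ref _keys, _keys.Length * 2);
            Array.Resize(ref _values, _values.Length * 2);
        }
        _keys[_count] = key;
        _values[_count] = value;
        _count++;
    }
}
```

For a word list like the one in the benchmark, most trie nodes have a single child, so they would never leave the array path.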

danmoseley · 2020-12-07T19:39:12Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexAhoCorasick.cs

+
+namespace System.Text.RegularExpressions
+{
+    internal class TrieNode


Yup, at some point. Right now it'd just force me to use annoying code patterns

stephentoub · 2020-12-07T19:43:42Z

Thanks for working on this. I've not yet looked at your changes (and probably won't be able to until January), but you might try running the regex redux benchmark, as it has a bunch of alternations.

danmoseley · 2020-12-07T19:46:29Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

@@ -262,7 +267,11 @@ private bool MatchString(string str)

            if (!_rightToLeft)
            {
-                pos += str.Length;
+                //pos += str.Length;


To compensate for the change above. Previously we were decreasing pos by str.Length while looking for a match => the next position to process is at pos + str.Length. Now that I'm using SequenceEqual, pos doesn't need to change.
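
The bookkeeping can be seen side by side in the standalone sketch below. The two methods are hypothetical reductions of the left-to-right, case-sensitive path only, reconstructed from the diff fragments in this thread, so the exact shape of the original loop is an assumption. Both return the position after the match, or -1 on a mismatch.

```csharp
using System;

// Hypothetical reduction of MatchString (left-to-right, case-sensitive only).
public static class MatchStringSketch
{
    // Loop variant: compares backwards, so pos ends at the START of the match
    // and has to be advanced by str.Length afterwards.
    public static int MatchWithLoop(string runtext, int runtextpos, string str)
    {
        int c = str.Length;
        if (runtext.Length - runtextpos < c) return -1;

        int pos = runtextpos + c;
        while (c != 0)
        {
            if (str[--c] != runtext[--pos]) return -1;
        }

        pos += str.Length; // undo the backwards walk
        return pos;
    }

    // SequenceEqual variant: pos is never decremented, so no adjustment is needed.
    public static int MatchWithSequenceEqual(string runtext, int runtextpos, string str)
    {
        int c = str.Length;
        if (runtext.Length - runtextpos < c) return -1;

        int pos = runtextpos + c;
        if (!str.AsSpan().SequenceEqual(runtext.AsSpan(pos - c, c))) return -1;

        return pos; // already points just past the match
    }
}
```

For example, matching "cde" in "abcdef" from position 2 yields 5 from both variants.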


danmoseley · 2020-12-07T22:22:01Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexWriter.cs

+            }
+            else
+            {
+                words = null!;


no need for !

Actually here we do need the ! I think. I've declared words as non-nullable, because it's annoying to type !. everywhere

danmoseley · 2020-12-07T22:22:55Z

I'm interested in regexredux too, I guess you will need to convert ProcessString(..) to take a span so it can do multiple matches.

eerhardt · 2020-12-08T16:38:18Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

+            if (!regex.RightToLeft && regex.aho != null)
+            {
+                bool initialized1 = false;
+                foreach ((int indexOfMatch, int lengthOfMatch) result in regex.aho.ProcessString(runtext))


Use System.Range ?
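
A rough idea of what that could look like, assuming the (indexOfMatch, lengthOfMatch) tuple from the diff above is simply converted to a `System.Range`. The helper names are hypothetical; `Range`, the `..` operator, string range indexing and `GetOffsetAndLength` are the real APIs.

```csharp
using System;

// Hypothetical helpers showing a match expressed as a System.Range.
public static class MatchRangeSketch
{
    // A Range stores start and end, so the length is folded into the end index.
    public static Range ToRange(int indexOfMatch, int lengthOfMatch) =>
        indexOfMatch..(indexOfMatch + lengthOfMatch);

    // Strings can be indexed with a Range directly.
    public static string Slice(string runtext, Range match) => runtext[match];
}
```

Callers that still need the index and length can recover them with `match.GetOffsetAndLength(runtext.Length)`.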

pgovind · 2020-12-16T19:24:20Z

> interested in regexredux too

> running the regex redux benchmark

Yup, I do plan on running the regex redux benchmark once this prototype is fully ready. Will post the results here then. Thanks for the reviews here meanwhile!

ghost · 2021-03-31T10:00:25Z

Draft Pull Request was automatically closed for inactivity. It can be manually reopened in the next 30 days if the work resumes.

Prashanth Govindarajan added 4 commits November 16, 2020 11:59

Unit test to explore

3f29573

Vectorize MatchString for case sensitive

c08fea2

Use AhoCorasick

fb08b06

FIrst cut of AC

8f68654

Dotnet-GitSync-Bot added the area-System.Text.RegularExpressions label Dec 7, 2020

pgovind requested review from stephentoub, danmoseley, eerhardt and jeffhandley December 7, 2020 19:25
