[API Proposal]: Efficiently matching multiple strings. #69682

teo-tsirpanis · 2022-05-23T19:01:33Z

Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[]).

API Proposal

namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}

Creating a MultiStringMatcher accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.

The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.

If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

API Usage

var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)

Alternative Designs

An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.

If we want this to be a general-purpose API, we have to also think of the following:

Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2022-05-23T19:01:35Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2022-05-23T19:02:06Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[]).

API Proposal

namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}

Creating a MultiStringMatcher accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.

The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.

If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

API Usage

var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)

Alternative Designs

An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.

If we want this to be a general-purpose API, we have to also think of the following:

Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

Author:	teo-tsirpanis
Assignees:	-
Labels:	`api-suggestion`, `area-System.Text.RegularExpressions`, `untriaged`
Milestone:	-

teo-tsirpanis · 2022-05-23T19:02:40Z

c.c. @stephentoub @danmoseley

danmoseley · 2022-05-23T19:32:08Z

My preference would be the Regex motivated API, as that's what we have the clearest motivation for. We will be able to immediately use and validate that API. It doesn't prevent us extending later if necessary.

One other thought, Find may not be the best name for the method since you're passing in the "thing to be found in". FindIn, perhaps? IndexIn?

danmoseley · 2022-05-23T19:38:29Z

Is there a way we can make this generic, so I could search for multiple byte sequences in a span of bytes? Internally, it would not be able to use any algorithm vectorized for char, but Aho-Corasick could work the same.

I haven't thought this through, not considered whether it would be used, just wanted to throw it out there before we bake string into the type name.

GrabYourPitchforks · 2022-05-24T00:46:24Z

How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

I don't think we should add UTF-8 support at this time. As you point out, UTF-8 casing support is far more complicated than UTF-16 casing support, and there are significant performance considerations with it. We should trial balloon other simpler UTF-8 APIs (like regular case mapping) before taking on something as ambitious as this. That will give us the public feedback we need as to whether the usability and performance is acceptable for our consumers, and it allows us to get a feel for what API shapes have the greatest success rate.

danmoseley · 2022-06-13T20:17:54Z

How are things looking @teo-tsirpanis ?

teo-tsirpanis · 2022-06-13T23:48:47Z

I have done an initial implementation of AC and benchmarks are very promising. I've struggled over these days to implement the optimized leftmost matching done in the Rust aho-corasick package, but that could come later (I have implemented in a different way as explained in the linked comment above).

Until this API gets approved, we can try it in non-source-generated regexes. Thankfuly there is #45697 to see how to integrate it in the regex engine.

danmoseley · 2022-06-18T02:14:40Z

Thanks for the update @teo-tsirpanis. To confirm, you plan to continue with your next step to wire it up to regex? If so that sounds good.
Do you think we should get this API into review meantime, or wait?

BTW the team in this repo will be transitioning into mostly bug fix mode in mid July (this need not affect community contributions, but ideally anything that needs stabilization time will be in by then), then main will become a .NET 8 branch in mid August - the same plan we followed last year. It's totally fine if this ends up having to be .NET 8, just wanted you to be aware of the timelines in case we can get part of it into .NET 7!

teo-tsirpanis · 2022-06-18T15:06:02Z

Yes, I'm studying the Regex code to figure out how to integrate it. The proposal can be marked as ready for review in the meantime.

danmoseley · 2022-06-20T04:46:38Z

Thanks @teo-tsirpanis . The API review will be interested to see what the consumption in regex looks like, in order to demonstrate the API is the right shape. When you have that code working, we can mark ready for review, and expedite if it's necessary.

teo-tsirpanis · 2022-06-30T10:27:17Z

Thinking of this again, this API might not be actually needed, at least for Aho-Corasick. The algorithm's search logic is pretty simple and the regex source generator could emit something like this:

private readonly struct TrieNode {
    public Dictionary<char, int> Children { get; init; }
    public int SuffixLink { get; init; }
    public int DictionaryLink { get; init; }
}

private static readonly TrieNode[] s_trieNodes = new TrieNode[] {
    new TrieNode() {
        Children = new Dictionary() {
            ['a'] = 1,
            ['b'] = 2
        }
    },
    // ...
};

and emit a standard algorithm code that works over s_trieNodes. We also reduce the runtime initialization cost. I haven't looked very carefully into Teddy and can't say if the same applies to it as well.

As a sidenote, the non-source-generated Aho-Corasick implementation is going very well, and I am in the stage of testing and fixing bugs. I expect to open a PR by the end of the week.

stephentoub · 2022-06-30T10:58:44Z

I expect to open a PR by the end of the week.

Thanks for working on this. It might be counterintuitive, but I would like to avoid adding optimizations to the interpreter that aren't also in Compiled and the source generator, and similarly optimizations to Compiled that aren't in the source generator. It otherwise complicates the story and decision tree about which to use when. With such optimizations we've delayed adding them until they can be added everywhere, even if that means delaying their inclusion in some.

in the stage of testing and fixing bugs

Have you run any perf tests? Just from skimming your linked commits, I expect this will actually regress some patterns due to the lack of any vectorization; given a pattern like "abc|def" today we'll do something like IndexOfAny('c', 'f') to jump ahead, and if those are relatively infrequent in the input text, that will process much faster than processing a trie every character.

teo-tsirpanis · 2022-06-30T12:43:37Z

Thanks for the feedback. Compiled is very easy to support and I'm currently adding this, and the source generator work might take a bit longer. I will submit the PR after all three are supported.

As explained in #62447 (comment), I have compared the existing Regex engine with a separate AC implementation. I yesterday ran it with 2, 5, 8 and 10 strings, and here are the results:

Method	WordCount	Mean	Error	StdDev	Median	Ratio	RatioSD
CountWordsRegex	2	1,199.5 us	17.43 us	28.14 us	1,202.1 us	1.00	0.00
CountWordsCompiledRegex	2	180.4 us	4.93 us	14.54 us	175.6 us	0.16	0.01
CountWordsRegal	2	1,443.4 us	28.34 us	38.79 us	1,440.0 us	1.20	0.05

CountWordsRegex	5	2,479.7 us	42.85 us	37.98 us	2,489.3 us	1.00	0.00
CountWordsCompiledRegex	5	589.3 us	11.77 us	31.61 us	587.9 us	0.23	0.01
CountWordsRegal	5	1,284.4 us	25.11 us	44.63 us	1,276.1 us	0.53	0.03

CountWordsRegex	8	4,069.5 us	140.35 us	413.83 us	3,994.8 us	1.00	0.00
CountWordsCompiledRegex	8	825.2 us	16.30 us	16.74 us	820.8 us	0.22	0.01
CountWordsRegal	8	2,005.1 us	39.90 us	99.36 us	2,015.3 us	0.50	0.06

CountWordsRegex	10	4,673.7 us	226.28 us	638.22 us	4,440.4 us	1.00	0.00
CountWordsCompiledRegex	10	874.6 us	18.00 us	53.06 us	872.9 us	0.19	0.02
CountWordsRegal	10	1,985.6 us	37.63 us	33.36 us	1,978.2 us	0.45	0.01

Based on that, if the trie has less than five matches, it will not be used. The trie will be traversed in case we find a common prefix and use it in the LeadingString_*To* mode (at the time of writing this the relevant code has a bug which will be fixed), and if there isn't we try the rest of the RegexFindOptimizations modes.

stephentoub · 2022-06-30T13:02:14Z

Based on that, if the trie has less than five matches, it will not be used

Have you tried a pattern like (ab|cd|ef|gh|ij|kl)[mn]? The current code should vectorize the search for that [mn], at least for Compiled / source generator, e.g.

private string _str = new string(' ', 1_000_000);

[Benchmark(Baseline = true)]
public bool IsMatch1() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)", RegexOptions.Compiled);

[Benchmark]
public bool IsMatch2() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)[mn]", RegexOptions.Compiled);

Method	Mean	Error	StdDev	Ratio
IsMatch1	730.25 us	11.901 us	13.705 us	1.00
IsMatch2	54.87 us	1.095 us	1.737 us	0.08

stephentoub · 2022-06-30T13:05:20Z

(My general point is anything currently vectorized should typically be favored over the trie. That's more than just starting sets.)

teo-tsirpanis · 2022-07-19T19:39:45Z

#71588 implements Aho-Corasick without this API by directly emitting the algorithm's logic, bringing the initialization cost to zero and the matching performance to a level this API could never reach. I haven't investigated how easy it would be to do the same with Teddy, but a public API with a surface area specific to regexes that searches over multiple strings does not seem like the right answer, both design-wise and because we wouldn't get away with the initialization cost. Closing.

teo-tsirpanis added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 23, 2022

ghost added the untriaged New issue has not been triaged by the area owner label May 23, 2022

teo-tsirpanis added the area-System.Text.RegularExpressions label May 23, 2022

teo-tsirpanis self-assigned this May 23, 2022

joperezr removed the untriaged New issue has not been triaged by the area owner label Jun 29, 2022

joperezr added this to the Future milestone Jun 29, 2022

teo-tsirpanis mentioned this issue Jul 2, 2022

Add support for the Aho-Corasick algorithm in regexes. #71588

Closed

teo-tsirpanis closed this as not planned Won't fix, can't repro, duplicate, stale Jul 19, 2022

ghost locked as resolved and limited conversation to collaborators Aug 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: Efficiently matching multiple strings. #69682

[API Proposal]: Efficiently matching multiple strings. #69682

teo-tsirpanis commented May 23, 2022

dotnet-issue-labeler bot commented May 23, 2022

ghost commented May 23, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

teo-tsirpanis commented May 23, 2022

danmoseley commented May 23, 2022

danmoseley commented May 23, 2022 •

edited

Loading

GrabYourPitchforks commented May 24, 2022 •

edited

Loading

danmoseley commented Jun 13, 2022

teo-tsirpanis commented Jun 13, 2022

danmoseley commented Jun 18, 2022

teo-tsirpanis commented Jun 18, 2022

danmoseley commented Jun 20, 2022

teo-tsirpanis commented Jun 30, 2022

stephentoub commented Jun 30, 2022 •

edited

Loading

teo-tsirpanis commented Jun 30, 2022

stephentoub commented Jun 30, 2022 •

edited

Loading

stephentoub commented Jun 30, 2022 •

edited

Loading

teo-tsirpanis commented Jul 19, 2022

[API Proposal]: Efficiently matching multiple strings. #69682

[API Proposal]: Efficiently matching multiple strings. #69682

Comments

teo-tsirpanis commented May 23, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

dotnet-issue-labeler bot commented May 23, 2022

ghost commented May 23, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

teo-tsirpanis commented May 23, 2022

danmoseley commented May 23, 2022

danmoseley commented May 23, 2022 • edited Loading

GrabYourPitchforks commented May 24, 2022 • edited Loading

danmoseley commented Jun 13, 2022

teo-tsirpanis commented Jun 13, 2022

danmoseley commented Jun 18, 2022

teo-tsirpanis commented Jun 18, 2022

danmoseley commented Jun 20, 2022

teo-tsirpanis commented Jun 30, 2022

stephentoub commented Jun 30, 2022 • edited Loading

teo-tsirpanis commented Jun 30, 2022

stephentoub commented Jun 30, 2022 • edited Loading

stephentoub commented Jun 30, 2022 • edited Loading

teo-tsirpanis commented Jul 19, 2022

danmoseley commented May 23, 2022 •

edited

Loading

GrabYourPitchforks commented May 24, 2022 •

edited

Loading

stephentoub commented Jun 30, 2022 •

edited

Loading

stephentoub commented Jun 30, 2022 •

edited

Loading

stephentoub commented Jun 30, 2022 •

edited

Loading