[API Proposal]: Efficiently matching multiple strings. #69682
Comments
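Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Issue Details
Background and motivation
As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes. Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction time, which will essentially serve the role of an optimized string.IndexOf(string[]).
API Proposal
namespace System.Text;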
public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);
    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}
[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}
The MultiStringMatcher constructor accepts an array or span of the strings it will recognize. The constructors are overloaded to accept a MultiStringMatcherOptions enum. If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1).
API Usage
var matcher = new MultiStringMatcher("foo", "bar", "baz");
Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)
Alternative Designs
An important design decision is whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need.
// The namespace changed.
namespace System.Text.RegularExpressions;
public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }
    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);
    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}
// MultiStringMatcherOptions stays the same.
If we want this to be a general-purpose API, we have to also think of the following:
And do we actually need case-insensitivity?
Risks
The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this; we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators to add files private to themselves, but that's a big if.
|
c.c. @stephentoub @danmoseley |
My preference would be the Regex motivated API, as that's what we have the clearest motivation for. We will be able to immediately use and validate that API. It doesn't prevent us extending later if necessary. One other thought, Find may not be the best name for the method since you're passing in the "thing to be found in". FindIn, perhaps? IndexIn? |
Is there a way we can make this generic, so I could search for multiple byte sequences in a span of bytes? Internally, it would not be able to use any algorithm vectorized for char, but Aho-Corasick could work the same. I haven't thought this through, not considered whether it would be used, just wanted to throw it out there before we bake string into the type name. |
I don't think we should add UTF-8 support at this time. As you point out, UTF-8 casing support is far more complicated than UTF-16 casing support, and there are significant performance considerations with it. We should trial balloon other simpler UTF-8 APIs (like regular case mapping) before taking on something as ambitious as this. That will give us the public feedback we need as to whether the usability and performance is acceptable for our consumers, and it allows us to get a feel for what API shapes have the greatest success rate. |
How are things looking @teo-tsirpanis ? |
I have done an initial implementation of AC and benchmarks are very promising. I've struggled over these days to implement the optimized leftmost matching done in the Rust Until this API gets approved, we can try it in non-source-generated regexes. Thankfully there is #45697 to see how to integrate it in the regex engine. |
Thanks for the update @teo-tsirpanis. To confirm, you plan to continue with your next step to wire it up to regex? If so that sounds good. BTW the team in this repo will be transitioning into mostly bug fix mode in mid July (this need not affect community contributions, but ideally anything that needs stabilization time will be in by then), then main will become a .NET 8 branch in mid August - the same plan we followed last year. It's totally fine if this ends up having to be .NET 8, just wanted you to be aware of the timelines in case we can get part of it into .NET 7! |
Yes, I'm studying the |
Thanks @teo-tsirpanis . The API review will be interested to see what the consumption in regex looks like, in order to demonstrate the API is the right shape. When you have that code working, we can mark ready for review, and expedite if it's necessary. |
Thinking of this again, this API might not be actually needed, at least for Aho-Corasick. The algorithm's search logic is pretty simple and the regex source generator could emit something like this:
private readonly struct TrieNode {
    public Dictionary<char, int> Children { get; init; }
    public int SuffixLink { get; init; }
    public int DictionaryLink { get; init; }
}
private static readonly TrieNode[] s_trieNodes = new TrieNode[] {
    new TrieNode() {
        // The type arguments are required; a bare "new Dictionary()" does not compile.
        Children = new Dictionary<char, int>() {
            ['a'] = 1,
            ['b'] = 2
        }
    },
    // ...
};
and emit standard algorithm code that works over it.
As a sidenote, the non-source-generated Aho-Corasick implementation is going very well, and I am in the stage of testing and fixing bugs. I expect to open a PR by the end of the week. |
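For readers who want to see what "standard algorithm code" over such a table might look like, here is a minimal sketch. It is not the code from any PR. The MatchLength property, the AhoCorasickSketch class, the hand-built table for the words "abc" and "b", and the convention that DictionaryLink is -1 when absent are all assumptions made for illustration. Note that it reports the first match the automaton detects (the earliest-ending one), which is not the same as the leftmost match a regex needs.

```csharp
using System;
using System.Collections.Generic;

public static class AhoCorasickSketch
{
    private readonly struct TrieNode
    {
        public Dictionary<char, int> Children { get; init; }
        public int SuffixLink { get; init; }
        public int DictionaryLink { get; init; } // assumed: -1 when no shorter word ends here
        public int MatchLength { get; init; }    // hypothetical: 0 unless a word ends at this node
    }

    // Hand-built table for the words "abc" and "b" (illustrative only).
    private static readonly TrieNode[] s_trieNodes = new TrieNode[]
    {
        // 0: root
        new TrieNode { Children = new Dictionary<char, int> { ['a'] = 1, ['b'] = 4 }, SuffixLink = 0, DictionaryLink = -1 },
        // 1: "a"
        new TrieNode { Children = new Dictionary<char, int> { ['b'] = 2 }, SuffixLink = 0, DictionaryLink = -1 },
        // 2: "ab" (its longest proper suffix in the trie is "b", which is a word)
        new TrieNode { Children = new Dictionary<char, int> { ['c'] = 3 }, SuffixLink = 4, DictionaryLink = 4 },
        // 3: "abc" (a word)
        new TrieNode { Children = new Dictionary<char, int>(), SuffixLink = 0, DictionaryLink = -1, MatchLength = 3 },
        // 4: "b" (a word)
        new TrieNode { Children = new Dictionary<char, int>(), SuffixLink = 0, DictionaryLink = -1, MatchLength = 1 },
    };

    // Returns the start index of the first match the automaton detects, or -1.
    public static int IndexOfFirstDetectedMatch(ReadOnlySpan<char> text)
    {
        int state = 0;
        for (int i = 0; i < text.Length; i++)
        {
            int next;
            // Follow suffix links until some state has an edge for text[i], or we are at the root.
            while (!s_trieNodes[state].Children.TryGetValue(text[i], out next) && state != 0)
            {
                state = s_trieNodes[state].SuffixLink;
            }
            state = next; // TryGetValue leaves 0 (the root) here when the root has no such edge

            // A word ends at i if this node is a word or its dictionary link points to one.
            int node = s_trieNodes[state].MatchLength > 0 ? state : s_trieNodes[state].DictionaryLink;
            if (node >= 0)
            {
                return i + 1 - s_trieNodes[node].MatchLength;
            }
        }
        return -1;
    }
}
```

With this table, IndexOfFirstDetectedMatch("xab") returns 2, and IndexOfFirstDetectedMatch("abc") returns 1 because "b" ends before "abc" does; getting leftmost semantics takes extra work on top of this loop.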
Thanks for working on this. It might be counterintuitive, but I would like to avoid adding optimizations to the interpreter that aren't also in Compiled and the source generator, and similarly optimizations to Compiled that aren't in the source generator. It otherwise complicates the story and decision tree about which to use when. With such optimizations we've delayed adding them until they can be added everywhere, even if that means delaying their inclusion in some.
Have you run any perf tests? Just from skimming your linked commits, I expect this will actually regress some patterns due to the lack of any vectorization; given a pattern like "abc|def" today we'll do something like IndexOfAny('c', 'f') to jump ahead, and if those are relatively infrequent in the input text, that will process much faster than processing a trie every character. |
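To illustrate the jump-ahead idea mentioned above for a pattern like "abc|def", here is a rough sketch. It is not the code the regex engine actually emits; the JumpAheadSketch class and IndexOfAbcOrDef method are hypothetical names. It searches for the last character of either alternative with the vectorized IndexOfAny and only then verifies the full candidate.

```csharp
using System;

public static class JumpAheadSketch
{
    // Finds the leftmost occurrence of "abc" or "def", or returns -1.
    public static int IndexOfAbcOrDef(ReadOnlySpan<char> text)
    {
        int pos = 2; // position of the third character of the earliest possible candidate
        while (pos < text.Length)
        {
            // Vectorized skip: jump straight to the next 'c' or 'f'.
            int offset = text.Slice(pos).IndexOfAny('c', 'f');
            if (offset < 0)
            {
                break;
            }
            pos += offset;

            // Verify the three characters ending at pos.
            ReadOnlySpan<char> candidate = text.Slice(pos - 2, 3);
            if (candidate.SequenceEqual("abc".AsSpan()) || candidate.SequenceEqual("def".AsSpan()))
            {
                return pos - 2;
            }
            pos++;
        }
        return -1;
    }
}
```

When 'c' and 'f' are rare in the input, almost all of the time is spent inside IndexOfAny, which is the case a per-character trie walk has to compete with.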
Thanks for the feedback. As explained in #62447 (comment), I have compared the existing Regex engine with a separate AC implementation. Yesterday I ran it with 2, 5, 8 and 10 strings, and here are the results:
Based on that, if the trie has fewer than five matches, it will not be used. The trie will be traversed in case we find a common prefix and use it in the |
Have you tried a pattern like
private string _str = new string(' ', 1_000_000);
[Benchmark(Baseline = true)]
public bool IsMatch1() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)", RegexOptions.Compiled);
[Benchmark]
public bool IsMatch2() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)[mn]", RegexOptions.Compiled);
|
(My general point is anything currently vectorized should typically be favored over the trie. That's more than just starting sets.) |
#71588 implements Aho-Corasick without this API by directly emitting the algorithm's logic, bringing the initialization cost to zero and the matching performance to a level this API could never reach. I haven't investigated how easy it would be to do the same with Teddy, but a public API with a surface area specific to regexes that searches over multiple strings does not seem like the right answer, both design-wise and because we wouldn't get away with the initialization cost. Closing. |
Background and motivation
As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.
Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction time, which will essentially serve the role of an optimized string.IndexOf(string[]).
API Proposal
The MultiStringMatcher constructor accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftmost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.
The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.
If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).
API Usage
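To make the contract of Find concrete, here is a naive reference sketch of the semantics described above: smallest match index first, longest string on ties, and (-1, -1) when nothing matches. NaiveMultiStringSearch is a hypothetical name and not the proposed class; it does no pre-processing and only pins down the expected results. For duplicate strings it returns the first one's number, which is just one of the options the proposal leaves open.

```csharp
using System;

public static class NaiveMultiStringSearch
{
    // Returns the leftmost match position and the number of the longest string matching there,
    // or (-1, -1) if none of the strings occurs in the text.
    public static (int Index, int StringNumber) Find(ReadOnlySpan<char> text, string[] strings)
    {
        for (int i = 0; i <= text.Length; i++)
        {
            ReadOnlySpan<char> rest = text.Slice(i);
            int best = -1;
            for (int s = 0; s < strings.Length; s++)
            {
                // Strictly longer wins, so the first of several equal strings is kept (an assumption).
                if (rest.StartsWith(strings[s].AsSpan(), StringComparison.Ordinal)
                    && (best < 0 || strings[s].Length > strings[best].Length))
                {
                    best = s;
                }
            }
            if (best >= 0)
            {
                return (i, best);
            }
        }
        return (-1, -1);
    }
}
```

For example, NaiveMultiStringSearch.Find("foobar", new[] { "foo", "bar", "baz" }) returns (0, 0), and with an empty array of strings it always returns (-1, -1).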
Alternative Designs
An important design decision is whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:
If we want this to be a general-purpose API, we have to also think of the following:
Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
And do we actually need case-insensitivity?
Risks
The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this; we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators to add files private to themselves, but that's a big if.