crate overhaul #34
Comments
Guessing you mean this link (from the code comments) https://github.com/mischasan/aho-corasick ? Started reading the google doc linked from it and it looks rather intriguing.
@Marwes Oh hmm I forgot about that. I wasn't necessarily thinking about that, but something perhaps more basic:
I don't want to completely over-exert myself here, because I have bigger plans I want to get to, but I'll give mischasan's work a glance to see if we can at least make it possible to do that work later.
Looked into this briefly while optimizing but quickly moved on to lower hanging things. Of note is that bf693c0 got its large speedup precisely in the case where we use the
This is part of https://github.com/mischasan/aho-corasick at least. The extra memory access is unfortunate, though perhaps the increased locality could make up for it 🤷‍♂️
Yeah the increased locality is a good argument. It's been years at this point, but when I benchmarked the regex crate with and without the compression, the lack of compression ended up being faster. But that might have been on regexes small enough such that locality was optimal for both. Not sure.
I have some evidence that the increased locality does indeed have some impact. One of my benchmarks in the rewrite builds an automaton with 5,000 English dictionary words and searches it against random text such that there are no matches. An automaton with byte classes is substantially faster than an automaton without byte classes:
In particular, the benchmark with byte classes appears to make much better use of my CPU's L1 cache, where the benchmark without byte classes leans more heavily on the LL cache.
This is only one benchmark, and the results are likely sensitive to the size of the input and the size of the automaton, but I found this to be a decent piece of evidence.
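To make the byte class idea above a bit more concrete, here is a minimal sketch of why it shrinks a dense transition table. It assumes a simplified scheme in which every byte that appears in some pattern gets its own class and all remaining bytes share class 0. The names `byte_classes` and `dense_table_len` are hypothetical and this is not the crate's actual implementation, which may group bytes differently.

```rust
/// Assigns a class ID to every possible byte value (hypothetical helper).
///
/// Simplifying assumption: each byte that occurs in at least one pattern gets
/// its own class, and all bytes that occur in no pattern share class 0.
/// Returns the byte-to-class map and the total number of classes.
fn byte_classes(patterns: &[&[u8]]) -> ([u8; 256], usize) {
    // Record which byte values occur in any pattern.
    let mut seen = [false; 256];
    for pat in patterns {
        for &b in pat.iter() {
            seen[b as usize] = true;
        }
    }

    // Class 0 is reserved for unseen bytes only if at least one exists,
    // so the class IDs always fit in a u8.
    let has_unseen = seen.iter().any(|&s| !s);
    let mut classes = [0u8; 256];
    let mut next: usize = if has_unseen { 1 } else { 0 };
    for b in 0..256 {
        if seen[b] {
            classes[b] = next as u8;
            next += 1;
        }
    }
    (classes, next)
}

/// Number of entries in a dense transition table with one row per state
/// and one column per alphabet symbol (hypothetical helper).
fn dense_table_len(states: usize, alphabet_len: usize) -> usize {
    states * alphabet_len
}
```

With patterns `abc` and `bcd`, this yields 5 classes instead of 256 byte values, so a dense table for the same number of states needs roughly 50x fewer entries. That is the kind of shrinkage that could plausibly help a table stay in L1 cache.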
This commit introduces a ground-up rewrite of the entire crate. Most or all use cases served by `aho-corasick 0.6` should be served by this rewrite as well. Pretty much everything has been improved. The API is simpler, and much more flexible with many new configuration knobs for controlling the space-vs-time tradeoffs of Aho-Corasick automatons. In particular, there are several tunable optimizations for controlling space usage such as state ID representation and byte classes. The API is simpler in that there is now just one type that encapsulates everything: `AhoCorasick`.
Support for streams has been improved quite a bit, with new APIs for stream search & replace. Test and benchmark coverage has increased quite a bit.
This also fixes a subtle but important bug: empty patterns are now handled correctly. Previously, they could never match, but now they can match at any position.
Finally, I believe this is now the only Aho-Corasick implementation to support leftmost-first and leftmost-longest semantics by using what I think is a novel alteration to the Aho-Corasick construction algorithm. I surveyed some other implementations, and there are a few Java libraries that support leftmost-longest match semantics, but they implement it by adding a sliding queue at search time. I also looked into Perl's regex implementation which has an Aho-Corasick optimization for `foo|bar|baz|...|quux` style regexes, and therefore must somehow implement leftmost-first semantics. The code is a bit hard to grok, but it looks like this is being handled at search time as opposed to baking it into the automaton.
Fixes #18, Fixes #19, Fixes #26, Closes #34
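For readers unfamiliar with the terms, the difference between leftmost-first and leftmost-longest semantics can be pinned down with a brute-force reference search. The sketch below is only a specification of the expected results, not the automaton construction described in the commit message, and every name in it (`MatchKind`, `Match`, `find`) is a hypothetical stand-in rather than the crate's API.

```rust
/// Which match to report when several patterns match at the leftmost position.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MatchKind {
    /// Prefer the pattern that was listed first.
    LeftmostFirst,
    /// Prefer the longest pattern (ties go to the earlier pattern).
    LeftmostLongest,
}

/// A match of pattern number `pattern` at byte range `start..end`.
#[derive(Debug, PartialEq, Eq)]
struct Match {
    pattern: usize,
    start: usize,
    end: usize,
}

/// Brute-force reference search: tries every start position from left to
/// right and every pattern at that position. Quadratic, for illustration only.
fn find(patterns: &[&str], haystack: &str, kind: MatchKind) -> Option<Match> {
    let hay = haystack.as_bytes();
    // `0..=hay.len()` so that an empty pattern can match at the very end too.
    for start in 0..=hay.len() {
        let mut best: Option<Match> = None;
        for (id, pat) in patterns.iter().enumerate() {
            if !hay[start..].starts_with(pat.as_bytes()) {
                continue;
            }
            let cand = Match { pattern: id, start, end: start + pat.len() };
            match kind {
                // The first listed pattern that matches here wins.
                MatchKind::LeftmostFirst => return Some(cand),
                // Keep scanning the patterns and remember the longest one.
                MatchKind::LeftmostLongest => {
                    let longer = best.as_ref().map_or(true, |b| cand.end > b.end);
                    if longer {
                        best = Some(cand);
                    }
                }
            }
        }
        if best.is_some() {
            return best;
        }
    }
    None
}
```

Given the patterns `Sam` and `Samwise` (in that order) and the haystack `Samwise`, leftmost-first reports `Sam` while leftmost-longest reports `Samwise`. An empty pattern matches at position 0 of any haystack, which is consistent with the empty-pattern fix described above.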
I think this crate is due for an overhaul. I think it has existed more or less in its current form for about three years now, but there are some things I'd like to add in order to lay the groundwork for better regex optimizations. Here is what I'm thinking:
`Automaton` trait. Or at least, don't export it.
cc @Marwes