rewrite the entire crate #40

Merged
merged 2 commits into from
Mar 28, 2019

BurntSushi
Owner

This PR introduces a ground-up rewrite of the entire crate. Most or
all use cases served by `aho-corasick 0.6` should be served by this
rewrite as well. Pretty much everything has been improved. The API is
simpler, and much more flexible with many new configuration knobs for
controlling the space-vs-time tradeoffs of Aho-Corasick automatons. In
particular, there are several tunable optimizations for controlling
space usage such as state ID representation and byte classes.
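To make the byte-class idea concrete, here is a minimal sketch of one way such a mapping could be computed. It is an illustration only and not the crate's actual code: the function name `byte_classes` and the rule "every byte that appears in no pattern shares class 0" are assumptions made for this example.

```rust
/// Hypothetical helper (not part of the crate): maps each of the 256
/// byte values to an equivalence class and returns the class count.
///
/// Every byte that occurs in some pattern gets its own class. All
/// bytes that occur in no pattern are interchangeable for matching
/// purposes, so they share a single class.
fn byte_classes(patterns: &[&str]) -> ([u8; 256], usize) {
    // Record which byte values appear in at least one pattern.
    let mut used = [false; 256];
    for pat in patterns {
        for &b in pat.as_bytes() {
            used[b as usize] = true;
        }
    }

    // Reserve class 0 for unused bytes, unless every byte is used
    // (in which case there is nothing to reserve it for).
    let mut next: usize = if used.iter().all(|&u| u) { 0 } else { 1 };
    let mut classes = [0u8; 256];
    for b in 0..256 {
        if used[b] {
            classes[b] = next as u8; // next is at most 255 here
            next += 1;
        }
    }
    (classes, next)
}
```

With the patterns `foo` and `bar`, this yields 6 classes, so a dense transition table indexed by class would need 6 columns per state rather than 256. That is the space-vs-time tradeoff in miniature: each lookup pays for one extra table indirection.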

The API is simpler in that there is now just one type that encapsulates
everything: `AhoCorasick`.

Support for streams has been improved quite a bit, with new APIs for
stream search & replace.

Test and benchmark coverage has increased quite a bit.

This also fixes a subtle but important bug: empty patterns are now
handled correctly. Previously, they could never match, but now they can
match at any position.
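As a rough illustration of what "match at any position" means, the following naive sketch lists the start offsets of non-overlapping matches of a single pattern. It is not how the crate is implemented, and `match_starts` is a hypothetical name used only here. Offsets are byte offsets.

```rust
/// Hypothetical helper (not part of the crate): byte offsets at which
/// non-overlapping matches of `pattern` start in `haystack`.
fn match_starts(pattern: &str, haystack: &str) -> Vec<usize> {
    let (pat, hay) = (pattern.as_bytes(), haystack.as_bytes());
    let mut starts = Vec::new();
    let mut pos = 0;
    // `pos == hay.len()` is included so that an empty pattern can
    // also match at the very end of the haystack.
    while pos <= hay.len() {
        if hay[pos..].starts_with(pat) {
            starts.push(pos);
            // Advance by at least one byte so that an empty match
            // cannot repeat forever at the same position.
            pos += pat.len().max(1);
        } else {
            pos += 1;
        }
    }
    starts
}
```

Under this reading, `match_starts("", "abc")` returns `[0, 1, 2, 3]`: one empty match before each byte and one at the end. A non-empty pattern behaves as usual, so `match_starts("ab", "abab")` returns `[0, 2]`.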

Finally, I believe this is now the only Aho-Corasick implementation to
support leftmost-first and leftmost-longest semantics by using what I
think is a novel alteration to the Aho-Corasick construction algorithm.
I surveyed some other implementations, and there are a few Java
libraries that support leftmost-longest match semantics, but they
implement it by adding a sliding queue at search time. I also looked
into Perl's regex implementation which has an Aho-Corasick optimization
for `foo|bar|baz|...|quux` style regexes, and therefore must somehow
implement leftmost-first semantics. The code is a bit hard to grok, but
it looks like this is being handled at search time as opposed to baking
it into the automaton.
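The difference between the two semantics is easiest to see on a small case. The sketch below is a naive quadratic reference search, written only to pin down what each mode should report. It is not the construction-time technique described above, and the names `MatchKind` and `find` are assumptions made for this example rather than a statement of the crate's API.

```rust
/// Which match to report when several patterns match at the
/// leftmost starting position.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MatchKind {
    /// Prefer the pattern that appears earliest in the pattern list.
    LeftmostFirst,
    /// Prefer the longest pattern; ties go to the earlier pattern.
    LeftmostLongest,
}

/// Naive reference search. Returns (pattern index, start, end) for
/// the first match in `haystack`, or None if nothing matches.
fn find(
    patterns: &[&str],
    haystack: &str,
    kind: MatchKind,
) -> Option<(usize, usize, usize)> {
    let hay = haystack.as_bytes();
    // Try each starting position from left to right.
    for start in 0..=hay.len() {
        let mut best: Option<(usize, usize)> = None; // (index, end)
        for (i, pat) in patterns.iter().enumerate() {
            if !hay[start..].starts_with(pat.as_bytes()) {
                continue;
            }
            let end = start + pat.len();
            match kind {
                // First pattern in list order wins, so stop looking.
                MatchKind::LeftmostFirst => {
                    best = Some((i, end));
                    break;
                }
                // Keep the candidate only if it is strictly longer.
                MatchKind::LeftmostLongest => {
                    if best.map_or(true, |(_, e)| end > e) {
                        best = Some((i, end));
                    }
                }
            }
        }
        if let Some((i, end)) = best {
            return Some((i, start, end));
        }
    }
    None
}
```

For the patterns `["Sam", "Samwise"]` and the haystack `"Samwise"`, leftmost-first reports pattern 0 at `0..3`, which is how a regex alternation such as `Sam|Samwise` behaves, while leftmost-longest reports pattern 1 at `0..7`. With `["b", "abcd"]` and the haystack `"abcd"`, both modes report `abcd` at `0..4`, because it starts further left than `b`.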

Fixes #18, Fixes #19, Fixes #26, Closes #34

This ports over most of the benchmarks from the old benchmark harness
and adds a few more. We're also more principled about testing the NFA vs
DFA variant of Aho-Corasick.

The idea is that we can use this commit to extract benchmark data from
before the rewrite and compare it in a controlled fashion to the
rewritten crate.
BurntSushi merged commit e8493ba into master Mar 28, 2019
BurntSushi deleted the ag/new-and-improved branch March 28, 2019 02:18
Marwes mentioned this pull request Apr 8, 2019