crate overhaul #34

Closed
BurntSushi opened this issue Aug 27, 2018 · 5 comments · Fixed by #40
Comments

@BurntSushi
Owner

BurntSushi commented Aug 27, 2018

I think this crate is due for an overhaul. It has existed more or less in its current form for about three years now, but there are some things I'd like to add in order to lay the groundwork for better regex optimizations. Here is what I'm thinking:

  • Switch to a more builder-oriented API.
  • Get rid of the Automaton trait. Or at least, don't export it.
  • Fix the dense/sparse swap that has been driving me nuts (see Using of sparse and dense is backwards #18).
  • Refine the use of memchr. It is currently quite sub-optimal on non-ASCII text.
  • Employ ideas to make the automaton more memory efficient, or at least, make the trade off more easily configurable.
  • Add support for "leftmost first" and "leftmost longest" match semantics. This is an important piece for using aho-corasick effectively in a regex engine. We will retain the existing match semantics as well. I believe the leftmost semantics will require small tweaks to how the automaton is constructed (e.g., by causing match states to die instead of automatically wrapping around to start again).
  • Don't require the automaton to hold on to the strings (see Is it possible to throw away the strings? #19).
  • Try switching to criterion for benchmarks.

cc @Marwes

@Marwes
Contributor

Marwes commented Aug 27, 2018

Employ ideas to make the automaton more memory efficient, or at least, make the trade off more easily configurable.

I'm guessing you mean this link (from the code comments): https://github.com/mischasan/aho-corasick? I started reading the google doc linked from it and it looks rather intriguing.

@BurntSushi
Owner Author

BurntSushi commented Aug 27, 2018

@Marwes Oh hmm I forgot about that. I wasn't necessarily thinking about that, but something perhaps more basic:

  • Permit more control over the depth of dense/sparse transition representations. (I think right now we use a dense representation for the first 3 levels and sparse for the rest, but is this actually best?)
  • Represent dense transitions more compactly by collapsing bytes into equivalence classes. This is similar to what the regex and csv crates do to save on space. e.g., If you have foo|bar|baz, then you have at most 7 equivalence classes: a, b, f, o, r, z and everything else. So you can have one 256 byte map that translates an arbitrary byte to one of the equivalence classes, and then the result of that is used in the transition lookup, which now has an alphabet of 7 instead of 256. The cost is an extra memory access though, and this can impact performance, so we probably want to make it configurable.

I don't want to completely over-exert myself here, because I have bigger plans I want to get to, but I'll give mischasan's work a glance to see if we can at least make it possible to do that work later.

@Marwes
Contributor

Marwes commented Aug 27, 2018

Permit more control over the depth of dense/sparse transition representations. (I think right now we use a dense representation for the first 3 levels and sparse for the rest, but is this actually best?)

Looked into this briefly while optimizing but quickly moved on to lower-hanging things. Of note is that bf693c0 got its large speedup precisely in the case where we use the sparse representation, because that one is actually faster during construction (if there are few valid transitions). So if we only use AcAutomaton to build a FullAcAutomaton, it is quite possible that using only sparse states would improve the build time (despite having slower lookups).

Represent dense transitions more compactly by collapsing bytes into equivalence classes. This is similar to what the regex and csv crates do to save on space. e.g., If you have foo|bar|baz, then you have at most 7 equivalence classes: a, b, f, o, r, z and everything else. So you can have one 256 byte map that translates an arbitrary byte to one of the equivalence classes, and then the result of that is used in the transition lookup, which now has an alphabet of 7 instead of 256. The cost is an extra memory access though, and this can impact performance, so we probably want to make it configurable.

This is part of https://github.com/mischasan/aho-corasick at least. The extra memory access is unfortunate, though perhaps the increased locality could make up for it 🤷‍♂️

@BurntSushi
Owner Author

BurntSushi commented Aug 27, 2018

Yeah, the increased locality is a good argument. It's been years at this point, but when I benchmarked the regex crate with and without the compression, the lack of compression ended up being faster. But that might have been on regexes small enough that locality was good for both. Not sure.

@BurntSushi
Owner Author

This is part of https://github.com/mischasan/aho-corasick at least. The extra memory access is unfortunate, though perhaps the increased locality could make up for it 🤷‍♂️

I have some evidence that the increased locality does indeed have some impact. One of my benchmarks in the rewrite builds an automaton from 5,000 English dictionary words and searches it against random text such that there are no matches. An automaton with byte classes is substantially faster than one without byte classes:

random10x/leftmost-longest/dfa/nobyteclass-premultiply/5000words
                        time:   [568.86 us 569.41 us 569.96 us]
                        thrpt:  [167.34 MiB/s 167.50 MiB/s 167.66 MiB/s]

random10x/leftmost-longest/dfa/byteclass-premultiply/5000words
                        time:   [256.29 us 256.83 us 257.39 us]
                        thrpt:  [370.56 MiB/s 371.36 MiB/s 372.14 MiB/s]

In particular, the benchmark with byte classes appears to make much better use of my CPU's L1 cache, while the benchmark without byte classes leans more heavily on the LL cache.

perf stat output without byte classes:

 Performance counter stats for 'sh -c  cargo bench -- random10x/leftmost-longest/dfa/nobyteclass-premultiply --profile-time 3':

    11,313,342,091      instructions:u            #    0.79  insn per cycle           (16.63%)
                 0      context-switches:u
                 0      major-faults:u
           146,842      minor-faults:u
    14,366,559,369      cpu-cycles:u                                                  (22.21%)
       372,261,885      bus-cycles:u                                                  (27.75%)
    11,903,647,566      ref-cycles:u                                                  (33.28%)
   <not supported>      stalled-cycles-backend:u
   <not supported>      stalled-cycles-frontend:u
         1,495,669      L1-icache-load-misses:u                                       (33.28%)
     2,448,058,637      branches:u                                                    (33.28%)
        13,709,687      branch-misses:u           #    0.56% of all branches          (27.68%)
       165,361,967      cache-references:u                                            (27.68%)
        21,105,898      cache-misses:u            #   12.763 % of all cache refs      (22.17%)
                 0      mem-loads:u                                                   (22.23%)
       445,415,743      mem-stores:u                                                  (22.26%)
     2,912,568,497      L1-dcache-loads:u                                             (22.34%)
       428,791,475      L1-dcache-load-misses:u   #   14.72% of all L1-dcache hits    (22.39%)
   <not supported>      L1-dcache-prefetch-misses:u
       465,338,940      L1-dcache-stores:u                                            (22.36%)
   <not supported>      L1-dcache-store-misses:u
       114,685,132      LLC-loads:u                                                   (22.31%)
           705,539      LLC-load-misses:u         #    0.62% of all LL-cache hits     (22.23%)
   <not supported>      LLC-prefetch-misses:u
         5,416,693      LLC-stores:u                                                  (11.06%)
                 0      LLC-store-misses:u                                            (11.07%)

       3.958090628 seconds time elapsed

       3.782096000 seconds user
       0.165875000 seconds sys

perf stat with byte classes:

 Performance counter stats for 'sh -c  cargo bench -- random10x/leftmost-longest/dfa/byteclass-premultiply --profile-time 3':

    16,381,145,502      instructions:u            #    1.18  insn per cycle           (16.60%)
                 0      context-switches:u
                 0      major-faults:u
           146,816      minor-faults:u
    13,935,928,942      cpu-cycles:u                                                  (22.19%)
       370,263,509      bus-cycles:u                                                  (27.79%)
    11,860,659,686      ref-cycles:u                                                  (33.39%)
   <not supported>      stalled-cycles-backend:u
   <not supported>      stalled-cycles-frontend:u
         1,371,588      L1-icache-load-misses:u                                       (33.44%)
     3,694,703,211      branches:u                                                    (33.51%)
        12,426,884      branch-misses:u           #    0.34% of all branches          (27.94%)
        70,081,291      cache-references:u                                            (27.88%)
        12,288,777      cache-misses:u            #   17.535 % of all cache refs      (22.21%)
                 0      mem-loads:u                                                   (22.15%)
       431,712,749      mem-stores:u                                                  (22.22%)
     5,368,823,517      L1-dcache-loads:u                                             (22.24%)
       335,640,192      L1-dcache-load-misses:u   #    6.25% of all L1-dcache hits    (22.24%)
   <not supported>      L1-dcache-prefetch-misses:u
       424,523,253      L1-dcache-stores:u                                            (22.23%)
   <not supported>      L1-dcache-store-misses:u
        11,055,503      LLC-loads:u                                                   (22.12%)
         1,062,696      LLC-load-misses:u         #    9.61% of all LL-cache hits     (22.08%)
   <not supported>      LLC-prefetch-misses:u
         1,273,806      LLC-stores:u                                                  (11.07%)
                 0      LLC-store-misses:u                                            (11.07%)

       3.966444634 seconds time elapsed

       3.704133000 seconds user
       0.250810000 seconds sys

This is only one benchmark, and the results are likely sensitive to the size of the input and the size of the automaton, but I found this to be a decent piece of evidence.

BurntSushi added a commit that referenced this issue Mar 28, 2019
This commit introduces a ground-up rewrite of the entire crate. Most or
all use cases served by `aho-corasick 0.6` should be served by this
rewrite as well. Pretty much everything has been improved. The API is
simpler, and much more flexible with many new configuration knobs for
controlling the space-vs-time tradeoffs of Aho-Corasick automatons. In
particular, there are several tunable optimizations for controlling
space usage such as state ID representation and byte classes.

The API is simpler in that there is now just one type that encapsulates
everything: `AhoCorasick`.

Support for streams has been improved quite a bit, with new APIs for
stream search & replace.

Test and benchmark coverage has increased quite a bit.

This also fixes a subtle but important bug: empty patterns are now
handled correctly. Previously, they could never match, but now they can
match at any position.

Finally, I believe this is now the only Aho-Corasick implementation to
support leftmost-first and leftmost-longest semantics by using what I
think is a novel alteration to the Aho-Corasick construction algorithm.
I surveyed some other implementations, and there are a few Java
libraries that support leftmost-longest match semantics, but they
implement it by adding a sliding queue at search time. I also looked
into Perl's regex implementation which has an Aho-Corasick optimization
for `foo|bar|baz|...|quux` style regexes, and therefore must somehow
implement leftmost-first semantics. The code is a bit hard to grok, but
it looks like this is being handled at search time as opposed to baking
it into the automaton.

Fixes #18, Fixes #19, Fixes #26, Closes #34