crate overhaul #34

Closed
BurntSushi opened this issue Aug 27, 2018 · 5 comments · Fixed by #40
Comments

@BurntSushi
Owner

BurntSushi commented Aug 27, 2018

I think this crate is due for an overhaul. It has existed more or less in its current form for about three years now, but there are some things I'd like to add in order to lay the groundwork for better regex optimizations. Here is what I'm thinking:

  • Switch to a more builder-oriented API.
  • Get rid of the Automaton trait. Or at least, don't export it.
  • Fix the dense/sparse swap that has been driving me nuts (see Using of sparse and dense is backwards #18).
  • Refine the use of memchr. It is currently quite sub-optimal on non-ASCII text.
  • Employ ideas to make the automaton more memory efficient, or at least, make the trade off more easily configurable.
  • Add support for "leftmost first" and "leftmost longest" match semantics. This is an important piece for using aho-corasick effectively in a regex engine. We will retain the existing match semantics as well. I believe the leftmost semantics will require small tweaks to how the automaton is constructed (e.g., by causing match states to die instead of automatically wrapping around to start again).
  • Don't require the automaton to hold on to the strings (see Is it possible to throw away the strings? #19).
  • Try switching to criterion for benchmarks.

cc @Marwes

@Marwes
Contributor

Marwes commented Aug 27, 2018

Employ ideas to make the automaton more memory efficient, or at least, make the trade off more easily configurable.

I'm guessing you mean this link (from the code comments): https://github.com/mischasan/aho-corasick? I started reading the google doc linked from it and it looks rather intriguing.

@BurntSushi
Owner Author

BurntSushi commented Aug 27, 2018

@Marwes Oh hmm I forgot about that. I wasn't necessarily thinking about that, but something perhaps more basic:

  • Permit more control over the depth of dense/sparse transition representations. (I think right now we use a dense representation for the first 3 levels and sparse for the rest, but is this actually best?)
  • Represent dense transitions more compactly by collapsing bytes into equivalence classes. This is similar to what the regex and csv crates do to save on space. e.g., If you have foo|bar|baz, then you have at most 7 equivalence classes: a, b, f, o, r, z and everything else. So you can have one 256 byte map that translates an arbitrary byte to one of the equivalence classes, and then the result of that is used in the transition lookup, which now has an alphabet of 7 instead of 256. The cost is an extra memory access though, and this can impact performance, so we probably want to make it configurable.

I don't want to completely over-exert myself here, because I have bigger plans I want to get to, but I'll give mischasan's work a glance to see if we can at least make it possible to do that work later.

@Marwes
Contributor

Marwes commented Aug 27, 2018

Permit more control over the depth of dense/sparse transition representations. (I think right now we use a dense representation for the first 3 levels and sparse for the rest, but is this actually best?)

Looked into this briefly while optimizing but quickly moved on to lower-hanging things. Of note is that bf693c0 got its large speedup precisely in the case where we use the sparse representation, because that one is actually faster during construction (if there are few valid transitions). So if we only use AcAutomaton to build a FullAcAutomaton, it is quite possible that using only sparse states would improve the build time (despite having slower lookups).

Represent dense transitions more compactly by collapsing bytes into equivalence classes. This is similar to what the regex and csv crates do to save on space. e.g., If you have foo|bar|baz, then you have at most 7 equivalence classes: a, b, f, o, r, z and everything else. So you can have one 256 byte map that translates an arbitrary byte to one of the equivalence classes, and then the result of that is used in the transition lookup, which now has an alphabet of 7 instead of 256. The cost is an extra memory access though, and this can impact performance, so we probably want to make it configurable.

This is part of https://github.com/mischasan/aho-corasick at least. The extra memory access is unfortunate, though perhaps the increased locality could make up for it 🤷‍♂️

@BurntSushi
Owner Author

BurntSushi commented Aug 27, 2018

Yeah, the increased locality is a good argument. It's been years at this point, but when I benchmarked the regex crate with and without the compression, the lack of compression ended up being faster. But that might have been on regexes small enough that locality was good for both. Not sure.

@BurntSushi
Owner Author

This is part of https://github.com/mischasan/aho-corasick at least. The extra memory access is unfortunate, though perhaps the increased locality could make up for it 🤷‍♂️

I have some evidence that the increased locality does indeed have some impact. One of my benchmarks in the rewrite builds an automaton from 5,000 English dictionary words and searches it against random text such that there are no matches. An automaton with byte classes is substantially faster than one without byte classes:

random10x/leftmost-longest/dfa/nobyteclass-premultiply/5000words
                        time:   [568.86 us 569.41 us 569.96 us]
                        thrpt:  [167.34 MiB/s 167.50 MiB/s 167.66 MiB/s]

random10x/leftmost-longest/dfa/byteclass-premultiply/5000words
                        time:   [256.29 us 256.83 us 257.39 us]
                        thrpt:  [370.56 MiB/s 371.36 MiB/s 372.14 MiB/s]

In particular, the benchmark with byte classes appears to make much better use of my CPU's L1 cache, while the benchmark without byte classes leans more heavily on the LL cache.

perf stat output without byte classes:

 Performance counter stats for 'sh -c  cargo bench -- random10x/leftmost-longest/dfa/nobyteclass-premultiply --profile-time 3':

    11,313,342,091      instructions:u            #    0.79  insn per cycle           (16.63%)
                 0      context-switches:u
                 0      major-faults:u
           146,842      minor-faults:u
    14,366,559,369      cpu-cycles:u                                                  (22.21%)
       372,261,885      bus-cycles:u                                                  (27.75%)
    11,903,647,566      ref-cycles:u                                                  (33.28%)
   <not supported>      stalled-cycles-backend:u
   <not supported>      stalled-cycles-frontend:u
         1,495,669      L1-icache-load-misses:u                                       (33.28%)
     2,448,058,637      branches:u                                                    (33.28%)
        13,709,687      branch-misses:u           #    0.56% of all branches          (27.68%)
       165,361,967      cache-references:u                                            (27.68%)
        21,105,898      cache-misses:u            #   12.763 % of all cache refs      (22.17%)
                 0      mem-loads:u                                                   (22.23%)
       445,415,743      mem-stores:u                                                  (22.26%)
     2,912,568,497      L1-dcache-loads:u                                             (22.34%)
       428,791,475      L1-dcache-load-misses:u   #   14.72% of all L1-dcache hits    (22.39%)
   <not supported>      L1-dcache-prefetch-misses:u
       465,338,940      L1-dcache-stores:u                                            (22.36%)
   <not supported>      L1-dcache-store-misses:u
       114,685,132      LLC-loads:u                                                   (22.31%)
           705,539      LLC-load-misses:u         #    0.62% of all LL-cache hits     (22.23%)
   <not supported>      LLC-prefetch-misses:u
         5,416,693      LLC-stores:u                                                  (11.06%)
                 0      LLC-store-misses:u                                            (11.07%)

       3.958090628 seconds time elapsed

       3.782096000 seconds user
       0.165875000 seconds sys

perf stat with byte classes:

 Performance counter stats for 'sh -c  cargo bench -- random10x/leftmost-longest/dfa/byteclass-premultiply --profile-time 3':

    16,381,145,502      instructions:u            #    1.18  insn per cycle           (16.60%)
                 0      context-switches:u
                 0      major-faults:u
           146,816      minor-faults:u
    13,935,928,942      cpu-cycles:u                                                  (22.19%)
       370,263,509      bus-cycles:u                                                  (27.79%)
    11,860,659,686      ref-cycles:u                                                  (33.39%)
   <not supported>      stalled-cycles-backend:u
   <not supported>      stalled-cycles-frontend:u
         1,371,588      L1-icache-load-misses:u                                       (33.44%)
     3,694,703,211      branches:u                                                    (33.51%)
        12,426,884      branch-misses:u           #    0.34% of all branches          (27.94%)
        70,081,291      cache-references:u                                            (27.88%)
        12,288,777      cache-misses:u            #   17.535 % of all cache refs      (22.21%)
                 0      mem-loads:u                                                   (22.15%)
       431,712,749      mem-stores:u                                                  (22.22%)
     5,368,823,517      L1-dcache-loads:u                                             (22.24%)
       335,640,192      L1-dcache-load-misses:u   #    6.25% of all L1-dcache hits    (22.24%)
   <not supported>      L1-dcache-prefetch-misses:u
       424,523,253      L1-dcache-stores:u                                            (22.23%)
   <not supported>      L1-dcache-store-misses:u
        11,055,503      LLC-loads:u                                                   (22.12%)
         1,062,696      LLC-load-misses:u         #    9.61% of all LL-cache hits     (22.08%)
   <not supported>      LLC-prefetch-misses:u
         1,273,806      LLC-stores:u                                                  (11.07%)
                 0      LLC-store-misses:u                                            (11.07%)

       3.966444634 seconds time elapsed

       3.704133000 seconds user
       0.250810000 seconds sys

This is only one benchmark, and the results are likely sensitive to the size of the input and the size of the automaton, but I found this to be a decent piece of evidence.

BurntSushi added a commit that referenced this issue Mar 28, 2019
This commit introduces a ground-up rewrite of the entire crate. Most or
all use cases served by `aho-corasick 0.6` should be served by this
rewrite as well. Pretty much everything has been improved. The API is
simpler, and much more flexible with many new configuration knobs for
controlling the space-vs-time tradeoffs of Aho-Corasick automatons. In
particular, there are several tunable optimizations for controlling
space usage such as state ID representation and byte classes.

The API is simpler in that there is now just one type that encapsulates
everything: `AhoCorasick`.

Support for streams has been improved quite a bit, with new APIs for
stream search & replace.

Test and benchmark coverage has increased quite a bit.

This also fixes a subtle but important bug: empty patterns are now
handled correctly. Previously, they could never match, but now they can
match at any position.

Finally, I believe this is now the only Aho-Corasick implementation to
support leftmost-first and leftmost-longest semantics by using what I
think is a novel alteration to the Aho-Corasick construction algorithm.
I surveyed some other implementations, and there are a few Java
libraries that support leftmost-longest match semantics, but they
implement it by adding a sliding queue at search time. I also looked
into Perl's regex implementation which has an Aho-Corasick optimization
for `foo|bar|baz|...|quux` style regexes, and therefore must somehow
implement leftmost-first semantics. The code is a bit hard to grok, but
it looks like this is being handled at search time as opposed to baking
it into the automaton.

Fixes #18, Fixes #19, Fixes #26, Closes #34