aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b` #891

BurntSushi · 2022-07-19T12:45:40Z

Discussed in #890

^{Originally posted by Guillermogsjc July 19, 2022}
Hi, to match efficiently large amounts of alternations, I guess it is interesting to trigger aho_corasick variant here

regex/src/exec.rs

Line 91 in 9ca3099

/// An Aho-Corasick automaton with leftmost-first match semantics.

regarding doc:

/// This is only set when the entire regex is a simple unanchored
/// alternation of literals. We could probably use it more circumstances,
/// but this is already hacky enough in this architecture.

The question is: is there any way to use word boundaries in such a way this expression is highly optimized for a thing like this?

r"\b(a|... #massive ammount of literal alternations here# ...|z)\b"

or with (?-u:\b) instead of \b .

And... regarding PERFORMANCE documentation here

there is no problem with using non-greedy matching or having lots of alternations in your regex

this previously stated regex would be in the set of "no problem" ?

Thanks

The text was updated successfully, but these errors were encountered:

BurntSushi · 2022-07-19T12:48:58Z

I think the code comment is somewhat clear that the Aho-Corasick optimization only applies in some very limited cases today. Ideally, yes, we should absolutely enable it for the case of \b(literal1|literal2|...|literalN)\b. However, I do not want to add any more subtle optimizations (and this would be one example) to the regex crate until #656 is done. The current infrastructure is just too brittle.

To work around this today, I think you have a couple of options available to you:

Use the aho-corasick crate directly. When you find a match, then that it's surrounded by word boundaries manually.
Use regex-automata. You'll have to use (?-u:\b) and the regex will take longer to build than what aho-corasick would do, but it should be about as fast.
Don't worry about the Aho-Corasick optimization at all. Use (?-u:\b). In that case, this crate's lazy DFA will be used, which is likely to be as fast as Aho-Corasick in practice. However, this really depends on the number of literals you have here.

BurntSushi · 2022-07-19T13:34:10Z

This example demonstrates why this optimization is subtle, and more generally, the difficulty of composition with respect to regex internals:

$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '(?:sam|samwise)'
build pike vm time:  29.693µs
 create cache time:  658ns
        cache size:  800
       search time:  4.221µs
            counts:  [1]

PatternID(0): [0, 3)

$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '\b(?:sam|samwise)\b'
build pike vm time:  26.135µs
 create cache time:  585ns
        cache size:  896
       search time:  11.146884ms
            counts:  [1]

PatternID(0): [0, 7)

The key thing here is that you can't just try to match the inner (?:sam|samwise) without regard for the surrounding word boundaries, because the surrounding \b's might change which thing gets reported.

Here are some ideas (mostly written with a future me as a target audience, so apologies if they appear cryptic):

Normally, Aho-Corasick is used with leftmost-first or "preference order" match semantics. This way, it is purely consistent with the match semantics of the regex crate. But in this case, maybe we need to use standard match semantics and execute an overlapping search. That way, we can be sure that we're guaranteed to see every possible match. So for example, this code:

use aho_corasick::{MatchKind, AhoCorasickBuilder};

fn main() {
    let ac = AhoCorasickBuilder::new()
        .match_kind(MatchKind::Standard)
        .build(&["sam", "samwise"]);
    for m in ac.find_overlapping_iter("samwise") {
        println!("{:?}", (m.pattern(), m.start()..m.end()));
    }
}

outputs:

(0, 0..3)
(1, 0..7)

So if we iterated over those matches and then tried to test whether the surrounding word boundary assertions held and reported the first one that did, we'd get the right answer here.

Unfortunately, I don't think this works in all cases. Consider this one:

use aho_corasick::{MatchKind, AhoCorasickBuilder};

fn main() {
    let ac = AhoCorasickBuilder::new()
        .match_kind(MatchKind::Standard)
        .build(&["sam the wise", "the"]);
    for m in ac.find_overlapping_iter("sam the wise") {
        println!("{:?}", (m.pattern(), m.start()..m.end()));
    }
}

outputs:

(1, 4..7)
(0, 0..12)

So if we applied our idea from above, we'd report the as the match since it's the first one reported by Aho-Corasick where the word boundary assertions are satisfied. But this would be the wrong answer:

$ regex-cli count matches nfa thompson pikevm --matches 'sam the wise' '\b(?:sam the wise|the)\b'
build pike vm time:  32.37µs
 create cache time:  601ns
        cache size:  1136
       search time:  11.419901ms
            counts:  [1]

PatternID(0): [0, 12)

The right answer is sam the wise because of leftmost-first or "preference order" matching.

I think the reason why this case fails is because \b matches at a location other then the start or end of sam the wise. Thus, there is an opportunity for a substring in sam the wise to match even if preference should be given to sam the wise.

More generally, my intuition is that for any regex of the form \b(?:re)\b, we can match re independently and then test the word boundary assertions as long as the following holds:

We report every possible match of re, including overlapping matches.
For all strings s in the language defined by re, it must be true that \b matches at s[0], s[s.len()-1], both or neither, but nowhere else.

Probably this can be generalized even further to re-left(?:re-inner)re-right.

I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`. - [PERF #787](rust-lang/regex#787), [PERF #891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>

BurntSushi added the enhancement label Jul 19, 2022

BurntSushi mentioned this issue Jul 19, 2022

Missed optimization for alterated word-lists — patterns like \b(foo|bar|baz|quux|...)\b #787

Closed

BurntSushi added the fix-incoming label Feb 14, 2023

BurntSushi closed this as completed in aa64e6d Jul 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b` #891

aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b` #891

BurntSushi commented Jul 19, 2022

BurntSushi commented Jul 19, 2022 •

edited

Loading

BurntSushi commented Jul 19, 2022

aho-corasick should be applied for cases like \b(literal1|literal2|...|literalN)\b #891

aho-corasick should be applied for cases like \b(literal1|literal2|...|literalN)\b #891

Comments

BurntSushi commented Jul 19, 2022

Discussed in #890

BurntSushi commented Jul 19, 2022 • edited Loading

BurntSushi commented Jul 19, 2022

aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b` #891

aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b` #891

BurntSushi commented Jul 19, 2022 •

edited

Loading