grep-regex: improve inner literal detection

This fixes an interesting performance bug where the inner literal extractor would sometimes choose a sub-optimal literal. For example, consider the regex: \x20+Sherlock Holmes\x20+ (The `\x20` is the ASCII code for a space character, which we use here to just make it clearer. It otherwise does not matter.) Previously, this would see the initial \x20 and then stop collecting literals after the `+` repetition operator. This was because the inner literal detector was adapter from the prefix literal detector, which had to stop here. Namely, while \x20S would be a valid prefix (for example), \x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so on. So the prefix detector would have to stop at the repetition operator. Otherwise, only searching for \x20S could potentially scan farther then the starting position of the next match. However, for inner literals, this calculus no longer makes sense. We can freely search for, e.g., \x20S without missing matches that start with \x20\x20S precisely because we know this is an inner literal which may not correspond to the start of a match. With this fix, the literal that is now detected is \x20Sherlock Holmes\x20 Which is much better. We achieve this by no longer "cutting" literals after seeing a `+` repetition operator. Instead, we permit literals to continue to be extended. The reason why this is important is because using \x20 as the literal to search for is generally bad juju since it is so common. In fact, we should probably add more logic here to either avoid such things or give up entirely on the inner literal optimization if it detected a literal that we think is very common. But we punt on such things here.
BurntSushi · Feb 17, 2020 · ad97e9c · ad97e9c
1 parent 24f8a3e
commit ad97e9c
Show file tree

Hide file tree

Showing 2 changed files with 4 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,10 @@ Performance improvements:
 
 * [PERF #1381](https://github.com/BurntSushi/ripgrep/pull/1381):
   Directory traversal is sped up with speculative ignore-file existence checks.
+* PERF:
+  Improve inner literal detection to cover more cases more effectively.
+  e.g., ` +Sherlock Holmes +` now has ` Sherlock Holmes ` extracted instead
+  of ` `.
 
 Feature enhancements:
 

diff --git a/grep-regex/src/literal.rs b/grep-regex/src/literal.rs
@@ -146,7 +146,6 @@ fn union_required(expr: &Hir, lits: &mut Literals) {
                 hir::RepetitionKind::ZeroOrMore => lits.cut(),
                 hir::RepetitionKind::OneOrMore => {
                     union_required(&x.hir, lits);
-                    lits.cut();
                 }
                 hir::RepetitionKind::Range(ref rng) => {
                     let (min, max) = match *rng {