Skip to content

Commit

Permalink
grep-regex: improve inner literal detection
Browse files Browse the repository at this point in the history
This fixes an interesting performance bug where the inner literal
extractor would sometimes choose a sub-optimal literal. For example,
consider the regex:

    \x20+Sherlock Holmes\x20+

(The `\x20` is the ASCII code for a space character, which we use here
to just make it clearer. It otherwise does not matter.)

Previously, this would see the initial \x20 and then stop collecting
literals after the `+` repetition operator. This was because the inner
literal detector was adapter from the prefix literal detector, which had
to stop here. Namely, while \x20S would be a valid prefix (for example),
\x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so
on. So the prefix detector would have to stop at the repetition
operator. Otherwise, only searching for \x20S could potentially scan
farther then the starting position of the next match.

However, for inner literals, this calculus no longer makes sense. We can
freely search for, e.g., \x20S without missing matches that start with
\x20\x20S precisely because we know this is an inner literal which may
not correspond to the start of a match.

With this fix, the literal that is now detected is

    \x20Sherlock Holmes\x20

Which is much better. We achieve this by no longer "cutting" literals
after seeing a `+` repetition operator. Instead, we permit literals to
continue to be extended.

The reason why this is important is because using \x20 as the literal to
search for is generally bad juju since it is so common. In fact, we
should probably add more logic here to either avoid such things or give
up entirely on the inner literal optimization if it detected a literal
that we think is very common. But we punt on such things here.
  • Loading branch information
BurntSushi committed Feb 17, 2020
1 parent 24f8a3e commit ad97e9c
Show file tree
Hide file tree
Showing 2 changed files with 4 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@ Performance improvements:

* [PERF #1381](https://github.com/BurntSushi/ripgrep/pull/1381):
Directory traversal is sped up with speculative ignore-file existence checks.
* PERF:
Improve inner literal detection to cover more cases more effectively.
e.g., ` +Sherlock Holmes +` now has ` Sherlock Holmes ` extracted instead
of ` `.

Feature enhancements:

Expand Down
1 change: 0 additions & 1 deletion grep-regex/src/literal.rs
Original file line number Diff line number Diff line change
Expand Up @@ -146,7 +146,6 @@ fn union_required(expr: &Hir, lits: &mut Literals) {
hir::RepetitionKind::ZeroOrMore => lits.cut(),
hir::RepetitionKind::OneOrMore => {
union_required(&x.hir, lits);
lits.cut();
}
hir::RepetitionKind::Range(ref rng) => {
let (min, max) = match *rng {
Expand Down

0 comments on commit ad97e9c

Please sign in to comment.