giant rollup #1486

BurntSushi · 2020-02-15T13:35:48Z

This PR merges a number of outstanding PRs. In many cases, I've added touched up PRs in order to merge them. (e.g., Writing commit messages, adding changelog entries, etc.)

This commit adds a simple `.exists()` check for `.gitignore`, `.ignore`, and other similar files before actually calling `File::open(…)` in `GitIgnoreBuilder::add`. The reason is that a simple existence check via `stat` can be faster than actually trying to `open` the file, see https://stackoverflow.com/a/12774387/704831. As we typically expect(?) the number of directories *without* ignore files to be much larger than the number of directories *with* ignore files, this leads to an overall speedup. The performance gain is not huge for `rg`, but can be quite significant if more `.gitignore`-like files are added via `add_custom_ignore_filename`. The speedup is *larger* for folders with *low* files-per-directory ratios. Note though that we do not do this check on Windows until a specific analysis there suggests this is beneficial. Namely, Windows generally has slower file system operations, so it's not clear whether this speculative check is actually a benefit or not. Benchmark results ----------------- `rg --files` in my home folder (200k results, 6.5 files per directory): | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `./rg-master --files` | 396.4 ± 3.2 | 390.9 | 400.0 | 1.05 | | `./rg-feature --files` | 376.0 ± 3.6 | 369.3 | 383.5 | 1.00 | `rg --files --hidden` in my home folder (800k results, 5.4 files per directory) | Command | Mean [s] | Min [s] | Max [s] | Relative | |:---|---:|---:|---:|---:| | `./rg-master --files --hidden` | 1.575 ± 0.012 | 1.560 | 1.597 | 1.06 | | `./rg-feature --files --hidden` | 1.479 ± 0.011 | 1.464 | 1.496 | 1.00 | `rg --files` in the chromium-79.0.3915.2 source tree (300k results, 12.7 files per directory) | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `~/rg-master --files` | 445.2 ± 5.3 | 435.6 | 453.0 | 1.04 | | `~/rg-feature --files` | 428.9 ± 7.0 | 418.2 | 440.0 | 1.00 | `rg --files` in the linux-5.3 source tree (65k results, 15.1 files per directory) | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `./rg-master --files` | 94.5 ± 1.9 | 89.8 | 98.5 | 1.02 | | `./rg-feature --files` | 92.6 ± 2.7 | 88.4 | 98.7 | 1.00 | Closes #1381

I'm surprised this wasn't caught until now, but if a test directory already exists, then it was reused. This can result in hard to debug problems with tests when, e.g., file names are changed and a recursive search is executed.

--context-separator='' still adds a new line separator, which could still potentially be useful. So we add a new `--no-context-separator` flag that completely disables context separators even when the -A/-B/-C context flags are used. Closes #1390

This includes .patch and .diff files. Fixes #1418, Closes #1419

Closes #1426

`benches/bench.rs` uses lazy_static but Cargo.toml does not declare a dependency on it. This causes rustc to use its own internal private copy instead. Sometimes this causes unintuitive errors like this Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=942243 The underlying issue is https://github.com/rust-lang/rust#27812 but it can be avoided by explicitly declaring the dependency, which you are supposed to do anyways. Closes #1435

This is for RPM package SPEC files. Fixes #946, Closes #1449

This flag, when used in conjunction with --count or --count-matches, will print a result for each file searched even if there were zero matches in that file. This is off by default but can be enabled to make ripgrep behave more like grep. This also clarifies some of the defaults for the grep-printer::SummaryBuilder type. Closes #1370, Closes #1405

Closes #1436

I improved the help documentation in the following manner and for the following reasons: 1. It's only logical to put the default sub-option on the first possible line, as well as to separately mention that it is indeed the default sub-option. 2. Additional options for the flags should describe the main points of their purpose without requiring user to read the whole help entry. In my opinion, the information sub-options' influence on multi-threading and speed are important enough to warrant their inclusion in each sub-option's description line text. Closes #1434

This adds it to the guide and the docs for the --type flag. Fixes #1344, Closes #1472

This fixes an interesting performance bug where the inner literal extractor would sometimes choose a sub-optimal literal. For example, consider the regex: \x20+Sherlock Holmes\x20+ (The `\x20` is the ASCII code for a space character, which we use here to just make it clearer. It otherwise does not matter.) Previously, this would see the initial \x20 and then stop collecting literals after the `+` repetition operator. This was because the inner literal detector was adapter from the prefix literal detector, which had to stop here. Namely, while \x20S would be a valid prefix (for example), \x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so on. So the prefix detector would have to stop at the repetition operator. Otherwise, only searching for \x20S could potentially scan farther then the starting position of the next match. However, for inner literals, this calculus no longer makes sense. We can freely search for, e.g., \x20S without missing matches that start with \x20\x20S precisely because we know this is an inner literal which may not correspond to the start of a match. With this fix, the literal that is now detected is \x20Sherlock Holmes\x20 Which is much better. We achieve this by no longer "cutting" literals after seeing a `+` repetition operator. Instead, we permit literals to continue to be extended. The reason why this is important is because using \x20 as the literal to search for is generally bad juju since it is so common. In fact, we should probably add more logic here to either avoid such things or give up entirely on the inner literal optimization if it detected a literal that we think is very common. But we punt on such things here.

When the -w/--word-regexp was used, ripgrep would in many cases fail to apply literal optimizations. This occurs specifically when the regex given by the user is an alternation of literals with no common prefixes or suffixes, e.g., rg -w 'foo|bar|baz|quux' In this case, the inner literal detector fails. Normally, this would result in literal prefixes being detected by the regex engine. But because of the -w/--word-regexp flag, the actual regex that we run ends up looking like this: (^|\W)(foo|bar|baz|quux)($|\W) which of course defeats any prefix or suffix literal optimizations in the regex crate's somewhat naive extractor. (A better extractor could still do literal optimizations in the above case.) So this commit fixes this by falling back to prefix or suffix literals when they're available instead of prematurely giving up and assuming the regex engine will do the rest.

Previously, ripgrep would always defer to the regex engine's capturing matches in order to implement word matching. Namely, ripgrep would determine the correct match offsets via a capturing group, since the word regex is itself generated from the user supplied regex. Unfortunately, the regex engine's capturing mode is still fairly slow, so this commit adds a fast path to avoid capturing mode in the vast majority of cases. See comments in the code for details.

This exposes the underlying `Glob` used to compile the matcher. This can be useful for wrapping up the glob matcher in other types. Closes #1454

The `globset::Glob` type [`new`] function creates a new value with an `&str` parameter which returns an `Result<Glob, Error>` object. This is exactly what [`std::str::FromStr::from_str`][`std::str::FromStr`] defines. Libraries like [`clap`] use [`std::str::FromStr`] to create objects from provided commandline arguments. This change makes this library usable without a newtype wrapper. [`std::str::FromStr`]: https://doc.rust-lang.org/std/str/trait.FromStr.html [`clap`]: https://docs.rs/clap/2.33.0/clap/macro.value_t.html [`new`]: https://docs.rs/globset/0.4.4/globset/struct.Glob.html#method.new Closes #1447

This commit adds a new --no-ignore-exclude flag that permits disabling the use of .git/info/exclude filtering. Local exclusions are manual configurations to a repository and are not shared, so it is sometimes useful to disable to get a consistent view of a repository. This also adds a new section to the man page that describes automatic filtering. Closes #1420

It turns out that querying the CWD while in a directory that no longer exists results in an error. Since the CWD is queried every time ripgrep starts---whether it needs it or not---for dealing with glob matching, ripgrep winds up being completely useless inside a non-existent directory. We fix this in a few different ways: * Firstly, if std::env::current_dir() fails, then we fall back to trying to read the `PWD` environment variable. * If that fails, that we return a more sensible error message so that a user can at least react to the problem. Previously, the error message was inscrutable. * Finally, we try to avoid the problem altogether by building empty glob matchers if not globs were provided, thus side-stepping querying the CWD completely. Fixes #1291, Closes #1400

Git looks for this file in GIT_COMMON_DIR, which is usually the same as GIT_DIR (.git). However, when searching inside a linked worktree, .git is usually a file that contains the path of the actual git dir, which in turn contains a file "commondir" which references the directory where info/exclude may reside, alongside other configuration shared across all worktrees. This directory is usually the git dir of the main worktree. Unlike git this does *not* read environment variables GIT_DIR and GIT_COMMON_DIR, because it is not clear how to interpret them when searching multiple repositories. Fixes #1445, Closes #1446

This makes it so the caller can more easily refactor from single-threaded to multi-threaded walking. If they want to support both, this makes it easier to do so with a single initialization code-path. In particular, it side-steps the need to put everything into an `Arc`. This is not a breaking change because it strictly increases the number of allowed inputs to `WalkParallel::run`. Closes #1410, Closes #1432

This changes ripgrep to use ignore's new support for borrowing data when walking in parallel.

On top of the parallel-walk's closures, this provides a Visitor API. This clarifies the role of the two different closures in the `run` API and allows implementing of `Drop` for post-processing once traversal is finished. The closure API is maintained not just for compatibility but also convinience for simple cases. Fixes #469, Closes #1430

This appears to be another transcription bug from copying this code from the prefix literal detection from inside the regex crate. Namely, when it comes to inner literals, we only want to treat counted repetition as two separate cases: the case when the minimum match is 0 and the case when the minimum match is more than 0. In the former case, we treat `e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as just `e`. We could definitely do better here. e.g., This means regexes like `(foo){10}` will only have `foo` extracted as a literal, where searching for the full literal would likely be faster. The actual bug here was that we were not implementing this logic correctly. Namely, we weren't always "cutting" the literals in the second case to prevent them from being expanded. Fixes #1319, Closes #1367

This commit fixes an inconsistency between the serial and the parallel directory walkers around visiting a directory for which the user holds insufficient permissions to descend into. The serial walker does produce a successful entry for a directory that it cannot descend into due to insufficient permissions. However, before this change that has not been the case for the parallel walker, which would produce an `Err` item not only when descending into a directory that it cannot read from but also for the directory entry itself. This change brings the behaviour of the parallel variant in line with that of the serial one. Fixes #1346, Closes #1365

It looks like `completions` is owned by Fish itself. Third party completions should go in `vendor_completions.d`. Fixes #1485

This flag prevents ripgrep from requiring one to search a git repository in order to respect git-related ignore rules (global, .gitignore and local excludes). This actually corresponds to behavior ripgrep had long ago, but #934 changed that. It turns out that users were relying on this buggy behavior. In most cases, fixing it as simple as converting one's rules to .ignore or .rgignore files. Unfortunately, there are other use cases---like Perforce automatically respecting .gitignore files---that make a strong case for ripgrep to at least support this. The UX of a flag like this is absolutely atrocious. It's so obscure that it's really not worth explicitly calling it out anywhere. Moreover, the error cases that occur when this flag isn't used (but its behavior is desirable) will not be intuitive, do not seem easily detectable and will not guide users to this flag. Nevertheless, the motivation for this is just barely strong enough for me to begrudgingly accept this. Fixes #1414, Closes #1416

This adds a universal --no-unicode flag that is intended to work for all supported regex engines. There is no point in retaining --no-pcre2-unicode, so we make them aliases to the new flags and deprecate them.

Due to how walkdir works if symlinks are not followed, symlinks to directories are seen as simple files by ripgrep. This caused a panic in some cases due to receiving a WalkEvent::Exit event without a corresponding WalkEvent::Dir event. This is fixed by looking at the metadata of the file in the case of a symlink to determine if it's a directory. We are careful to only do this stat check when the depth of the entry is 0, as this bug only impacts us when 1) we aren't following symlinks generally and 2) the user provides a symlinked directory that we do follow as a top-level path to search. Fixes #1389, Closes #1397

Change the meaning of `Quit` message. Now it means terminate. The final "dance" is unnecessary, because by the time quitting begins, no thread will ever spawn a new `Work`. The trick was to replace the heuristic spin-loop with blocking receive. Closes #1337

BurntSushi · 2020-02-17T22:28:22Z

w00t! 🎉

BurntSushi force-pushed the ag/rollup branch from 15a7e50 to 2c2a2da Compare February 15, 2020 16:04

sharkdp and others added 11 commits February 15, 2020 17:28

tests: remove existing test directory

f398d12

I'm surprised this wasn't caught until now, but if a test directory already exists, then it was reused. This can result in hard to debug problems with tests when, e.g., file names are changed and a recursive search is executed.

cli: add --no-context-separator flag

f4bb695

--context-separator='' still adds a new line separator, which could still potentially be useful. So we add a new `--no-context-separator` flag that completely disables context separators even when the -A/-B/-C context flags are used. Closes #1390

ignore/types: add 'diff' file type

89cfcb3

This includes .patch and .diff files. Fixes #1418, Closes #1419

changelog: add notes about new file types

9e19f37

ignore/types: add xhtml to xml file type

b01091f

Closes #1426

ignore/types: add spec file type

f47efc1

This is for RPM package SPEC files. Fixes #946, Closes #1449

readme: simplify openSUSE instructions

ce1b921

Closes #1436

BurntSushi force-pushed the ag/rollup branch from efdc041 to 5b5a4fc Compare February 15, 2020 22:43

doc: document all file type

169d2e4

This adds it to the guide and the docs for the --type flag. Fixes #1344, Closes #1472

BurntSushi force-pushed the ag/rollup branch from 5b5a4fc to 169d2e4 Compare February 16, 2020 10:05

BurntSushi and others added 6 commits February 16, 2020 10:34

globset: add GlobMatcher::glob

3b1d1b3

This exposes the underlying `Glob` used to compile the matcher. This can be useful for wrapping up the glob matcher in other types. Closes #1454

BurntSushi force-pushed the ag/rollup branch from 1cefd54 to 17f90e3 Compare February 17, 2020 01:39

BurntSushi and others added 2 commits February 16, 2020 21:03

BurntSushi mentioned this pull request Feb 17, 2020

ignore: use git commondir for sourcing .git/info/exclude #1446

Closed

BurntSushi and others added 5 commits February 16, 2020 22:24

changelog: highlight a bad performance regression

b837c37

core: simplify parallel walking using borrows

b967ad1

This changes ripgrep to use ignore's new support for borrowing data when walking in parallel.

BurntSushi force-pushed the ag/rollup branch from de0c222 to 9fd5b95 Compare February 17, 2020 18:16

jakubadamw and others added 3 commits February 17, 2020 13:31

deb: fix fish completion install location

bb49e6f

It looks like `completions` is owned by Fish itself. Third party completions should go in `vendor_completions.d`. Fixes #1485

BurntSushi mentioned this pull request Feb 17, 2020

[RFC] ripgrep: expose custom ignore files via --ignore-file-name #1416

Closed

BurntSushi and others added 3 commits February 17, 2020 15:34

cli: add --no-unicode, deprecate --no-pcre2-unicode

4452eb1

This adds a universal --no-unicode flag that is intended to work for all supported regex engines. There is no point in retaining --no-pcre2-unicode, so we make them aliases to the new flags and deprecate them.

BurntSushi merged commit 3d59bd9 into master Feb 17, 2020

BurntSushi deleted the ag/rollup branch February 17, 2020 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

giant rollup #1486

giant rollup #1486

BurntSushi commented Feb 15, 2020

BurntSushi commented Feb 17, 2020 •

edited

Loading

giant rollup #1486

giant rollup #1486

Conversation

BurntSushi commented Feb 15, 2020

BurntSushi commented Feb 17, 2020 • edited Loading

BurntSushi commented Feb 17, 2020 •

edited

Loading