high peak memory usage with parallel breadth first directory traversal #1550
Comments
Using a regex instead of a simple pattern doesn't seem to make a difference; this one behaves identically:
Downloading all of crates.io takes 50GB of disk space, and the latest version of every crate uncompressed is 18GB. If you don't want to download that, point me to a memory profiler and I can run it myself now that I have the corpus locally.
I've actually been meaning to download all crates, since it makes for a nice test corpus. So I will investigate this. Thank you for making it easy.

Without investigating, I'm pretty sure I know the culprit. If I'm right, then it's less of a leak and more of a conscious trade-off. Namely, ripgrep must buffer enough room to store at least one line contiguously in memory. If you run across a very long line, then the buffer expands to that capacity, but it never shrinks itself. ripgrep uses one of these buffers per thread, so if each buffer has to expand to a large size, they never contract. While contracting won't reduce per-thread peak memory usage, it could reduce the overall process peak memory usage in most cases.

With that said, there is no particular reason why the behavior must be this way. It's just that I've never had a good test case for it. Some kind of contraction is probably wise, although whether contraction actually results in memory being freed back to the OS is a separate issue that I suspect is mostly out of ripgrep's control.

But yeah, I'll do some investigation to confirm my hypothesis. Thanks for the report!
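The contraction idea described above can be sketched as follows. This is a hypothetical illustration only: `LineBuffer`, `BASELINE`, and the method names are assumptions for the sketch, not ripgrep's actual API.

```rust
// Hypothetical sketch: a per-thread line buffer that grows to hold one
// long line contiguously, then contracts back toward a baseline capacity.
// `LineBuffer`, `BASELINE`, and the method names are illustrative only.
const BASELINE: usize = 64 * 1024; // assumed 64 KiB starting capacity

struct LineBuffer {
    buf: Vec<u8>,
}

impl LineBuffer {
    fn new() -> LineBuffer {
        LineBuffer { buf: Vec::with_capacity(BASELINE) }
    }

    /// Grow so that a line of `len` bytes fits contiguously.
    fn reserve_line(&mut self, len: usize) {
        if len > self.buf.capacity() {
            self.buf.reserve(len - self.buf.len());
        }
    }

    /// Contract after a long line is processed, so one huge line does not
    /// pin a large allocation for the rest of the search. Whether the
    /// allocator returns the memory to the OS is outside our control.
    fn contract(&mut self) {
        self.buf.clear();
        if self.buf.capacity() > BASELINE {
            self.buf.shrink_to_fit();
            self.buf.reserve(BASELINE);
        }
    }
}
```

With one such buffer per worker thread, calling `contract` after each oversized line would keep the sum of the buffers near the baseline instead of near the longest line each thread ever saw.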
OK, I did some experimenting and my guess was wrong. It turns out that this is a result of how ripgrep implements parallel directory traversal. Namely, since it uses a queue to communicate new work among threads, this results in a breadth first traversal. In this case, the initial directory is extremely wide, because it contains the latest version of every crate. Upon entering each directory, a matcher for processing gitignore rules is created. Since this is breadth first, a gitignore matcher is created for every directory before ripgrep ever starts to process a single file. This is primarily what contributes to a large peak memory usage. You can test this hypothesis by searching without parallelism.

Fixing this is unfortunately not quite so easy. @bmalehorn was actually barking up the right tree a few years ago, and even submitted a PR to switch the traversal to depth first in #450. But the upsides didn't seem so apparent to me then, and it appeared to change the output order a bit too much for my taste. With this example in front of me, though, it seems clearer that switching to a depth first traversal is better than what we have today. Since then, however, Crossbeam seems to no longer provide a thread safe stack. One can be implemented without much code, but I'm hesitant to add my own.

So I think this bug is blocked on either rethinking how traversal works (I've always wanted to use rayon) or switching to a depth first traversal with minimal risk.
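The queue-versus-stack distinction driving this can be shown on a toy tree. The functions below are an illustration of the two visit orders, not ripgrep's actual walker; each node is `(name, indices of children)` with node 0 as the root.

```rust
use std::collections::VecDeque;

// Toy model of the two traversal orders, not ripgrep's actual walker.
fn bfs<'a>(tree: &[(&'a str, Vec<usize>)]) -> Vec<&'a str> {
    let mut order = Vec::new();
    let mut queue: VecDeque<usize> = VecDeque::from([0]);
    while let Some(i) = queue.pop_front() {
        order.push(tree[i].0);
        // A queue visits every sibling directory before any grandchild,
        // so state for all of them (e.g. gitignore matchers) is live at once.
        queue.extend(&tree[i].1);
    }
    order
}

fn dfs<'a>(tree: &[(&'a str, Vec<usize>)]) -> Vec<&'a str> {
    let mut order = Vec::new();
    let mut stack: Vec<usize> = vec![0];
    while let Some(i) = stack.pop() {
        order.push(tree[i].0);
        // Push children in reverse so the first child is popped first; a
        // stack finishes one subtree before starting the next.
        stack.extend(tree[i].1.iter().rev());
    }
    order
}
```

On a wide, shallow tree like the crates corpus, the breadth first order touches every top-level directory before descending into any of them, which is exactly the peak-memory pattern described above.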
I've consulted my friendly neighbourhood god of concurrency, Acrimon, about their recommendation for a concurrent stack.

Listing all files up front before creating the gitignore handlers would solve this, but it would also create a different kind of scalability limit, with all the file paths being loaded in memory at once, and I'm not sure how that interacts with filesystem cache behavior.
Listing all files up front is another way of looking at it, but it is a major architectural change and it is difficult for me to reason about its performance characteristics.
This replaces the use of channels in the parallel directory traversal with a simple stack. The primary motivation for this change is to reduce peak memory usage. In particular, when using a channel (which is a queue), we wind up visiting files in a breadth first fashion. Using a stack switches us to a depth first traversal. While there are no real intrinsic differences, depth first traversal generally tends to use less memory because directory trees are more commonly wide than they are deep.

In particular, the queue/stack size itself is not the only concern. In one recent case documented in #1550, a user wanted to search all Rust crates. The directory structure was shallow but extremely wide, with a single directory containing all crates. This in turn results in descending into each of those directories and building a gitignore matcher for each (since most crates have `.gitignore` files) before ever searching a single file. This means that ripgrep has all such matchers in memory simultaneously, which winds up using quite a bit of memory. In a depth first traversal, peak memory usage is much lower because gitignore matchers are built and discarded more quickly.

In the case of searching all crates, the peak memory usage decrease is dramatic. On my system, it shrinks by an order of magnitude, from almost 1GB to 50MB. The decline in peak memory usage is consistent across other use cases as well, but is typically more modest. For example, searching the Linux repo has a 50% decrease in peak memory usage and searching the Chromium repo has a 25% decrease.

Search times generally remain unchanged, although some ad hoc benchmarks that I typically run have gotten a bit slower. As far as I can tell, this appears to be a result of scheduling changes. Namely, the depth first traversal seems to result in searching some very large files towards the end of the search, which reduces the effectiveness of parallelism and makes the overall search take longer.

This seems to suggest that a stack isn't optimal. It would instead perhaps be better to prioritize searching larger files first, but it's not quite clear how to do this without introducing more overhead (getting the file size for each file requires a stat call).

Fixes #1550
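A hedged sketch of the "larger files first" idea, assuming sizes have already been gathered via one `fs::metadata` (stat) call per file, which is precisely the overhead mentioned above. This is an illustration, not ripgrep's implementation; `largest_first` and the `(size, path)` shape are assumptions for the sketch.

```rust
use std::collections::BinaryHeap;

// Hypothetical sketch, not ripgrep's implementation: order a batch of
// (size, path) work items so the largest files are searched first. The
// sizes would have to come from one fs::metadata (stat) call per file,
// which is exactly the extra overhead discussed above.
fn largest_first(items: Vec<(u64, String)>) -> Vec<String> {
    // BinaryHeap is a max-heap; tuples compare by size first, so popping
    // yields files from largest to smallest.
    let mut heap: BinaryHeap<(u64, String)> = items.into_iter().collect();
    let mut order = Vec::with_capacity(heap.len());
    while let Some((_size, path)) = heap.pop() {
        order.push(path);
    }
    order
}
```

Handing the big files to workers early keeps all threads busy for the tail of the search, at the cost of one extra syscall per file during traversal.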
What version of ripgrep are you using?
ripgrep 12.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
How did you install ripgrep?
cargo install ripgrep
on rustc 1.42.0 (b8cedc004 2020-03-09)
What operating system are you using ripgrep on?
Linux, derived from Ubuntu 18.04
Describe your question, feature request, or bug.
rg seems to leak memory when searching for an expression in the latest versions of all crates from crates.io. The memory usage gradually increases and peaks at around 1GB.
If this is a bug, what are the steps to reproduce the behavior?
where decompress-one-crate.sh is
Sample rg invocation exhibiting the behavior is:

If this is a bug, what is the actual behavior?
Memory usage increases gradually as the search progresses, reaching as high as 1GB, and never goes back down.
If this is a bug, what is the expected behavior?
Memory usage should not accumulate.