high peak memory usage with parallel breadth first directory traversal #1550
Comments
Using a regex instead of a simple pattern doesn't seem to make a difference; this one behaves identically:
Downloading all of crates.io takes 50GB of disk space, and the latest version of every crate uncompressed is 18GB. If you don't want to download that, point me to a memory profiler and I can run it myself now that I have the corpus locally.
I've actually been meaning to download all crates, since it makes for a nice test corpus. So I will investigate this. Thank you for making it easy.

Without investigating, I'm pretty sure I know the culprit. If I'm right, then it's less of a leak and more of a conscious trade-off. Namely, ripgrep must buffer enough room to store at least one line contiguously in memory. If you run across a very long line, then the buffer expands to that capacity, but it never shrinks itself. ripgrep uses one of these buffers per thread, so if each buffer has to expand to a large size, they never contract. While contracting won't reduce per-thread peak memory usage, it could reduce the overall process peak memory usage in most cases.

With that said, there is no particular reason why the behavior must be this way. It's just that I've never had a good test case for it. Some kind of contraction is probably wise, although whether contraction actually results in memory being freed back to the OS is a separate issue that I suspect is mostly out of ripgrep's control.

But yeah, I'll do some investigation to confirm my hypothesis. Thanks for the report!
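The contraction idea described above can be sketched as follows. This is a hypothetical illustration only: `LineBuffer`, `BASELINE`, and the method names are assumptions for the sketch, not ripgrep's actual API.

```rust
// Hypothetical sketch: a per-thread line buffer that grows to hold one
// long line contiguously, then contracts back toward a baseline capacity.
// `LineBuffer`, `BASELINE`, and the method names are illustrative only.
const BASELINE: usize = 64 * 1024; // assumed 64 KiB starting capacity

struct LineBuffer {
    buf: Vec<u8>,
}

impl LineBuffer {
    fn new() -> LineBuffer {
        LineBuffer { buf: Vec::with_capacity(BASELINE) }
    }

    /// Grow so that a line of `len` bytes fits contiguously.
    fn reserve_line(&mut self, len: usize) {
        if len > self.buf.capacity() {
            self.buf.reserve(len - self.buf.len());
        }
    }

    /// Contract after a long line is processed, so one huge line does not
    /// pin a large allocation for the rest of the search. Whether the
    /// allocator returns the memory to the OS is outside our control.
    fn contract(&mut self) {
        self.buf.clear();
        if self.buf.capacity() > BASELINE {
            self.buf.shrink_to_fit();
            self.buf.reserve(BASELINE);
        }
    }
}
```

With one such buffer per worker thread, calling `contract` after each oversized line would keep the sum of the buffers near the baseline instead of near the longest line each thread ever saw.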
OK, I did some experimenting and my guess was wrong. It turns out that this is a result of how ripgrep implements parallel directory traversal. Namely, since it uses a queue to communicate new work among threads, this results in a breadth first traversal. In this case, the initial directory is extremely wide, because it contains the latest version of every crate. Upon entering each directory, a matcher for processing gitignore rules is created. Since this is breadth first, a gitignore matcher is created for every directory before ripgrep ever starts to process a single file. This is primarily what contributes to a large peak memory usage. You can test this hypothesis by searching without parallelism.

Fixing this is unfortunately not quite so easy. @bmalehorn was actually barking up the right tree a few years ago, and even submitted a PR to switch the traversal to depth first in #450. But the upsides didn't seem so apparent to me then, and it appeared to change the output order a bit too much for my taste. With this example in front of me, though, it seems clearer that switching to a depth first traversal is better than what we have today. Since then, however, Crossbeam seems to no longer provide a thread safe stack. One can be implemented without much code, but I'm hesitant to add my own.

So I think this bug is blocked on either rethinking how traversal works (I've always wanted to use rayon) or switching to a depth first traversal with minimal risk.
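The queue-versus-stack distinction driving this can be shown on a toy tree. The functions below are an illustration of the two visit orders, not ripgrep's actual walker; each node is `(name, indices of children)` with node 0 as the root.

```rust
use std::collections::VecDeque;

// Toy model of the two traversal orders, not ripgrep's actual walker.
fn bfs<'a>(tree: &[(&'a str, Vec<usize>)]) -> Vec<&'a str> {
    let mut order = Vec::new();
    let mut queue: VecDeque<usize> = VecDeque::from([0]);
    while let Some(i) = queue.pop_front() {
        order.push(tree[i].0);
        // A queue visits every sibling directory before any grandchild,
        // so state for all of them (e.g. gitignore matchers) is live at once.
        queue.extend(&tree[i].1);
    }
    order
}

fn dfs<'a>(tree: &[(&'a str, Vec<usize>)]) -> Vec<&'a str> {
    let mut order = Vec::new();
    let mut stack: Vec<usize> = vec![0];
    while let Some(i) = stack.pop() {
        order.push(tree[i].0);
        // Push children in reverse so the first child is popped first; a
        // stack finishes one subtree before starting the next.
        stack.extend(tree[i].1.iter().rev());
    }
    order
}
```

On a wide, shallow tree like the crates corpus, the breadth first order touches every top-level directory before descending into any of them, which is exactly the peak-memory pattern described above.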
I've consulted my friendly neighbourhood god of concurrency, Acrimon, about their recommendation for a concurrent stack.

Listing all files up front before creating the gitignore handlers would solve this, but it would also create a different kind of scalability limit, with all the file paths being loaded in memory at once, and I'm not sure how that interacts with filesystem cache behavior.
Listing all files up front is another way of looking at it, but it is a major architectural change and it is difficult for me to reason about its performance characteristics.
This replaces the use of channels in the parallel directory traversal with a simple stack. The primary motivation for this change is to reduce peak memory usage. In particular, when using a channel (which is a queue), we wind up visiting files in a breadth first fashion. Using a stack switches us to a depth first traversal. While there are no real intrinsic differences, depth first traversal generally tends to use less memory because directory trees are more commonly wide than they are deep.

In particular, the queue/stack size itself is not the only concern. In one recent case documented in #1550, a user wanted to search all Rust crates. The directory structure was shallow but extremely wide, with a single directory containing all crates. This in turn results in descending into each of those directories and building a gitignore matcher for each (since most crates have `.gitignore` files) before ever searching a single file. This means that ripgrep has all such matchers in memory simultaneously, which winds up using quite a bit of memory. In a depth first traversal, peak memory usage is much lower because gitignore matchers are built and discarded more quickly.

In the case of searching all crates, the peak memory usage decrease is dramatic. On my system, it shrinks by an order of magnitude, from almost 1GB to 50MB. The decline in peak memory usage is consistent across other use cases as well, but is typically more modest. For example, searching the Linux repo has a 50% decrease in peak memory usage and searching the Chromium repo has a 25% decrease.

Search times generally remain unchanged, although some ad hoc benchmarks that I typically run have gotten a bit slower. As far as I can tell, this appears to be a result of scheduling changes. Namely, the depth first traversal seems to result in searching some very large files towards the end of the search, which reduces the effectiveness of parallelism and makes the overall search take longer.

This seems to suggest that a stack isn't optimal. It would instead perhaps be better to prioritize searching larger files first, but it's not quite clear how to do this without introducing more overhead (getting the file size for each file requires a stat call).

Fixes #1550
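A hedged sketch of the "larger files first" idea, assuming sizes have already been gathered via one `fs::metadata` (stat) call per file, which is precisely the overhead mentioned above. This is an illustration, not ripgrep's implementation; `largest_first` and the `(size, path)` shape are assumptions for the sketch.

```rust
use std::collections::BinaryHeap;

// Hypothetical sketch, not ripgrep's implementation: order a batch of
// (size, path) work items so the largest files are searched first. The
// sizes would have to come from one fs::metadata (stat) call per file,
// which is exactly the extra overhead discussed above.
fn largest_first(items: Vec<(u64, String)>) -> Vec<String> {
    // BinaryHeap is a max-heap; tuples compare by size first, so popping
    // yields files from largest to smallest.
    let mut heap: BinaryHeap<(u64, String)> = items.into_iter().collect();
    let mut order = Vec::with_capacity(heap.len());
    while let Some((_size, path)) = heap.pop() {
        order.push(path);
    }
    order
}
```

Handing the big files to workers early keeps all threads busy for the tail of the search, at the cost of one extra syscall per file during traversal.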
What version of ripgrep are you using?
ripgrep 12.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
How did you install ripgrep?
cargo install ripgrep
on rustc 1.42.0 (b8cedc004 2020-03-09)
What operating system are you using ripgrep on?
Linux, derived from Ubuntu 18.04
Describe your question, feature request, or bug.
rg seems to leak memory when searching for an expression in the latest versions of all crates from crates.io. The memory usage gradually increases and peaks at around 1GB.
If this is a bug, what are the steps to reproduce the behavior?
where decompress-one-crate.sh is
Sample rg invocation exhibiting the behavior is:

If this is a bug, what is the actual behavior?
Memory usage increases gradually as the search progresses, reaching as high as 1GB, and never goes back down.
If this is a bug, what is the expected behavior?
Memory usage should not accumulate.