
walk: Use unbounded channels #1414

Closed

wants to merge 3 commits

Conversation

@tavianator (Collaborator) commented Oct 30, 2023

Includes #1413. The relevant commit is 5200718:

We originally switched to bounded channels for backpressure to fix #918.
However, bounded channels have a significant initialization overhead as
they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread
gets a limited-size pool of WorkerResults. When the size limit is hit,
the sender thread has to wait for the receiver thread to handle a result
from that pool and recycle it.

Inspired by snmalloc, results are recycled by sending the boxed result
over a channel back to the thread that allocated it. By allocating and
freeing each WorkerResult from the same thread, allocator contention is
reduced dramatically. And since we now pass results by pointer instead
of by value, message passing overhead is reduced as well.

Fixes #1408.
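A minimal sketch of the recycling scheme, with assumed names (ResultPool and its fields are illustrative, not the PR's actual types):

```rust
use crossbeam_channel::{unbounded, Receiver, Sender};

// Stand-in for fd's real WorkerResult (a directory entry or an error).
struct WorkerResult {
    path: std::path::PathBuf,
}

// Each sender thread owns a limited pool of boxed results. The receiver
// returns every box through the recycle channel once it has handled the
// result, and that return traffic doubles as the backpressure signal.
struct ResultPool {
    allocated: usize,
    limit: usize, // e.g. 0x4000 results per thread
    recycle_tx: Sender<Box<WorkerResult>>,
    recycle_rx: Receiver<Box<WorkerResult>>,
}

impl ResultPool {
    fn new(limit: usize) -> Self {
        let (recycle_tx, recycle_rx) = unbounded();
        Self { allocated: 0, limit, recycle_tx, recycle_rx }
    }

    // Box a result, reusing a recycled allocation when one is available,
    // and blocking once the per-thread limit has been reached.
    fn alloc(&mut self, result: WorkerResult) -> Box<WorkerResult> {
        if let Ok(mut boxed) = self.recycle_rx.try_recv() {
            *boxed = result; // reuse: freed and reallocated on this thread
            return boxed;
        }
        if self.allocated < self.limit {
            self.allocated += 1;
            return Box::new(result);
        }
        // Pool exhausted: wait for the receiver to recycle a box.
        let mut boxed = self.recycle_rx.recv().expect("channel closed");
        *boxed = result;
        boxed
    }
}
```

The message sent over the main channel is then the box plus a clone of recycle_tx, so the receiver can hand each allocation straight back to the thread that made it.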

I benchmarked this with bfs's benchmark suite. Both fd builds (fd-master and fd-unbounded) were built against a version of ignore that includes BurntSushi/ripgrep@d938e95, so we won't actually see performance quite this good until a new ignore release happens. Still, the results are good enough, IMO, that this fixes #1408 and #1362.

Benchmark results (updated)

Complete traversal

linux v6.5 (86,380 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -false` | 20.2 ± 0.6 | 19.0 | 21.5 | 1.14 ± 0.11 |
| `find bench/corpus/linux -false` | 98.5 ± 0.2 | 98.1 | 98.9 | 5.58 ± 0.49 |
| `fd -u '^$' bench/corpus/linux` | 190.4 ± 47.0 | 133.0 | 231.3 | 10.78 ± 2.83 |
| `fd-master -u '^$' bench/corpus/linux` | 60.6 ± 15.7 | 29.5 | 70.5 | 3.43 ± 0.94 |
| `fd-unbounded -u '^$' bench/corpus/linux` | 17.7 ± 1.5 | 15.3 | 20.0 | 1.00 |

rust 1.72.1 (192,714 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/rust -false` | 52.2 ± 1.7 | 50.2 | 56.4 | 1.55 ± 0.08 |
| `find bench/corpus/rust -false` | 313.5 ± 1.2 | 311.6 | 315.6 | 9.29 ± 0.40 |
| `fd -u '^$' bench/corpus/rust` | 274.6 ± 37.6 | 256.6 | 352.3 | 8.14 ± 1.17 |
| `fd-master -u '^$' bench/corpus/rust` | 55.7 ± 16.7 | 45.2 | 86.0 | 1.65 ± 0.50 |
| `fd-unbounded -u '^$' bench/corpus/rust` | 33.8 ± 1.4 | 31.6 | 36.3 | 1.00 |

chromium 119.0.6036.2 (2,119,292 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/chromium -false` | 513.5 ± 11.4 | 490.6 | 528.1 | 2.06 ± 0.05 |
| `find bench/corpus/chromium -false` | 3285.0 ± 6.8 | 3275.2 | 3295.2 | 13.15 ± 0.14 |
| `fd -u '^$' bench/corpus/chromium` | 2538.8 ± 46.8 | 2476.3 | 2582.0 | 10.16 ± 0.22 |
| `fd-master -u '^$' bench/corpus/chromium` | 295.3 ± 17.6 | 264.9 | 307.4 | 1.18 ± 0.07 |
| `fd-unbounded -u '^$' bench/corpus/chromium` | 249.9 ± 2.6 | 246.5 | 254.8 | 1.00 |

Printing paths

Without colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux` | 32.3 ± 1.6 | 27.5 | 34.8 | 1.03 ± 0.09 |
| `find bench/corpus/linux` | 103.0 ± 0.4 | 102.4 | 103.7 | 3.27 ± 0.24 |
| `fd -u --search-path bench/corpus/linux` | 192.0 ± 47.9 | 133.2 | 230.4 | 6.09 ± 1.58 |
| `fd-master -u --search-path bench/corpus/linux` | 87.1 ± 12.8 | 48.3 | 95.2 | 2.76 ± 0.45 |
| `fd-unbounded -u --search-path bench/corpus/linux` | 31.5 ± 2.3 | 29.5 | 40.2 | 1.00 |

With colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -color` | 208.7 ± 2.6 | 204.3 | 214.0 | 2.71 ± 0.09 |
| `fd -u --search-path bench/corpus/linux --color=always` | 185.3 ± 49.1 | 133.2 | 230.3 | 2.40 ± 0.64 |
| `fd-master -u --search-path bench/corpus/linux --color=always` | 95.9 ± 21.1 | 67.4 | 121.2 | 1.24 ± 0.28 |
| `fd-unbounded -u --search-path bench/corpus/linux --color=always` | 77.1 ± 2.4 | 74.0 | 81.6 | 1.00 |

Parallelism

rust 1.72.1

-j1

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j1 bench/corpus/rust -false` | 213.6 ± 0.5 | 212.8 | 214.4 | 1.00 |
| `fd -j1 -u '^$' bench/corpus/rust` | 277.2 ± 0.5 | 276.5 | 278.3 | 1.30 ± 0.00 |
| `fd-master -j1 -u '^$' bench/corpus/rust` | 283.4 ± 0.6 | 282.3 | 284.2 | 1.33 ± 0.00 |
| `fd-unbounded -j1 -u '^$' bench/corpus/rust` | 281.3 ± 0.5 | 280.6 | 282.1 | 1.32 ± 0.00 |

-j2

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j2 bench/corpus/rust -false` | 193.3 ± 1.0 | 191.5 | 195.2 | 1.24 ± 0.01 |
| `fd -j2 -u '^$' bench/corpus/rust` | 222.1 ± 5.3 | 216.5 | 231.8 | 1.42 ± 0.04 |
| `fd-master -j2 -u '^$' bench/corpus/rust` | 160.2 ± 1.5 | 158.3 | 162.7 | 1.03 ± 0.01 |
| `fd-unbounded -j2 -u '^$' bench/corpus/rust` | 155.9 ± 0.9 | 154.6 | 157.5 | 1.00 |

-j3

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j3 bench/corpus/rust -false` | 117.2 ± 6.2 | 108.7 | 125.6 | 1.05 ± 0.06 |
| `fd -j3 -u '^$' bench/corpus/rust` | 221.2 ± 2.5 | 217.1 | 223.7 | 1.99 ± 0.03 |
| `fd-master -j3 -u '^$' bench/corpus/rust` | 118.1 ± 2.4 | 112.9 | 121.0 | 1.06 ± 0.02 |
| `fd-unbounded -j3 -u '^$' bench/corpus/rust` | 111.2 ± 0.8 | 109.7 | 112.6 | 1.00 |

-j4

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j4 bench/corpus/rust -false` | 83.9 ± 4.1 | 77.1 | 89.6 | 1.00 |
| `fd -j4 -u '^$' bench/corpus/rust` | 231.4 ± 5.2 | 219.9 | 235.0 | 2.76 ± 0.15 |
| `fd-master -j4 -u '^$' bench/corpus/rust` | 95.4 ± 4.1 | 89.2 | 100.3 | 1.14 ± 0.07 |
| `fd-unbounded -j4 -u '^$' bench/corpus/rust` | 87.8 ± 1.1 | 85.7 | 89.5 | 1.05 ± 0.05 |

-j6

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j6 bench/corpus/rust -false` | 61.1 ± 1.4 | 58.1 | 63.8 | 1.00 |
| `fd -j6 -u '^$' bench/corpus/rust` | 230.6 ± 15.9 | 200.4 | 252.7 | 3.77 ± 0.27 |
| `fd-master -j6 -u '^$' bench/corpus/rust` | 74.0 ± 5.7 | 66.7 | 80.6 | 1.21 ± 0.10 |
| `fd-unbounded -j6 -u '^$' bench/corpus/rust` | 63.5 ± 0.9 | 61.8 | 64.8 | 1.04 ± 0.03 |

-j8

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j8 bench/corpus/rust -false` | 53.5 ± 2.2 | 50.1 | 57.4 | 1.04 ± 0.05 |
| `fd -j8 -u '^$' bench/corpus/rust` | 236.8 ± 13.2 | 222.3 | 259.0 | 4.61 ± 0.27 |
| `fd-master -j8 -u '^$' bench/corpus/rust` | 65.0 ± 7.5 | 57.0 | 73.2 | 1.27 ± 0.15 |
| `fd-unbounded -j8 -u '^$' bench/corpus/rust` | 51.4 ± 0.8 | 50.2 | 52.7 | 1.00 |

-j12

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j12 bench/corpus/rust -false` | 52.4 ± 2.4 | 47.9 | 57.0 | 1.27 ± 0.07 |
| `fd -j12 -u '^$' bench/corpus/rust` | 247.3 ± 13.7 | 230.0 | 268.8 | 5.99 ± 0.38 |
| `fd-master -j12 -u '^$' bench/corpus/rust` | 59.0 ± 12.0 | 46.9 | 73.3 | 1.43 ± 0.29 |
| `fd-unbounded -j12 -u '^$' bench/corpus/rust` | 41.3 ± 1.3 | 38.8 | 43.2 | 1.00 |

-j16

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j16 bench/corpus/rust -false` | 73.1 ± 5.5 | 65.9 | 83.5 | 2.06 ± 0.17 |
| `fd -j16 -u '^$' bench/corpus/rust` | 273.2 ± 10.3 | 246.7 | 280.3 | 7.69 ± 0.41 |
| `fd-master -j16 -u '^$' bench/corpus/rust` | 77.9 ± 0.9 | 76.3 | 80.0 | 2.19 ± 0.09 |
| `fd-unbounded -j16 -u '^$' bench/corpus/rust` | 35.5 ± 1.3 | 33.2 | 38.1 | 1.00 |

Details

Versions

$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-unbounded --version
fd 8.7.1

@tavianator (Collaborator, Author)

An extra benchmark to justify closing #1408:

$ hyperfine -w2 fd{,-{master,after}}" -u . /tmp/empty"
Benchmark 1: fd -u . /tmp/empty
  Time (mean ± σ):     148.9 ms ±  12.1 ms    [User: 7.5 ms, System: 142.8 ms]
  Range (min … max):   129.4 ms … 165.2 ms    18 runs
 
Benchmark 2: fd-master -u . /tmp/empty
  Time (mean ± σ):      73.8 ms ±  10.0 ms    [User: 4.8 ms, System: 72.4 ms]
  Range (min … max):    57.3 ms …  83.9 ms    34 runs
 
Benchmark 3: fd-after -u . /tmp/empty
  Time (mean ± σ):       5.2 ms ±   0.7 ms    [User: 1.6 ms, System: 7.8 ms]
  Range (min … max):     1.8 ms …   7.2 ms    268 runs
 
Summary
  fd-after -u . /tmp/empty ran
   14.22 ± 2.71 times faster than fd-master -u . /tmp/empty
   28.68 ± 4.50 times faster than fd -u . /tmp/empty

@sharkdp mentioned this pull request Nov 1, 2023
@sharkdp (Owner) commented Nov 1, 2023

Wow 🤩

I would like to understand this first before merging. The situation is the following: we have an MPSC scenario where the consumer is sometimes too slow to handle the incoming results (even though its only job is to print the paths to a console). In the past, we used an unbounded channel, which would lead to high memory usage because the channel was buffering all those results. Then we switched to a bounded channel, but that came at a high initialization cost because the bounded channels pre-allocate a fixed-size buffer for the messages. The messages are WorkerResults (size_of::<WorkerResult>() == 312). This fixed-size buffer presumably has a size of channel_size × message_size, i.e. 0x4000 × 312 bytes ≈ 4.9 MiB (?). And that was slowing us down by ~70 ms on your machine? Is it because each thread allocates those 5 MiB?

In this changeset, you switch back to an unbounded channel, but WorkerResults are recycled. I understand that this gets rid of the pre-allocation hit. But what about long searches (when initialization time can be neglected)? You seem to indicate that we are still faster in those situations? Couldn't this optimization be applied to bounded crossbeam channels in general (for large message sizes)? Is the allocation/deallocation really so expensive that it is worth all of this overhead (creating a sender for each worker result, making an additional copy for each worker result when recycling, …)?

Don't get me wrong. I love that this works. I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?).

Would another allocator help? I'm not really knowledgeable here, but it seems to me like this whole memory-recycling part should/could be the job of a (special-purpose) allocator?

@tavianator (Collaborator, Author)

> Wow 🤩

:)

> I would like to understand this first before merging. The situation is the following: we have an MPSC scenario where the consumer is sometimes too slow to handle the incoming results (even though its only job is to print the paths to a console). In the past, we used an unbounded channel, which would lead to high memory usage because the channel was buffering all those results.

Exactly. It was actually fairly easy to cause this by stalling the receiver thread with something like fd | less.

> Then we switched to a bounded channel, but that came at a high initialization cost because the bounded channels pre-allocate a fixed-size buffer for the messages. The messages are WorkerResults (size_of::<WorkerResult>() == 312). This fixed-size buffer presumably has a size of channel_size × message_size, i.e. 0x4000 × 312 bytes ≈ 4.9 MiB (?).

Yeah, that times the number of threads:

fd/src/walk.rs, line 60 in 15329f9:

```rust
let (tx, rx) = bounded(0x4000 * config.threads);
```

> And that was slowing us down by ~70 ms on your machine? Is it because each thread allocates those 5 MiB?

I'm not sure exactly why bounded() channels are that slow to initialize, but the allocation happens up front, not in each thread, which is part of the problem.

> In this changeset, you switch back to an unbounded channel, but WorkerResults are recycled. I understand that this gets rid of the pre-allocation hit. But what about long searches (when initialization time can be neglected)? You seem to indicate that we are still faster in those situations?

Yeah, in my benchmarks it's universally faster. But I don't know the separate impacts of

- Switching to an unbounded channel
- Shrinking WorkerResult to WorkerMsg
- Re-using Box<WorkerResult> allocations

> Couldn't this optimization be applied to bounded crossbeam channels in general (for large message sizes)? Is the allocation/deallocation really so expensive that it is worth all of this overhead (creating a sender for each worker result, making an additional copy for each worker result when recycling, …)?

Actually, Sender::clone() is fairly cheap; it just bumps an atomic refcount. I think it would even be possible to do

```rust
pub struct WorkerMsg<'a> {
    inner: Option<ResultBox>,
    tx: &'a Sender<ResultBox>,
}
```

but I'd have to plumb the lifetime through more stuff. WorkerState would have to own the Senders, I think.

> Don't get me wrong. I love that this works. I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?).

Maybe it's possible? Keep in mind this has somewhat different semantics, because each thread is strictly limited to its own pool of 0x4000 WorkerResults. In the previous implementation, the capacity was shared between threads.

> Would another allocator help? I'm not really knowledgeable here, but it seems to me like this whole memory-recycling part should/could be the job of a (special-purpose) allocator?

It's possible that https://github.com/microsoft/snmalloc would have some of the same benefits. But here's the thing: this approach needs the receiver thread to somehow tell the sender thread that it has handled a WorkerResult. My first thought was to use a semaphore to limit the number of WorkerResults allocated. But a single semaphore doesn't scale very well, so then I figured I could have a separate semaphore for each sender.

But actually, an SPSC channel scales better than a semaphore! That's because the head and tail indices can be on separate cache lines, whereas with a semaphore, both threads are always contending on the same counter. So for that reason I just went with the channel implementation (and because we already had it available; I'd have to go find/write a semaphore otherwise).
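For contrast, a minimal spinning semaphore (an illustrative sketch, not fd code) makes the contention point visible: acquire and release both hammer the same counter, so the cache line holding it ping-pongs between the sender's and receiver's cores, while an SPSC ring buffer can keep its head and tail indices on separate lines.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Both sides of the backpressure protocol touch `permits`, so this one
// word is contended between the two threads.
struct Semaphore {
    permits: AtomicUsize,
}

impl Semaphore {
    // Sender side: take a permit, spinning while none are available.
    fn acquire(&self) {
        loop {
            let n = self.permits.load(Ordering::Acquire);
            if n > 0 {
                if self
                    .permits
                    .compare_exchange_weak(n, n - 1, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
                {
                    return;
                }
            } else {
                std::thread::yield_now(); // starved: let the receiver run
            }
        }
    }

    // Receiver side: hand a permit back after processing a result.
    fn release(&self) {
        self.permits.fetch_add(1, Ordering::Release);
    }
}
```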

@tavianator (Collaborator, Author)

> I think it would even be possible to do
>
> ```rust
> pub struct WorkerMsg<'a> {
>     inner: Option<ResultBox>,
>     tx: &'a Sender<ResultBox>,
> }
> ```
>
> but I'd have to plumb the lifetime through more stuff. WorkerState would have to own the Senders, I think.

Update: I just tried a hacky version of this, and it wasn't appreciably faster.

@tavianator marked this pull request as draft on November 1, 2023 at 20:52
@tmccombs (Collaborator) commented Nov 2, 2023

> Re-using Box<WorkerResult> allocations

I suspect this is a pretty big component of it, unless the workers generate results faster than the receiver can process them and we hit the maximum amount of allocation.

It probably also helps that the allocations for the channels now happen on the individual threads instead of in the spawning thread.

Looking at the code for crossbeam, it isn't just that we have to allocate the memory for the channel up front; we also have to initialize it, and that can't be done with memset or an equivalent.

> I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?)

I think it could be used in general: basically, implement a bounded channel as two unbounded channels plus an atomic counter for the number of items allocated, and have the receiver pass each slot back after reading the value out of it. I'm not sure that would universally improve performance, but it does have a couple of advantages: memory is allocated lazily, and it would be possible to dynamically change the size of the bound.
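A sketch of that idea (assumed names; this is not code from this PR or from crossbeam):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use crossbeam_channel::{unbounded, Receiver, Sender};

pub struct LazySender<T> {
    data_tx: Sender<Box<T>>,     // full boxes flow to the receiver
    slots_rx: Receiver<Box<T>>,  // empty boxes flow back to the senders
    allocated: Arc<AtomicUsize>, // allocation budget shared by all senders
    cap: usize,
}

pub struct LazyReceiver<T> {
    data_rx: Receiver<Box<T>>,
    slots_tx: Sender<Box<T>>,
}

pub fn lazy_bounded<T>(cap: usize) -> (LazySender<T>, LazyReceiver<T>) {
    let (data_tx, data_rx) = unbounded();
    let (slots_tx, slots_rx) = unbounded();
    let allocated = Arc::new(AtomicUsize::new(0));
    (
        LazySender { data_tx, slots_rx, allocated, cap },
        LazyReceiver { data_rx, slots_tx },
    )
}

impl<T> LazySender<T> {
    pub fn send(&self, value: T) {
        let boxed = if let Ok(mut slot) = self.slots_rx.try_recv() {
            *slot = value; // reuse a recycled allocation
            slot
        } else if self.allocated.fetch_add(1, Ordering::Relaxed) < self.cap {
            Box::new(value) // allocate lazily while under the cap
        } else {
            // Over budget: undo the reservation, then block until the
            // receiver hands a slot back (this is the backpressure).
            self.allocated.fetch_sub(1, Ordering::Relaxed);
            let mut slot = self.slots_rx.recv().expect("receiver gone");
            *slot = value;
            slot
        };
        self.data_tx.send(boxed).expect("receiver gone");
    }
}

impl<T: Default> LazyReceiver<T> {
    pub fn recv(&self) -> Option<T> {
        let mut boxed = self.data_rx.recv().ok()?;
        let value = std::mem::take(&mut *boxed); // read the message out
        let _ = self.slots_tx.send(boxed);       // recycle the allocation
        Some(value)
    }
}
```

All of LazySender's fields are cheaply clonable, so a Clone impl would let multiple producers share the same allocation budget, and storing cap in an atomic as well would make the bound adjustable at runtime.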

@tavianator force-pushed the unbounded branch 2 times, most recently from 76e8437 to 0fbbaae on November 2, 2023 at 14:28
@tavianator marked this pull request as ready for review on November 2, 2023 at 14:28
@tavianator (Collaborator, Author)

> Looking at the code for crossbeam, it isn't just that we have to allocate the memory for the channel up front; we also have to initialize it, and that can't be done with memset or an equivalent.

Well, it could be done with memset() if they changed their representation slightly, e.g.:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

/// A slot in a channel.
struct Slot<T> {
    /// The current stamp.
    stamp: AtomicUsize,
    /// The message in this slot.
    msg: UnsafeCell<MaybeUninit<T>>,
}

impl<T> Slot<T> {
    fn new() -> Self {
        Self {
            // Store stamps relative to the slot index, so that an
            // all-zeroes buffer is a valid initial state.
            stamp: AtomicUsize::new(0),
            msg: UnsafeCell::new(MaybeUninit::zeroed()),
        }
    }
}

// ...

let slot = unsafe { self.buffer.get_unchecked(index) };
let stamp = slot.stamp.load(Ordering::Acquire) + index;
```

And if they did that they could allocate the whole buffer with mmap() and get lazily-initialized zero pages. I use a similar trick in bfs to initialize a whole linked list from zeroed memory: https://github.com/tavianator/bfs/blob/b2ab7a151fca517f4879e76e626ec85ad3de97c7/src/alloc.c#L63-L88

@sharkdp (Owner) commented Nov 2, 2023

I did some benchmarks comparing current master (8bbbd76) with this branch (d588971).

I can confirm the huge increase in startup speed (hyperfine -w5 -N -L version master,1414 "./fd-{version} -u . /tmp/empty" --export-markdown -):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -u . /tmp/empty` | 25.4 ± 10.0 | 18.1 | 54.7 | 7.24 ± 3.34 |
| `./fd-1414 -u . /tmp/empty` | 3.5 ± 0.9 | 2.6 | 9.5 | 1.00 |

Unfortunately, there seems to be a large regression for longer searches with many results (hyperfine -w5 -N -L version master,1414 "./fd-{version} -u . /folder/with/3M/files" --export-markdown -):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -u . /home/ped1st/workspace` | 841.0 ± 125.2 | 737.2 | 1069.5 | 1.00 |
| `./fd-1414 -u . /home/ped1st/workspace` | 1386.5 ± 18.0 | 1358.8 | 1411.5 | 1.65 ± 0.25 |

Edit: a quick perf profile seems to indicate that the additional sending back of messages (?) could be the issue here:

[perf profile screenshot]

@tavianator (Collaborator, Author)

Indeed... most of my benchmarks use the pattern ^$, which never matches, so nothing ever gets sent over the channels! If I print paths then I can reproduce it:

tavianator@tachyon $ hyperfine -w2 fd{,-{master,unbounded}}" -u --search-path ~/code/bfs/bench/corpus/chromium"
Benchmark 1: fd -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):      2.570 s ±  0.013 s    [User: 10.525 s, System: 99.826 s]
  Range (min … max):    2.550 s …  2.590 s    10 runs
 
Benchmark 2: fd-master -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):     796.4 ms ±  15.2 ms    [User: 11590.2 ms, System: 3956.6 ms]
  Range (min … max):   773.4 ms … 820.2 ms    10 runs
 
Benchmark 3: fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):      1.065 s ±  0.065 s    [User: 7.692 s, System: 5.038 s]
  Range (min … max):    0.986 s …  1.162 s    10 runs
 
Summary
  fd-master -u --search-path ~/code/bfs/bench/corpus/chromium ran
    1.34 ± 0.09 times faster than fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
    3.23 ± 0.06 times faster than fd -u --search-path ~/code/bfs/bench/corpus/chromium

Let me try a non-hacky version of #1414 (comment) ...

@tavianator (Collaborator, Author)

I just pushed a couple more commits with different strategies, including passing senders by reference and my own little semaphore implementation. Unfortunately, it seems like most of the overhead is actually from the unbounded channel implementation itself, i.e. initialization time is faster but sending is slower. Let me try a different approach.

@tmccombs (Collaborator) commented Nov 2, 2023

I wonder if it would be worthwhile to try to reduce the initialization overhead in crossbeam channel.

@tmccombs (Collaborator) commented Nov 4, 2023

A couple of other ideas:

- We could reduce the size of the channel, which would speed up initialization but might hurt performance if the bound is the bottleneck.
- Instead of having a single channel, we could have each sender thread create its own channel, which it passes back to the main thread via another channel. Then the receiver either selects from all the channels, or, if we spawn multiple receiver threads, each receiver processes a single input channel.

@tavianator (Collaborator, Author)

> A couple of other ideas:
>
> We could reduce the size of the channel, which would speed up initialization but might hurt performance if the bound is the bottleneck.

I've tried this before; usually it's a perf loss.

> Instead of having a single channel, we could have each sender thread create its own channel, which it passes back to the main thread via another channel.
>
> Then the receiver either selects from all the channels, or, if we spawn multiple receiver threads, each receiver processes a single input channel.

This was the "different approach" I mentioned above. It's much slower.

But I just tried another idea: instead of individual WorkerResults, send something like Arc<Mutex<Option<Vec<WorkerResult>>>> over the channel. The senders keep adding to the same batch whenever they can, and the receiver thread drains whole batches at once (a sketch follows below). It's tremendously faster than everything else I've tried:

Benchmark 1: bfs bench/corpus/chromium                                                                                                                                                       
  Time (mean ± σ):     679.5 ms ±  23.1 ms    [User: 967.2 ms, System: 3165.0 ms]
  Range (min … max):   648.4 ms … 710.3 ms    10 runs                                  
                                                                                              
Benchmark 2: find bench/corpus/chromium                                                                                                                                                      
  Time (mean ± σ):      3.213 s ±  0.020 s    [User: 0.559 s, System: 2.602 s]                                                                                                               
  Range (min … max):    3.181 s …  3.245 s    10 runs                                                                                                                                        
                                                                                              
Benchmark 3: fd -u --search-path bench/corpus/chromium
  Time (mean ± σ):      2.637 s ±  0.011 s    [User: 10.695 s, System: 99.795 s]              
  Range (min … max):    2.621 s …  2.654 s    10 runs                    
                                                                                              
Benchmark 4: fd-master -u --search-path bench/corpus/chromium     
  Time (mean ± σ):     767.8 ms ±  36.4 ms    [User: 10885.6 ms, System: 3892.2 ms]
  Range (min … max):   725.8 ms … 817.7 ms    10 runs                       
                                                                                              
Benchmark 5: fd-batch -u --search-path bench/corpus/chromium                                  
  Time (mean ± σ):     293.2 ms ±   6.7 ms    [User: 2767.8 ms, System: 3095.5 ms]            
  Range (min … max):   282.3 ms … 302.3 ms    10 runs                      
                                                                                              
Summary                                                                                       
  fd-batch -u --search-path bench/corpus/chromium ran                                         
    2.32 ± 0.09 times faster than bfs bench/corpus/chromium                                   
    2.62 ± 0.14 times faster than fd-master -u --search-path bench/corpus/chromium            
    8.99 ± 0.21 times faster than fd -u --search-path bench/corpus/chromium               
   10.96 ± 0.26 times faster than find bench/corpus/chromium

I'll put up a PR for it in a bit.
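A rough sketch of the batching idea (Batcher and the other names here are assumptions; the actual follow-up PR may look different):

```rust
use std::sync::{Arc, Mutex};

use crossbeam_channel::{Receiver, Sender};

// Stand-in for fd's real WorkerResult.
type WorkerResult = std::path::PathBuf;

type Batch = Arc<Mutex<Option<Vec<WorkerResult>>>>;

// Each sender keeps appending to its current batch until the receiver
// steals the Vec; only then does it pay for another channel send.
struct Batcher {
    tx: Sender<Batch>,
    current: Batch,
}

impl Batcher {
    fn new(tx: Sender<Batch>) -> Self {
        Self { tx, current: Arc::new(Mutex::new(None)) }
    }

    fn send(&mut self, result: WorkerResult) {
        let mut guard = self.current.lock().unwrap();
        if let Some(batch) = guard.as_mut() {
            // The receiver hasn't drained this batch yet: appending is
            // enough, and no channel traffic is needed.
            batch.push(result);
            return;
        }
        // The receiver took the Vec (or none exists yet): start a new
        // batch and announce it with a single channel send.
        *guard = Some(vec![result]);
        drop(guard);
        let _ = self.tx.send(Arc::clone(&self.current));
    }
}

// The receiver drains one whole batch per channel message.
fn drain(rx: Receiver<Batch>) {
    for handle in rx {
        let batch = handle.lock().unwrap().take();
        for result in batch.into_iter().flatten() {
            println!("{}", result.display());
        }
    }
}
```

Each sender pays for a channel send only when the receiver has already stolen its current Vec; otherwise, appending under the mutex is all that happens, which amortizes the message-passing overhead over an entire batch.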
