gc: parallelize garbage collection #5961

Closed
isidentical opened this issue May 4, 2021 · 13 comments · Fixed by iterative/dvc-data#244
Labels
A: gc (related to garbage collection) · performance (improvement over resource / time consuming tasks)

Comments

@isidentical
Contributor

It seems like there is nothing blocking this, and for cloud remotes it could mean up to a 16-20x speedup (dvc gc -c). For some motivating numbers: removing ~1000 cache files from S3 currently takes about 20-25 minutes.

@isidentical isidentical added the performance (improvement over resource / time consuming tasks) label May 4, 2021
@isidentical isidentical self-assigned this May 4, 2021
@skshetry
Member

skshetry commented May 4, 2021

@isidentical, is this required for fsspec? If not, I'd prefer that we prioritize #4218 over this performance work. I'm not saying performance shouldn't be a priority, and I don't mean to discourage you, but I feel strongly that we should have a way to remove a file from the cache.

@isidentical isidentical removed their assignment May 25, 2021
@isidentical
Contributor Author

@skshetry is #4218 a blocker for this issue? I don't have the whole context, but it seems related to the used-cache calculation, whereas what this issue proposes is speeding up the step after that (running the fs.remove() calls concurrently).
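
For context, a minimal sketch of what "running fs.remove() calls concurrently" could look like with a plain thread pool over an fsspec filesystem (the helper name and default job count are illustrative, not DVC's actual implementation):

from concurrent.futures import ThreadPoolExecutor

def remove_concurrently(fs, paths, jobs=16):
    # Hypothetical helper: `paths` are already-resolved object keys, so no
    # glob expansion is needed; just issue the DELETE requests in parallel.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # fs.rm_file() deletes a single object; mapping it over the pool sends
        # the requests concurrently instead of one at a time.
        list(pool.map(fs.rm_file, paths))

# Example usage against an S3 remote (bucket/prefix are placeholders):
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   remove_concurrently(fs, fs.find("my-bucket/dvc-cache"))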

Another occurrence we hit during my support duty: dvc gc -c being very slow (taking 95+ hours): https://discord.com/channels/485586884165107732/563406153334128681/846717782145368134

@skshetry
Member

@skshetry is #4218 a blocker for this issue?

@isidentical, sorry for the confusion; no, it's not related. I was just trying to push #4218 up in prioritization. Please feel free to work on this. 🙂

Regarding #4218, gc as a whole has been stuck in discussion for more than a year (see #5928). Internally, it might already be possible to implement it (even more so after #6008 with filter).

@daavoo daavoo added the A: gc (related to garbage collection) label May 5, 2022
@SolomidHero

I have 6M items in S3 storage and wanted to dvc gc them. It never finished, not even the querying stage.

@pmrowla pmrowla changed the title from "gc: pararellize garbage collection" to "gc: parallelize garbage collection" Nov 12, 2022
@daavoo
Contributor

daavoo commented Dec 21, 2022

Coming from #8549 .

I have faced similar performance issues (on the latest version). It gets worse with a larger number of files, but it is already a real pain on the order of 1000 files.

Even for just 100 files, there is a lot of overhead added:

  • AWS
# List
$ aws s3 ls s3://${BUCKET} --recursive
real    0m1.247s                                                                                                         
user    0m0.327s                                                                                                         
sys     0m0.072s
# Wipe
$ aws s3 rm s3://${BUCKET} --recursive
real    0m3.570s                                                                                                         
user    0m0.755s      
sys     0m0.128s
  • DVC

Note: in the reproduction script below, I have tried to minimize the overhead from unrelated operations (e.g. local gc); you can verify in the attached profile that most of the time is spent on our equivalents of ls and rm.

$  time dvc gc -f -w -c
real    2m52.810s
user    0m5.885s
sys     0m0.790s
Reproduction script
import random
from pathlib import Path

data = Path("data")
data.mkdir(exist_ok=True)

for i in range(100):
    n = random.random()
    file = data / str(n)
    file.write_text(str(random.random()))
#!/bin/bash
BUCKET=diglesia-gc-testing
aws s3 rm s3://$BUCKET --recursive
rm -rf tmp
mkdir tmp

cp create_data.py tmp
cd tmp

git init
dvc init
dvc remote add -d myremote s3://$BUCKET            
git add .
git commit -m "init"

python create_data.py
dvc add data
# Just because it is faster than dvc push
aws s3 cp .dvc/cache s3://$BUCKET --recursive
# Don't spend time on local gc
rm -rf data
rm -rf .dvc/cache

python create_data.py
dvc add data
git add data.dvc
git commit -m "track data"

time dvc gc -f -w -c
dvc doctor
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.5 on macOS-10.16-x86_64-i386-64bit
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.2.2
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Viztracer profile (see https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-viztracer-data):

viztracer.dvc-20221221_185659.json.zip

@daavoo
Contributor

daavoo commented Dec 27, 2022

Using iterative/dvc-data#244, for 1000 files the overhead is significantly reduced (from ~19 minutes down to ~8 minutes):

$ time dvc gc -f -w -c
real    18m54.253s
$ time dvc gc -f -w -c
real    8m19.650s

However, there is still a ridiculous amount of overhead compared to the aws CLI:

$ time aws s3 ls s3://${BUCKET} --recursive
real    0m2.702s
$ time aws s3 rm s3://${BUCKET} --recursive
real    0m23.851s

For the rm part, checking the profiles shows that all of the overhead is introduced by _expand_path in https://github.com/fsspec/s3fs/blob/5917684f96226ee855921d1c614d21cc46b3edeb/s3fs/core.py#L1836 (which in this case is just a very expensive no-op for us).

@daavoo
Contributor

daavoo commented Dec 28, 2022

For the rm part, checking the profiles shows that all of the overhead is introduced by _expand_path in https://github.com/fsspec/s3fs/blob/5917684f96226ee855921d1c614d21cc46b3edeb/s3fs/core.py#L1836 (which in this case is just a very expensive no-op for us).

A quick and dirty change removing the _expand_path calls (which in our case have no effect) makes our rm operation on par with aws s3 rm, so no overhead is introduced.

I still need to discuss after vacation how to properly address this in fsspec, but IMO it makes sense to be able to bypass _expand_path if you know you don't need it (as in our use case).
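
To make the cost concrete, a rough way to observe it (just a sketch assuming an s3fs filesystem and a placeholder bucket/prefix holding the cache objects; not DVC code):

import time

import s3fs

fs = s3fs.S3FileSystem()
keys = fs.find("my-bucket/dvc-cache")  # concrete object keys, no globs

t0 = time.time()
fs.expand_path(keys)  # roughly the step fs.rm() runs internally before deleting
print("expand_path:", time.time() - t0)

t0 = time.time()
fs.rm(keys)  # expands the paths again, then issues the actual deletes
print("rm:", time.time() - t0)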

All the overhead left now comes from our ls vs aws s3 ls, will look into it if I find some time.

@daavoo
Contributor

daavoo commented Dec 28, 2022

I checked the ls part.

I am very confused about the usage of _list_oids_traverse inside odb.all and why it is so slow. Either something is just wrong with the mix of ThreadPool and async fsspec, or there is some scenario where it's faster that I couldn't find.

Replacing _list_oids_traverse with _list_oids (the current behavior for "small" remotes) makes the operation dramatically faster for all the remote sizes I have tried (and for different numbers of jobs passed to _list_oids_traverse), even though it doesn't use multiple threads 🤷

Not using _list_oids_traverse in odb.all makes our gc -c fully on par with aws s3 ls + aws s3 rm.

For the same 1000 files setup above:

$ time dvc gc -f -w -c
real    0m7.169s

I think all the changes mentioned above make sense in general for any remote.

We need to properly discuss how to integrate these changes into fsspec (not calling _expand_path when it's not needed) and dvc-objects (_list_oids is just much faster than _list_oids_traverse).
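
For intuition, the two listing strategies being compared look roughly like this (a sketch with placeholder names, not the actual dvc-objects code, which also parses the listed paths into oids):

from concurrent.futures import ThreadPoolExecutor

def list_all_flat(fs, odb_path):
    # _list_oids-style: one recursive listing of the whole cache prefix;
    # on S3 this is just a handful of paginated LIST requests.
    return fs.find(odb_path)

def list_all_traverse(fs, odb_path, jobs=16):
    # _list_oids_traverse-style: list each two-hex-digit prefix (00/ .. ff/)
    # separately in a thread pool and merge the results.
    prefixes = [f"{odb_path}/{i:02x}" for i in range(256)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        chunks = pool.map(fs.find, prefixes)
    return [path for chunk in chunks for path in chunk]

Since the per-prefix find() calls from the worker threads all funnel through fsspec's single event loop anyway, the thread pool may mostly add coordination overhead rather than real parallelism, which could explain the behavior above.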

@dberenbaum
Collaborator

dberenbaum commented Jan 3, 2023

@daavoo Do you know how performance is on other filesystems or clouds?

Edit: From a quick look, the base AsyncFileSystem in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py uses similar logic for _rm and _expand_path, so any other filesystem that inherits from it is likely to behave similarly. It seems _expand_path will call at least _find and _exists, which I would guess are both relatively expensive operations on cloud storage.

@daavoo
Contributor

daavoo commented Jan 4, 2023

@daavoo Do you know how performance is on other filesystems or clouds?

There are 3 changes:

  1. Use batch delete (gc: Pass list of paths to fs.remove. dvc-data#244)

This depends on whether the underlying filesystem implements batch delete or not.
For example, s3fs does implement it, so the improvements are significant.
For filesystems that just do a for loop over the paths, it won't make much of a difference. (See the sketch after this list for what batch delete looks like on S3.)

  2. Avoid unnecessary calls to expand_path.

This is what you commented on in your edit.

It affects all filesystems following the official spec. The impact grows with the number of objects passed; it is not only that the extra calls are expensive on cloud storage, but also that they block the async thread.

  3. Avoid list_oids_traverse (_list_oids_traverse is much slower than _list_oids. dvc-objects#178)

This affects all filesystems, as it's on our side.
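
As a rough illustration of what batch delete buys on S3 (a sketch using boto3 directly rather than DVC's code path; the bucket name and keys are placeholders):

import boto3

s3 = boto3.client("s3")

def bulk_delete(bucket, keys):
    # DeleteObjects accepts up to 1000 keys per request, so removing N objects
    # costs roughly N / 1000 round trips instead of N individual DELETE calls.
    for i in range(0, len(keys), 1000):
        chunk = keys[i : i + 1000]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in chunk]},
        )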


I did a quick test for Azure, which doesn't implement bulk delete and just does a plain for loop.

For 1000 files:

  • Before
$ time dvc gc -f -w -c
real    9m48.269s
  • After
$ time dvc gc -f -w -c
real    5m02.686s

Given that there are no gains from "batch delete" there, all of the improvement comes from points 2 and 3.

@dberenbaum
Collaborator

@daavoo Could you compare those results to az cli?

@daavoo
Contributor

daavoo commented Jan 10, 2023

@daavoo Could you compare those results to az cli?

  • AZ
# LS
$ az storage blob list -c gc-testing
real    0m2.078s
# RM
$ az storage blob delete-batch -s gc-testing
real    0m17.557s
  • DVC

This uses an unmerged patch I have sent upstream: fsspec/adlfs#383

$ dvc gc -f -w -c
real    0m3.909s

I think it's actually faster because I might be misusing the az CLI and/or because the upstream patch used by DVC handles concurrency better.

@dberenbaum
Collaborator

So my takeaway is that bulk delete matters a lot 😄
