add EXTRA_TRAVERSE_WEIGHT_MULTIPLIER environment var #291

Open · wants to merge 1 commit into main

Conversation

zldrobit

I hosted DVC remote storage on an S3-compatible server. The server has only rotational HDDs, and S3.ListObjectsV2 is very slow, causing _list_oids_traverse() to run for a long time. It seems the IOPS of my server is much lower than that of an AWS S3 bucket. In this PR, I add an environment var, EXTRA_TRAVERSE_WEIGHT_MULTIPLIER, that lets users set an extra weight multiplier when the listing and traverse weights are compared in oids_exist(), so they can customize the traverse policy for S3-compatible services with different hard drive performance. E.g., setting EXTRA_TRAVERSE_WEIGHT_MULTIPLIER=1000 disables traverse for me, and the elapsed time of dvc push drops from nearly two hours to less than 20 minutes (remote size = 500k).
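
For illustration, a minimal sketch of the idea in Python (not necessarily the exact diff in this commit; scaled_traverse_weight is a hypothetical helper):

import os

def scaled_traverse_weight(traverse_pages: float, base_multiplier: float) -> float:
    # Hypothetical helper: fold a user-supplied multiplier into the traverse
    # weight so slow S3-compatible backends can bias DVC toward object_exists
    # and away from a full traverse. Defaults to 1, i.e. unchanged behavior.
    extra = float(os.environ.get("EXTRA_TRAVERSE_WEIGHT_MULTIPLIER", "1"))
    return traverse_pages * base_multiplier * extra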

Related issues:
#178
minio/minio#19532

@shcheklein
Member

Thanks @zldrobit!

Could you please give us a bit more details first? Also, tbh, I'm not very comfortable introducing a whole env variable specifically for this case. If we decide it is worth it, I would rather do this as part of the DVC remote config. But I'm open to other opinions.

  • How many files were you pushing? I don't remember all the details, but I assume the way it picks the approach (traverse vs check) should also depend on the number of objects we are trying to push.
  • Does it happen only the first time or for every dataset update?
  • S3.ListObjectsV2 - do we essentially run 500 API calls, and those are taking 2 hours? So, it's 10s+ per call? That sounds a bit extreme 🤔

@zldrobit
Author

Of course. And I agree with you that an option in the DVC remote config is better than an environment var.

> How many files were you pushing? I don't remember all the details, but I assume the way it picks the approach (traverse vs check) should also depend on the number of objects we are trying to push.

I only pushed one text file of several bytes just for testing, but the len(oids) that needs to be checked is around 30k. The original dataset already contains 30k-40k objects pushed to the remote. For convenience, I've excerpted the code comparing len(oids) with the traverse weight below:

traverse_pages = remote_size / self.fs.LIST_OBJECT_PAGE_SIZE
# For sufficiently large remotes, traverse must be weighted to account
# for performance overhead from large lists/sets.
# From testing with S3, for remotes with 1M+ files, object_exists is
# faster until len(oids) is at least 10k~100k
if remote_size > self.fs.TRAVERSE_THRESHOLD_SIZE:
    traverse_weight = traverse_pages * self.fs.TRAVERSE_WEIGHT_MULTIPLIER
else:
    traverse_weight = traverse_pages
if len(oids) < traverse_weight and not always_traverse:
    logger.debug(
        "Large remote (%r oids < %r traverse weight), "
        "using object_exists for remaining oids",
        len(oids),
        traverse_weight,
    )
    remaining_oids = oids - remote_oids
    ret = list(oids & remote_oids)
    callback = partial(progress, "querying", len(remaining_oids))
    ret.extend(
        wrap_iter(self.list_oids_exists(remaining_oids, jobs=jobs), callback)
    )
    return ret
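
A back-of-envelope check with my numbers (remote ≈ 500k objects, len(oids) ≈ 30k), assuming the dvc_objects defaults LIST_OBJECT_PAGE_SIZE = 1000 and TRAVERSE_WEIGHT_MULTIPLIER = 20 (worth double-checking against your version):

remote_size = 500_000
n_oids = 30_000
traverse_pages = remote_size / 1000      # 500 pages
traverse_weight = traverse_pages * 20    # 10000, assuming the remote exceeds
                                         # TRAVERSE_THRESHOLD_SIZE
print(n_oids < traverse_weight)          # False -> dvc picks a full traverse
print(n_oids < traverse_weight * 1000)   # True  -> object_exists instead, with
                                         # EXTRA_TRAVERSE_WEIGHT_MULTIPLIER=1000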

> Does it happen only the first time or for every dataset update?

It happens only the first time I push. After the first push, the len(oids) that needs to be compared is the actual number of new objects (e.g. 2, including the dir object), so the traverse weight is bigger. In this case, dvc won't traverse the remote directories.

> S3.ListObjectsV2 - do we essentially run 500 API calls, and those are taking 2 hours? So, it's 10s+ per call? That sounds a bit extreme

The remote contains dvc v2 and v3 folders, i.e. /[0-9a-f][0-9a-f] and /files/md5/[0-9a-f][0-9a-f]. At least 2 x 256 = 512 ListObjectsV2 API calls have to be made. I benchmarked the S3.ListObjectsV2 API, and the server-side log shows one of them took 18 seconds, as below (minio/minio#19532 (comment)):

2024-04-17T17:02:37.492 [200 OK] s3.ListObjectsV2 minio.utai.cn:9000/dvc-remote-storage/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=files%2Fmd5%2F09%2F  172.21.0.3       18.3956s     ⇣  18.393351157s    ↑ 93 B ↓ 36 KiB

The S3-compatible server I hosted translates each S3.ListObjectsV2 call into multiple stat() system calls on a slow HDD RAID. Assuming each stat() takes 15-25 ms, calling thousands of stat()s sequentially takes tens of seconds. Moreover, I reproduced the elapsed time by executing the ls -R command in the dvc remote cache directories (e.g. /00, /01, and /02). This shows my S3-compatible server is a lot slower than official AWS S3 in terms of listing performance.
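
As a rough sanity check of that estimate (the per-prefix object count is an approximation from ~500k objects over 256 prefixes):

# ~500k objects over 256 two-character prefixes means each ListObjectsV2
# call triggers on the order of 2000 sequential stat() calls.
objects_per_prefix = 500_000 // 256                      # ~1953
for stat_latency in (0.015, 0.025):                      # 15-25 ms per stat()
    print(f"{objects_per_prefix * stat_latency:.0f} s")  # 29 s ... 49 s

That is the same order of magnitude as the 18 s observed in the MinIO log above.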

@shcheklein
Member

> It happens only the first time I push. After the first push, the len(oids) that needs to be compared is the actual number of new objects (e.g. 2, including the dir object), so the traverse weight is bigger. In this case, dvc won't traverse the remote directories.

So, do you expect this warm-up and full traverse to happen often? How big of a priority is this for you, is it a blocker? Could you describe the workflow? (we can even jump on a call and chat, maybe I can suggest a different way to handle this)

I'm a bit uncomfortable, to be honest, complicating the algorithm or introducing new env vars (we actually want to reconsider this and use local caching) because of a specific S3 implementation.

@zldrobit
Author

zldrobit commented Apr 22, 2024

> So, do you expect this warm-up and full traverse to happen often? How big of a priority is this for you, is it a blocker?

Not very often. My colleagues encountered a long dvc push after reinstalling the OS and pulling a dataset to a new path. In that case, we could reuse the warmed-up dataset to avoid the long push time. However, a dvc push inevitably happens when my colleagues exchange a dataset via a git/https link, which is a blocker in that situation.

> Could you describe the workflow? (we can even jump on a call and chat, maybe I can suggest a different way to handle this)

Yes, we have 3 training servers connected by a 1Gb network. They share a dvc cache backed by NFS. We also have a dvc remote with 70TB of storage, connected to the training servers over a 1Gb network. dvc add is always fast, while dvc push is slow the first time. The workflow is simple: my colleague adds several images and labels to the dataset (~40k objects) from time to time, and executes dvc add and dvc push to sync the dataset cache with the remote (~500k objects). Then he uses this new dataset to train a corresponding new model.

> I'm a bit uncomfortable, to be honest, complicating the algorithm or introducing new env vars (we actually want to reconsider this and use local caching) because of a specific S3 implementation.

I guess the file hierarchy in the dvc cache is the same as in the dvc remote storage. If this is true, we could avoid dvc push and manually sync the cache with the remote (see the sketch below), though this raises consistency concerns. IMO, many organizations have legacy low-cost HDDs, and it's very convenient for them to set up a local S3 server. If there were an additional remote config option to tune for hard disk performance, many dvc users would benefit from it.
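
For instance, a hypothetical one-off sync with the AWS CLI (bucket, endpoint, and cache path as in the push logs later in this thread); this bypasses DVC's status checking entirely, which is exactly the consistency concern:

import subprocess

# Mirror the local DVC cache straight to the S3 remote, assuming the cache
# and the remote share the same files/md5/<prefix>/... layout.
subprocess.run(
    [
        "aws", "s3", "sync",
        "/mnt/nfs/dvc-cache/files/md5",
        "s3://dvc-remote-storage/files/md5",
        "--endpoint-url", "https://minio.utai.cn:9000",
    ],
    check=True,
)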

@zldrobit
Author

Hi @shcheklein, friendly ping. It's been about 3 weeks since the last comment. Could you spare some time to advise us on improving the workflow, considering my last comment?

@shcheklein
Member

@zldrobit sorry for the delay.

> My colleagues encountered a long dvc push after reinstalling the OS and pulling a dataset to a new path. In that case, we could reuse the warmed-up dataset to avoid the long push time.

I assume this is rare and doesn't require a fix?

> However, a dvc push inevitably happens when my colleagues exchange a dataset via a git/https link, which is a blocker in that situation.

Could you clarify this please?

@zldrobit
Author

zldrobit commented Jun 7, 2024

Sorry for the late reply, too 😅

> However, a dvc push inevitably happens when my colleagues exchange a dataset via a git/https link, which is a blocker in that situation.

> Could you clarify this please?

Of course. When my colleague sends me a dataset via a git link containing .dvc files, it takes me a long time to execute dvc push to sync the incremental data (e.g., more than 40k files) to the remote dvc storage (MinIO S3). I guess this is because of the slow execution of the S3.ListObjectsV2 calls initiated by the dvc traverse function through s3fs/aiobotocore. It incurs a timeout, as in the log below:

2024-06-06 13:49:22,051 DEBUG: v3.49.0 (pip), CPython 3.10.12 on Linux-6.5.0-35-generic-x86_64-with-glibc2.35
2024-06-06 13:49:22,052 DEBUG: command: /home/fjc/.local/bin/dvc push -vvv
2024-06-06 13:49:22,052 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=None, targets=[], remote=None, all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=False, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-06 13:49:22,451 TRACE:    28.89 ms in collecting stages from /data/home/fjc/workspace/dvc-test
Collecting                                                                                                                                                                                                        |40.0k [00:03, 11.4kentry/s]2024-06-06 13:49:26,686 DEBUG: Preparing to transfer data from '/mnt/nfs/dvc-cache/files/md5' to 's3://dvc-remote-storage/files/md5'
2024-06-06 13:49:26,687 DEBUG: Preparing to collect status from 'dvc-remote-storage/files/md5'
2024-06-06 13:49:26,702 DEBUG: Collecting status from 'dvc-remote-storage/files/md5'
2024-06-06 13:49:26,703 DEBUG: Querying 1 oids via object_exists
2024-06-06 13:49:27,230 DEBUG: Estimated remote size: 573440 files
2024-06-06 13:49:27,232 DEBUG: Querying 40001 oids via traverse
Pushing
2024-06-06 13:57:43,593 ERROR: unexpected error - Read timeout on endpoint URL: "https://minio.utai.cn:9000/dvc-remote-storage?list-type=2&prefix=files%2Fmd5%2F06%2F&delimiter=&encoding-type=url"
Traceback (most recent call last):
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/httpsession.py", line 223, in send
    response = await session.request(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/home/fjc/.local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 966, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/fjc/.local/lib/python3.10/site-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 64, in run
    processed_files_count = self.repo.push(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/repo/push.py", line 140, in push
    push_transferred, push_failed = ipush(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push
    result = transfer(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
    dest_exists, dest_missing = status(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 454, in oids_exist
    return list(oids & set(wrap_iter(remote_oids, callback)))
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 35, in wrap_iter
    for index, item in enumerate(iterable, start=1):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 346, in _list_oids_traverse
    yield from self._list_oids(prefixes=traverse_prefixes, jobs=jobs)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 832, in find
    for result in fut.result():
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/executors.py", line 88, in batch_coros
    result = fut.result()
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 845, in _find
    out = await self._lsdir(path, delimiter="", prefix=prefix, **kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 720, in _lsdir
    async for c in self._iterdir(
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 770, in _iterdir
    async for i in it:
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/paginate.py", line 30, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/client.py", line 388, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/client.py", line 416, in _make_request
    return await self._endpoint.make_request(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 102, in _send_request
    while await self._needs_retry(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 264, in _needs_retry
    responses = await self._event_emitter.emit(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/hooks.py", line 66, in _emit
    response = await resolve_awaitable(handler(**kwargs))
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
    return await obj
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/retryhandler.py", line 107, in _call
    if await resolve_awaitable(self._checker(**checker_kwargs)):
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
    return await obj
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/retryhandler.py", line 126, in _call
    should_retry = await self._should_retry(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/retryhandler.py", line 165, in _should_retry
    return await resolve_awaitable(
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
    return await obj
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/retryhandler.py", line 174, in _call
    checker(attempt_number, response, caught_exception)
  File "/home/fjc/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/fjc/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 183, in _do_get_response
    http_response = await self._send(request)
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 287, in _send
    return await self.http_session.send(request)
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/httpsession.py", line 262, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://minio.utai.cn:9000/dvc-remote-storage?list-type=2&prefix=files%2Fmd5%2F06%2F&delimiter=&encoding-type=url"

2024-06-06 13:57:43,734 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-06-06 13:57:43,734 DEBUG: Removing '/data/home/fjc/workspace/.Azo47c6JIOJnL2qYVGdywg.tmp'
2024-06-06 13:57:43,734 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2024-06-06 13:57:43,734 DEBUG: Removing '/data/home/fjc/workspace/.Azo47c6JIOJnL2qYVGdywg.tmp'
2024-06-06 13:57:43,735 DEBUG: Removing '/data/home/fjc/workspace/.Azo47c6JIOJnL2qYVGdywg.tmp'
2024-06-06 13:57:43,735 DEBUG: Removing '/mnt/nfs/dvc-cache/files/md5/.Oe-yj30SzrxIEHcprHQAcA.tmp'
2024-06-06 13:57:43,778 DEBUG: Version info for developers:
DVC version: 3.49.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.1
        dvc_task = 0.4.0
        scmrepo = 3.3.1
Supports:
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.3.1, boto3 = 1.34.51)
Config:
        Global: /home/fjc/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: nfs4 on dvc-cache.utai.cn:/mnt/nfs/dvc-cache
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/17789e48ddee484862f237aa36a175a2

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-06-06 13:57:43,787 DEBUG: Analytics is disabled.
2024-06-06 13:57:43,787 TRACE: Process 391989 exiting with 255
sys:1: RuntimeWarning: coroutine 'S3FileSystem._find' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

I had to increase the read timeout to 3600 with dvc remote modify myremote read_timeout 3600, and then got another error:

2024-06-06 14:43:10,621 DEBUG: v3.49.0 (pip), CPython 3.10.12 on Linux-6.5.0-35-generic-x86_64-with-glibc2.35
2024-06-06 14:43:10,621 DEBUG: command: /home/fjc/.local/bin/dvc push -vvv
2024-06-06 14:43:10,621 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=None, targets=[], remote=None, all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=False, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-06 14:43:10,888 TRACE:    12.88 ms in collecting stages from /data/home/fjc/workspace/dvc-test
Collecting                                                                                                                                                                                                          |0.00 [00:00,    ?entry/s]2024-06-06 14:43:11,579 DEBUG: Preparing to transfer data from '/mnt/nfs/dvc-cache/files/md5' to 's3://dvc-remote-storage/files/md5'
2024-06-06 14:43:11,579 DEBUG: Preparing to collect status from 'dvc-remote-storage/files/md5'
2024-06-06 14:43:11,595 DEBUG: Collecting status from 'dvc-remote-storage/files/md5'
2024-06-06 14:43:11,600 DEBUG: Querying 1 oids via object_exists
2024-06-06 14:43:11,919 DEBUG: Estimated remote size: 573440 files
2024-06-06 14:43:11,920 DEBUG: Querying 40001 oids via traverse
Pushing
2024-06-06 14:59:16,831 ERROR: unexpected error - The difference between the request time and the server's time is too large.: An error occurred (RequestTimeTooSkewed) when calling the ListObjectsV2 operation: The difference between the request time and the server's time is too large.
Traceback (most recent call last):
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 720, in _lsdir
    async for c in self._iterdir(
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 770, in _iterdir
    async for i in it:
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/paginate.py", line 30, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/aiobotocore/client.py", line 408, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestTimeTooSkewed) when calling the ListObjectsV2 operation: The difference between the request time and the server's time is too large.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 64, in run
    processed_files_count = self.repo.push(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc/repo/push.py", line 140, in push
    push_transferred, push_failed = ipush(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push
    result = transfer(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
    dest_exists, dest_missing = status(
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 454, in oids_exist
    return list(oids & set(wrap_iter(remote_oids, callback)))
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 35, in wrap_iter
    for index, item in enumerate(iterable, start=1):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 346, in _list_oids_traverse
    yield from self._list_oids(prefixes=traverse_prefixes, jobs=jobs)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 832, in find
    for result in fut.result():
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/fjc/.local/lib/python3.10/site-packages/dvc_objects/executors.py", line 88, in batch_coros
    result = fut.result()
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 845, in _find
    out = await self._lsdir(path, delimiter="", prefix=prefix, **kwargs)
  File "/home/fjc/.local/lib/python3.10/site-packages/s3fs/core.py", line 733, in _lsdir
    raise translate_boto_error(e)
PermissionError: The difference between the request time and the server's time is too large.

2024-06-06 14:59:16,973 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-06-06 14:59:16,974 DEBUG: Removing '/data/home/fjc/workspace/.mcYMer0mAPkVxuPENyWGpg.tmp'
2024-06-06 14:59:16,974 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2024-06-06 14:59:16,974 DEBUG: Removing '/data/home/fjc/workspace/.mcYMer0mAPkVxuPENyWGpg.tmp'
2024-06-06 14:59:16,974 DEBUG: Removing '/data/home/fjc/workspace/.mcYMer0mAPkVxuPENyWGpg.tmp'
2024-06-06 14:59:16,974 DEBUG: Removing '/mnt/nfs/dvc-cache/files/md5/.A_cxKcYcaJlY32J5UBsYjw.tmp'
2024-06-06 14:59:17,004 DEBUG: Version info for developers:
DVC version: 3.49.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.1
        dvc_task = 0.4.0
        scmrepo = 3.3.1
Supports:
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.3.1, boto3 = 1.34.51)
Config:
        Global: /home/fjc/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: nfs4 on dvc-cache.utai.cn:/mnt/nfs/dvc-cache
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/17789e48ddee484862f237aa36a175a2

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-06-06 14:59:17,007 DEBUG: Analytics is disabled.
2024-06-06 14:59:17,007 TRACE: Process 477880 exiting with 255
sys:1: RuntimeWarning: coroutine 'S3FileSystem._find' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

The RequestTimeTooSkewed error comes from s3fs/aiobotocore while aiobotocore runs multiple async coroutines to traverse the remote storage filesystem. IIRC, this error happens when the S3 request time differs from the server's current time by more than 15 minutes, which is mandated by the S3 protocol. I guess aiobotocore or Python is not able to schedule all coroutines within a 15-minute window. After I decreased the number of concurrent jobs to 2 with the dvc push -j option, I could finally get around the problem, but the traverse took 30-60 minutes.

I understand it's not ideal to change the weight algorithm just for our use case, and it may introduce other problems (e.g., issuing too many single-file requests). Maybe I also have to improve the underlying MinIO S3 IOPS performance, or just use another type of remote storage (e.g., SFTP, HTTP, or WebDAV).
