remote: improve `traverse`/`no_traverse` behavior #3488

pmrowla · 2020-03-17T06:03:36Z

Currently for push/pull operations, we either fetch the full list of all files in the remote in order to determine what local files already exist in the remote cache, or we individually check whether or not each local file exists in the remote cache (no_traverse method).
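To make the two strategies concrete, here is a minimal sketch using a hypothetical in-memory `FakeRemote`. The class and function names are illustrative assumptions and are not DVC's actual API. A real remote would also page through a full listing over many requests, which this sketch counts as a single call.

```python
class FakeRemote:
    """Hypothetical stand-in for a remote cache (not DVC's real class)."""

    def __init__(self, checksums):
        self._checksums = set(checksums)
        self.calls = 0  # number of simulated remote calls

    def list_all(self):
        # Full listing: one logical call here, many paged requests in practice.
        self.calls += 1
        return set(self._checksums)

    def exists(self, checksum):
        # Single existence check for one cache entry.
        self.calls += 1
        return checksum in self._checksums


def exists_traverse(remote, local_checksums):
    """Fetch the full remote listing, then intersect with the local set."""
    return set(local_checksums) & remote.list_all()


def exists_no_traverse(remote, local_checksums):
    """Query the remote once per local checksum."""
    return {c for c in local_checksums if remote.exists(c)}
```

Both functions return the same set. They differ only in how much work the remote does: the first scales with the remote cache size, the second with the local file count.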

As explained in #2044, the first approach works best when the local cache is large and the remote cache is small, while the no_traverse approach works best when the local cache is small and the remote cache is large.

Currently, we use the no_traverse method by default for all remotes except gdrive (discounting remotes with overridden cache_exists() methods, like ssh and local). For gdrive, no_traverse can be enabled via a config option.

Rather than relying on the user to explicitly specify the no_traverse option, it is possible for us to take some steps towards determining which method to use during a push/pull operation.

Since files in the remote cache will be evenly distributed (according to md5 hash), we can fetch a subset of files from the remote cache (rather than the full cache all at once), and use the size of that subset to estimate the size of the entire remote cache. For example, if we only fetch cache entries under a single parent directory (a.k.a. the first byte of the md5 hash for those entries), the total number of files on the remote would be roughly 256 * len(subset_results).
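The estimate can be sketched as follows. This is an illustration under assumptions, not the eventual implementation: `estimate_remote_size` and `count_with_prefix` are hypothetical helpers, and the remote listing is simulated with md5 hashes of generated strings.

```python
import hashlib


def estimate_remote_size(subset_count, prefix_hex_len=2):
    """Scale the number of entries found under one hash prefix.

    A prefix of 2 hex characters is one byte, so there are
    16 ** 2 == 256 possible prefixes of that length.
    """
    return subset_count * 16 ** prefix_hex_len


def count_with_prefix(checksums, prefix):
    """Simulate listing only the entries under a single prefix directory."""
    return sum(1 for c in checksums if c.startswith(prefix))


def simulated_remote(n):
    """Hypothetical remote contents: md5 hex digests of n distinct strings."""
    return [hashlib.md5(str(i).encode()).hexdigest() for i in range(n)]
```

For example, `estimate_remote_size(count_with_prefix(simulated_remote(100000), "00"))` should land in the neighborhood of 100000. The estimate is noisier for small remotes, where a single prefix may hold only a handful of entries.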

From there, we can determine whether or not to use the no_traverse method, by comparing the local file count to the estimated remote file count.
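A minimal sketch of that comparison is shown below. The function name and the `factor` parameter are assumptions for illustration; the issue does not specify an exact threshold.

```python
def should_use_no_traverse(local_count, remote_estimate, factor=1.0):
    """Pick per-file existence checks when the local set is the smaller one.

    `factor` is a hypothetical tuning knob: values below 1.0 favor a full
    traverse, values above 1.0 favor per-file checks.
    """
    return local_count < remote_estimate * factor


# Hypothetical numbers: 10 entries found under one prefix directory,
# so the remote is estimated to hold 256 * 10 entries.
remote_estimate = 256 * 10
use_no_traverse_small = should_use_no_traverse(50, remote_estimate)
use_no_traverse_large = should_use_no_traverse(500000, remote_estimate)
```

With these numbers, a push of 50 local files would use per-file checks, while a push of 500000 local files would fetch the full remote listing instead.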


pmrowla added the enhancement Enhances DVC label Mar 17, 2020

pmrowla self-assigned this Mar 17, 2020

efiop added the p1-important Important, aka current backlog of things to do label Mar 17, 2020

pmrowla mentioned this issue Mar 18, 2020

remote: Optimize traverse/no_traverse behavior #3501

Merged

11 tasks

efiop closed this as completed in #3501 Mar 25, 2020

pmrowla mentioned this issue Jan 5, 2023

_list_oids_traverse is much slower than _list_oids. iterative/dvc-objects#178

Closed
