remote: improve traverse/no_traverse behavior #3488

Closed
pmrowla opened this issue Mar 17, 2020 · 0 comments · Fixed by #3501
Assignees: pmrowla
Labels: enhancement (Enhances DVC), p1-important (Important, aka current backlog of things to do)

Comments

pmrowla commented Mar 17, 2020

Currently, for push/pull operations, we determine which local files already exist in the remote cache in one of two ways: we either fetch the full list of all files in the remote (traverse method), or we individually check whether each local file exists in the remote cache (no_traverse method).
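To make the two strategies concrete, here is a minimal sketch. The function names and the `list_remote` / `remote_has` callables are hypothetical stand-ins for whatever a remote actually exposes; they are not DVC's API.

```python
from typing import Callable, Iterable, Set


def exists_by_traverse(
    local_checksums: Iterable[str],
    list_remote: Callable[[], Iterable[str]],
) -> Set[str]:
    """Traverse: list the whole remote once, then intersect locally."""
    remote_checksums = set(list_remote())
    return set(local_checksums) & remote_checksums


def exists_by_query(
    local_checksums: Iterable[str],
    remote_has: Callable[[str], bool],
) -> Set[str]:
    """no_traverse: ask the remote about each local checksum in turn."""
    return {checksum for checksum in local_checksums if remote_has(checksum)}


# Toy remote held in memory; a real remote would issue network requests.
remote = {"aa11", "bb22", "cc33"}
local = ["aa11", "dd44"]

found_traverse = exists_by_traverse(local, lambda: remote)
found_query = exists_by_query(local, remote.__contains__)
```

Both functions return the same set; they differ only in cost. The first scales with the size of the remote, the second with the number of local files.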

As explained in #2044, the first approach works best for cases where the local cache is large, and the remote cache is small, and the no_traverse approach works best when the local cache is small and the remote is large.

We currently use the no_traverse method by default for all remotes except gdrive (discounting remotes with overridden cache_exists() methods, like ssh and local). For gdrive, no_traverse can be enabled via a config option.

Rather than relying on the user to explicitly specify the no_traverse option, we can take some steps towards determining which method to use during a push/pull operation.

Since files in the remote cache will be evenly distributed (according to md5 hash), we can fetch a subset of files from the remote cache (rather than the full cache all at once), and use the size of that subset to estimate the size of the entire remote cache. For example, if we only fetch cache entries under a single parent directory (a.k.a. the first byte of the md5 hash for those entries), the total number of files on the remote would be roughly 256 * len(subset_results).
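The estimate described above can be sketched as follows. This is an illustration only: `estimate_remote_size` and `list_prefix` are hypothetical names, and the remote is simulated in memory with md5 checksums of arbitrary strings.

```python
import hashlib
from typing import Callable, Iterable, List

# Two hex characters = one byte = 256 possible parent directories.
PREFIX_COUNT = 256


def estimate_remote_size(
    list_prefix: Callable[[str], Iterable[str]], prefix: str = "00"
) -> int:
    """Estimate the remote file count from a single prefix listing."""
    subset_results = list(list_prefix(prefix))
    return PREFIX_COUNT * len(subset_results)


# Simulated remote: md5 checksums, which spread evenly over the prefixes.
remote: List[str] = [
    hashlib.md5(str(i).encode()).hexdigest() for i in range(25600)
]


def list_prefix(prefix: str) -> List[str]:
    """Stand-in for listing a single parent directory on the remote."""
    return [checksum for checksum in remote if checksum.startswith(prefix)]


estimate = estimate_remote_size(list_prefix)
print(f"actual={len(remote)} estimated={estimate}")
```

The estimate is only approximate, since the number of entries under any one prefix varies around the average, but it should be accurate enough to tell a small remote from a large one.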

From there, we can determine whether or not to use the no_traverse method, by comparing the local file count to the estimated remote file count.
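One possible form of that comparison is sketched below. The function name and the `ratio` parameter are assumptions made for illustration; the issue does not specify a threshold, and a real implementation would likely tune it to the relative cost of listing versus per-file requests.

```python
def should_use_no_traverse(
    local_count: int, estimated_remote_count: int, ratio: float = 1.0
) -> bool:
    """Prefer per-file checks when the local set is small relative to the remote.

    `ratio` is a hypothetical tuning knob, not a value taken from DVC.
    """
    return local_count < ratio * estimated_remote_count


# Small local cache, large remote: check files one by one.
small_local_large_remote = should_use_no_traverse(10, 256 * 100)

# Large local cache, small remote: list the remote once instead.
large_local_small_remote = should_use_no_traverse(100_000, 256 * 1)
```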

pmrowla added the enhancement (Enhances DVC) label Mar 17, 2020
pmrowla self-assigned this Mar 17, 2020
efiop added the p1-important (Important, aka current backlog of things to do) label Mar 17, 2020