Currently for push/pull operations, we either fetch the full list of all files in the remote in order to determine what local files already exist in the remote cache, or we individually check whether or not each local file exists in the remote cache (no_traverse method).
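To make the two strategies concrete, here is a minimal sketch of both. The names `exists_by_traverse`, `exists_by_no_traverse`, `list_remote` and `remote_has` are hypothetical stand-ins for illustration, not the actual remote API.

```python
from typing import Callable, Iterable, Set


def exists_by_traverse(
    local_hashes: Iterable[str],
    list_remote: Callable[[], Iterable[str]],
) -> Set[str]:
    """Fetch the full remote listing once, then intersect it with the local hashes."""
    remote_hashes = set(list_remote())  # one large listing request
    return {h for h in local_hashes if h in remote_hashes}


def exists_by_no_traverse(
    local_hashes: Iterable[str],
    remote_has: Callable[[str], bool],
) -> Set[str]:
    """Ask the remote about each local hash individually (no_traverse)."""
    return {h for h in local_hashes if remote_has(h)}  # one request per local file
```

Both functions return the same set. They differ only in cost: the first scales with the size of the remote, the second with the number of local files.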
As explained in #2044, the first approach works best for cases where the local cache is large, and the remote cache is small, and the no_traverse approach works best when the local cache is small and the remote is large.
By default, we currently use the no_traverse method for all remotes except gdrive (discounting remotes with overridden cache_exists() methods, like ssh and local). For gdrive, no_traverse can be enabled via a configuration option.
Rather than relying on the user to explicitly specify the no_traverse option, it is possible for us to take some steps towards determining which method to use during a push/pull operation.
Since files in the remote cache will be evenly distributed (according to md5 hash), we can fetch a subset of files from the remote cache (rather than the full cache all at once), and use the size of that subset to estimate the size of the entire remote cache. For example, if we only fetch cache entries under a single parent directory (i.e. entries that share the same first byte of the md5 hash, one of 256 possible values), the total number of files on the remote would be roughly 256 * len(subset_results).
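The estimate could be sketched roughly as follows. `estimate_remote_size` and `list_prefix` are hypothetical names: `list_prefix` is assumed to be a callable that lists only the remote cache entries under one parent directory, and the default prefix `"00"` is an arbitrary choice.

```python
from typing import Callable, Iterable


def estimate_remote_size(
    list_prefix: Callable[[str], Iterable[str]],
    prefix: str = "00",
) -> int:
    """Estimate the total remote file count from a single parent directory.

    Assumes entries are spread evenly over the 256 possible values of the
    first md5 byte, so one directory holds about 1/256 of all files.
    """
    subset_results = list(list_prefix(prefix))
    return 256 * len(subset_results)
```

The estimate is coarse for small remotes, where a single directory may be empty or hold only a few entries, so it is best treated as an order-of-magnitude figure.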
From there, we can decide whether to use the no_traverse method by comparing the local file count to the estimated remote file count.
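One possible form of that comparison is sketched below. The function name and the plain "local smaller than remote" rule are assumptions for illustration; the right threshold would likely depend on the relative cost of a listing request versus a per-file existence check on each remote.

```python
def should_use_no_traverse(local_count: int, estimated_remote_count: int) -> bool:
    """Prefer per-file checks when the local cache is smaller than the remote.

    Hypothetical rule: a small local cache against a large remote favours
    no_traverse, and the opposite favours fetching the full listing.
    """
    return local_count < estimated_remote_count
```

For example, 10 local files against an estimated 2560 remote files would select no_traverse, while 100000 local files against an estimated 256 remote files would select the full listing.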