Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize dircache timeout #215

Closed
TomAugspurger opened this issue Dec 3, 2019 · 2 comments
Closed

Standardize dircache timeout #215

TomAugspurger opened this issue Dec 3, 2019 · 2 comments

Comments

@TomAugspurger
Copy link
Collaborator

TomAugspurger commented Dec 3, 2019

Currently gcsfs implements a cache_timeout and s3fs might want one too (fsspec/s3fs#271, fsspec/s3fs#253). So maybe we should just add it to AbstractFileSystem.

Proposal

Adds a cache_timeout to AbstractFileSystem.__init__. Taken from the gcsfs docs

    cache_timeout: float, optional
        Cache expiration time in seconds for object metadata cache.
        Set cache_timeout <= 0 for no caching, None for no cache expiration.

As for the default, I think None makes the most sense for now since that's what gcsfs does. We can explore deprecating None in favor of something else (I think 0 for no caching) down the road.

One question: is this timeout per-key, per-directory, or global to the filesystem instance? In the gcsfs docs, we state

GCSFileSystem maintains a per-implied-directory cache of object listings and
    fulfills all object information and listing requests from cache. 

But from what I can tell, the cache is actually per object, not per-directory.

    @_tracemethod
    def _list_objects(self, path):
        ...
        listing = self._do_list_objects(path)
        retrieved_time = time.time()

        self._listing_cache[path] = (retrieved_time, listing)
        return listing

and

    @_tracemethod
    def _maybe_get_cached_listing(self, path):
        ...
        if path in self._listing_cache:
            retrieved_time, listing = self._listing_cache[path]
            cache_age = time.time() - retrieved_time
            if self.cache_timeout is not None and cache_age > self.cache_timeout:
                logger.debug(
                    "expired cache path: %s retrieved_time: %.3f cache_age: "
                    "%.3f cache_timeout: %.3f",
                    path,
                    retrieved_time,
                    cache_age,
                    self.cache_timeout,
                )
                del self._listing_cache[path]
                return None

            return listing

        return None
@TomAugspurger
Copy link
Collaborator Author

cc @martindurant

@martindurant
Copy link
Member

Should be per listing, i.e., a directory of stuff. A listing could be a single key-item in the case that the parent hadn't been previously listed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants