Add cache timeout and directory cache #216

Closed
TomAugspurger wants to merge 2 commits from the dircache branch

Conversation

TomAugspurger
Collaborator

Closes #215

WIP for now, I need to incorporate #215 (comment).

@TomAugspurger
Collaborator Author

@martindurant Is there a standard schema for what entries in DirCache should look like? I see that github and ftp use them. In fsspec/spec.py, it's only used in _ls_from_cache, which seems to expect something like

{
    "path": {
        "name": name,
    }
} 

Standardizing the structure here would be good (maybe in a dataclass?), but I'm not sure what all to include yet.

@TomAugspurger
Collaborator Author

For ftp, the values in DirCache are List[Tuple[str, Dict]], where the first item of each tuple is the path of the element, and the dict has the keys modify, name, perm, size, type, unique.

(Pdb) pp out[:2]
[('__init__.py',
  {'modify': '20190813183127',
   'name': '/__init__.py',
   'perm': 'r',
   'size': 0,
   'type': 'file',
   'unique': '1000004g2058413d7'}),
 ('__pycache__',
  {'modify': '20191127162327',
   'name': '/__pycache__',
   'perm': 'el',
   'size': 0,
   'type': 'dir',
   'unique': '1000004g206bad42b'})]

Those are all the files under the path (the key).

For github we just have a List[Dict], and the keys in the dict are name, mode, type, size, sha.

These are inconsistent. At the moment, I'm leaning toward a namedtuple structure like

CacheItem = namedtuple("CacheItem", ["name", "details"])

where name is a string, and details is a dict with anything. Hopefully that will suffice.
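As a rough sketch of how that could look, the snippet below wraps both shapes described above in the same CacheItem. The helper names from_ftp and from_github are hypothetical, and the github values are made up for illustration; only the key names come from this thread.

```python
from collections import namedtuple

CacheItem = namedtuple("CacheItem", ["name", "details"])


def from_ftp(entries):
    """Hypothetical helper: entries is a list of (path, details) tuples."""
    # The full path lives in details["name"]; the tuple's first item is relative.
    return [CacheItem(details["name"], dict(details)) for _, details in entries]


def from_github(entries):
    """Hypothetical helper: entries is a list of dicts carrying a 'name' key."""
    return [CacheItem(entry["name"], dict(entry)) for entry in entries]


# Shaped like the ftp pdb output above (trimmed to a few keys).
ftp_listing = [
    ("__init__.py", {"name": "/__init__.py", "size": 0, "type": "file"}),
    ("__pycache__", {"name": "/__pycache__", "size": 0, "type": "dir"}),
]

# Illustrative values only; the keys are name, mode, type, size, sha.
github_listing = [
    {"name": "README.md", "mode": "100644", "type": "file", "size": 12, "sha": "abc"},
]

ftp_items = from_ftp(ftp_listing)
github_items = from_github(github_listing)
```

Both lists can then be handled the same way: item.name for the path, item.details for whatever the backend reported.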

@martindurant
Member

The canonical structure should be:

{"cached_path": [
    {"name": "file_path",
     "size": 10,
     "type": "file"},
    ...
   ]
}

The FTP case is clearly based on the output of the client library, and ought to be processed into canonical form, as is done for s3, gcs...
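A minimal sketch of that processing step, assuming the raw FTP listing has the (path, details) shape shown in the pdb output earlier in the thread. The function name to_canonical is hypothetical, and mapping "dir" to "directory" is an assumption about the desired type values, not something confirmed here.

```python
def to_canonical(entries):
    """Turn [(path, details), ...] into a list of dicts with name, size, type."""
    out = []
    for _, details in entries:
        info = dict(details)  # copy so the client library's output is untouched
        info["size"] = int(info.get("size", 0))
        # Assumption: directories are reported as "dir" and stored as "directory".
        if info.get("type") == "dir":
            info["type"] = "directory"
        out.append(info)
    return out


raw = [
    ("__init__.py", {"modify": "20190813183127", "name": "/__init__.py",
                     "perm": "r", "size": 0, "type": "file"}),
    ("__pycache__", {"modify": "20191127162327", "name": "/__pycache__",
                     "perm": "el", "size": 0, "type": "dir"}),
]

# The key is the path that was listed; the value is the canonical listing.
dircache = {"/": to_canonical(raw)}
```

Extra keys such as modify and perm are kept alongside name, size, and type, so nothing the server reported is lost.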

@martindurant
Member

martindurant commented Dec 5, 2019

i.e., the key is the path that we did a listing for
(but I'm fine with the inner structure being dict-like too, so we can find an entry quickly; however, it may be possible on, e.g., s3, that two identical names exist, one as a prefix and one as a file)
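One possible way to get fast lookup without losing the duplicate-name case is to key the inner mapping on both name and type. This is only a sketch: index_listing is a hypothetical helper and the listing values are invented.

```python
def index_listing(listing):
    """Index a canonical listing by (name, type) so same-named entries coexist."""
    return {(entry["name"], entry["type"]): entry for entry in listing}


# Invented example: the same name exists both as a file and as a prefix.
listing = [
    {"name": "bucket/data", "size": 10, "type": "file"},
    {"name": "bucket/data", "size": 0, "type": "directory"},
]

index = index_listing(listing)
file_entry = index[("bucket/data", "file")]
```

Keying on name alone would silently keep only the last of the two entries.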

@martindurant
Member

This still looks useful to me

@martindurant
Member

Superseded by #243

@TomAugspurger TomAugspurger deleted the dircache branch December 22, 2020 17:12
Development

Successfully merging this pull request may close these issues.

Standardize dircache timeout
2 participants