how to disable implicit directory listing? #285

rabernat · 2020-02-14T18:54:58Z

I do not like how s3fs implicitly lists directories. This can lead to extreme performance degradation for directories with large numbers of objects. Here is a simple example

import fsspec

# get a public object from s3 via regular http
url_base = 'https://mur-sst.s3.us-west-2.amazonaws.com/zarr/'
mapper_http = fsspec.get_mapper(url_base)
%time _ = mapper_http['analysed_sst/.zarray']
# > Wall time: 280 ms

# now do this with s3fs
file_location = 's3://mur-sst/zarr'
mapper = fsspec.get_mapper(file_location, anon=True)
%time _ = mapper['analysed_sst/.zarray']
# > Wall time: 41.4 s

The reason the second version is so slow is that fsspec is listing the directory contents, even though I know exactly which object I would like. Note that this behavior is unchanged if I add check=False.

Explicit is better than implicit is a a key part of the Zen of python. I would really prefer if this directory listing / caching had to be enabled explicitly, with the default behavior being simply passing through the objects with as little overhead as possible.

This issue is also raised in #279.

cc @cgentemann

The text was updated successfully, but these errors were encountered:

martindurant · 2020-02-14T18:59:55Z

It would be simple enough to have s3fs get the metadata for only the one file of interest for info and open actions. However, this is still a question of preference: there will be cases where you want to access all the files in some directory, and it would be much faster to do the one list operation and thereafter read from the cache. How do you suppose the API for expressing this preference might look?

rabernat · 2020-02-14T19:01:55Z

How do you suppose the API for expressing this preference might look?

mapper = fsspec.get_mapper(file_location, anon=True, cache_directory_listing=True)

where the default (cache_directory_listing=False) would have the same behavior and same performance as HTTPFileSystem.

rabernat · 2020-02-14T19:02:33Z

Or even keep caching as is, but just don't list directories unless ls or similar operations are explicitly called.

martindurant · 2020-02-14T19:06:32Z

Right, there are two preferences to choose on:

whether info calls ls if the file isn't already cached
how we cache listings, whether they expire, etc.

If we don't cache at all, obviously info->ls is a bad idea; if we do, then it may or may not make sense, depending on the situation.

It's worth pointing out that the cost to list a directory is the same as to get details on a single key for small listings (<1000s of entries), so there would be no good reason not to, unless you expect the bucket contents to be volatile.

martindurant · 2020-02-14T19:14:40Z

Ref fsspec/filesystem_spec#216
(@TomAugspurger , you may have thoughts on this)

martindurant · 2020-02-14T19:47:16Z

The simplest change looks like this

--- a/s3fs/core.py
+++ b/s3fs/core.py
@@ -451,7 +451,7 @@ class S3FileSystem(AbstractFileSystem):
                 raise ValueError("version_id cannot be specified if the "
                                  "filesystem is not version aware")
             kwargs['VersionId'] = version_id
-        if self.version_aware:
+        if self.version_aware or self._ls_from_cache(path) is None:
             try:
                 bucket, key = self.split_path(path)
                 out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,

rabernat · 2020-02-17T03:12:38Z

Hi @martindurant -- could you give some guidance on the best way forward to address this issue? I'm keen to fix this, since it has major performance impacts on datasets stored in S3.

martindurant · 2020-02-17T15:36:28Z

Can you try with the small change above? I am prepared to include it in the codebase, but we still do need a more rigorous way to deal with this.

rabernat · 2020-02-25T21:33:32Z

Hi Martin...Sorry I did not get a chance to test this before leaving for vacation last week. Thanks for your quick fix.

martindurant · 2020-02-25T21:59:56Z

pleasure :)

rabernat changed the title ~~how to disable implicit directory listing~~ how to disable implicit directory listing? Feb 14, 2020

rabernat mentioned this issue Feb 16, 2020

before loading MUR, explain issues with performance cgentemann/osm2020tutorial#6

Closed

martindurant mentioned this issue Feb 21, 2020

Let info use HEAD if listing is not cached #286

Merged

martindurant closed this as completed in #286 Feb 25, 2020

rabernat mentioned this issue Mar 5, 2020

Errors using to_zarr for an s3 store pydata/xarray#3831

Closed

rabernat mentioned this issue Mar 30, 2020

async in zarr zarr-developers/zarr-python#536

Closed

rabernat mentioned this issue Apr 27, 2020

Rechunking zarr file on S3 becomes very slow as number of initialized chunks increase zarr-developers/zarr-python#559

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how to disable implicit directory listing? #285

how to disable implicit directory listing? #285

rabernat commented Feb 14, 2020

martindurant commented Feb 14, 2020

rabernat commented Feb 14, 2020

rabernat commented Feb 14, 2020

martindurant commented Feb 14, 2020

martindurant commented Feb 14, 2020

martindurant commented Feb 14, 2020

rabernat commented Feb 17, 2020

martindurant commented Feb 17, 2020

rabernat commented Feb 25, 2020

martindurant commented Feb 25, 2020

how to disable implicit directory listing? #285

how to disable implicit directory listing? #285

Comments

rabernat commented Feb 14, 2020

martindurant commented Feb 14, 2020

rabernat commented Feb 14, 2020

rabernat commented Feb 14, 2020

martindurant commented Feb 14, 2020

martindurant commented Feb 14, 2020

martindurant commented Feb 14, 2020

rabernat commented Feb 17, 2020

martindurant commented Feb 17, 2020

rabernat commented Feb 25, 2020

martindurant commented Feb 25, 2020