how to disable implicit directory listing? #285
Comments
It would be simple enough to have s3fs get the metadata for only the one file of interest for info and open actions. However, this is still a question of preference: there will be cases where you want to access all the files in some directory, and it would be much faster to do the one list operation and thereafter read from the cache. How do you suppose the API for expressing this preference might look?
mapper = fsspec.get_mapper(file_location, anon=True, cache_directory_listing=True) where the default ( |
Or even keep caching as is, but just don't list directories unless |
Right, there are two preferences to choose on:
If we don't cache at all, obviously info->ls is a bad idea; if we do, then it may or may not make sense, depending on the situation. It's worth pointing out that, for small listings (<1000s of entries), the cost to list a directory is the same as to get details on a single key, so there would be no good reason not to list, unless you expect the bucket contents to be volatile.
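As a rough illustration of that trade-off, here is a minimal sketch that only counts requests and ignores latency and payload size. The function names are hypothetical, and the page size of 1000 keys per list request is an assumption based on the S3 listing limit.

```python
import math

PAGE_SIZE = 1000  # assumed maximum number of keys returned by one list request


def requests_with_listing(n_objects_in_dir):
    """Requests needed to list a directory once; later lookups hit the cache."""
    return max(1, math.ceil(n_objects_in_dir / PAGE_SIZE))


def requests_with_head(n_files_accessed):
    """Requests needed when each accessed file gets its own metadata call."""
    return n_files_accessed


# Small directory, one file wanted: both strategies cost one request.
small = (requests_with_listing(500), requests_with_head(1))

# Huge directory, one file wanted: listing is far more expensive.
huge = (requests_with_listing(1_000_000), requests_with_head(1))

# Small directory, every file wanted: listing once is far cheaper.
scan = (requests_with_listing(500), requests_with_head(500))
```

Under these assumptions, listing wins when many files in a modest directory are read, and a single metadata call wins when one known object sits in a very large directory.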
Ref fsspec/filesystem_spec#216
The simplest change looks like this:
--- a/s3fs/core.py
+++ b/s3fs/core.py
@@ -451,7 +451,7 @@ class S3FileSystem(AbstractFileSystem):
                 raise ValueError("version_id cannot be specified if the "
                                  "filesystem is not version aware")
             kwargs['VersionId'] = version_id
-        if self.version_aware:
+        if self.version_aware or self._ls_from_cache(path) is None:
             try:
                 bucket, key = self.split_path(path)
                 out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
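To illustrate the effect of this change, here is a minimal self-contained sketch. It is not s3fs code: `ToyStore`, `ToyFS`, and the request counters are hypothetical stand-ins for the bucket and the filesystem. `info` consults the listing cache first and, on a miss, issues a single head request instead of listing the parent directory.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a bucket; counts requests."""

    def __init__(self, objects):
        self.objects = dict(objects)  # key -> size in bytes
        self.head_calls = 0
        self.list_calls = 0

    def head(self, key):
        self.head_calls += 1
        return {"name": key, "size": self.objects[key]}

    def list(self, prefix):
        self.list_calls += 1
        return [{"name": k, "size": s} for k, s in self.objects.items()
                if k.startswith(prefix + "/")]


class ToyFS:
    """Hypothetical filesystem wrapper with a directory-listing cache."""

    def __init__(self, store):
        self.store = store
        self.dircache = {}  # directory -> list of entries

    def ls(self, directory):
        # Explicit listing: one list request, then cached.
        if directory not in self.dircache:
            self.dircache[directory] = self.store.list(directory)
        return self.dircache[directory]

    def _ls_from_cache(self, path):
        # Return the cached listing of the parent directory, or None.
        parent = path.rsplit("/", 1)[0]
        return self.dircache.get(parent)

    def info(self, path):
        cached = self._ls_from_cache(path)
        if cached is None:
            # Parent not listed yet: fetch only this object's metadata.
            return self.store.head(path)
        for entry in cached:
            if entry["name"] == path:
                return entry
        raise FileNotFoundError(path)


store = ToyStore({f"bucket/data/{i}": i for i in range(5)})
fs = ToyFS(store)
first = fs.info("bucket/data/3")   # cache miss: one head request, no listing
fs.ls("bucket/data")               # explicit listing fills the cache
second = fs.info("bucket/data/4")  # served from the cache
```

In this toy model a lookup never triggers a listing on its own; the cache is used only once something has explicitly listed the directory.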
Hi @martindurant -- could you give some guidance on the best way forward to address this issue? I'm keen to fix this, since it has major performance impacts on datasets stored in S3.
Can you try with the small change above? I am prepared to include it in the codebase, but we still do need a more rigorous way to deal with this.
Hi Martin... Sorry I did not get a chance to test this before leaving for vacation last week. Thanks for your quick fix.
pleasure :)
I do not like how s3fs implicitly lists directories. This can lead to extreme performance degradation for directories with large numbers of objects. Here is a simple example
The reason the second version is so slow is that fsspec is listing the directory contents, even though I know exactly which object I would like. Note that this behavior is unchanged if I add check=False.
"Explicit is better than implicit" is a key part of the Zen of Python. I would really prefer if this directory listing / caching had to be enabled explicitly, with the default behavior being simply passing through the objects with as little overhead as possible.
This issue is also raised in #279.
cc @cgentemann