
Sharding array chunks across hashed sub-directories #115

Open
shoyer opened this issue May 9, 2021 · 4 comments
Labels
protocol-extension Protocol extension related issue

Comments


shoyer commented May 9, 2021

Consider the case where we want to concurrently store and read many array chunks (e.g., millions). This is a pretty reasonable thing to ask of many distributed storage systems, but not with Zarr's default keys for chunks of the form {array_name}/{i}.{j}.{k}.

If we store a 10 TB array as a million 10 MB files in a single directory with sequential names, it would violate the performance guidelines these storage systems publish, which generally recommend spreading load across key prefixes rather than using sequential names!

It seems like a better strategy would be to store array chunks (and possibly other Zarr keys) across multiple sub-directories, where the sub-directory name is some apparently random but deterministic function of the keys, e.g., of the form {array_name}/{hash_value}/{i}.{j}.{k} or {hash_value}/{array_name}/{i}.{j}.{k}, where hash_value is produced by applying any reasonable hash function to the original key {array_name}/{i}.{j}.{k}.

The right number of hash buckets would depend on the performance characteristics of the underlying storage system. But the key feature is that the random prefixes/directory names make it easier to shard load, and avoid the typical performance bottleneck of reading/writing a bunch of nearby keys at the same time.
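To make that concrete, here is a minimal sketch (purely illustrative; the function name, hash choice, and bucket count are placeholders, not part of any proposal) of how such a key transformation might look in Python:

```python
import hashlib

def hashed_chunk_key(array_name, chunk_coords, num_buckets=256):
    """Illustrative only: map a chunk to {array_name}/{hash_value}/{i}.{j}.{k}.

    Any reasonable, stable hash of the original key works; here we take the
    first four bytes of an MD5 digest and reduce them modulo num_buckets.
    """
    coords = ".".join(str(c) for c in chunk_coords)
    original_key = f"{array_name}/{coords}"
    digest = hashlib.md5(original_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{array_name}/{bucket:02x}/{coords}"

print(hashed_chunk_key("temperature", (0, 1, 2)))
# -> "temperature/<bucket>/0.1.2", where <bucket> is a deterministic but
#    pseudo-random two-hex-digit prefix
```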

Ideally the specific hashing function / naming scheme (including the number of buckets) would be stored as part of the Zarr metadata in a standard way, so as to facilitate reading/writing data with different implementations.
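A hedged sketch of what such metadata might need to capture (none of these field names exist in the spec today; they are only placeholders):

```python
# Hypothetical metadata block describing the key-encoding scheme; the field
# names are placeholders, not part of the Zarr spec.
chunk_key_encoding = {
    "type": "hashed",                  # name of the sharding/naming scheme
    "hash_function": "md5",            # hash applied to the original chunk key
    "num_buckets": 256,                # number of hashed sub-directories
    "layout": "{array_name}/{hash_value}/{chunk_coords}",
}
```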

Any thoughts? Has anyone considered this sort of solution, or encountered these scaling challenges in the wild? I don't quite have a strong need for this feature yet, but I imagine I may soon.

@shoyer shoyer changed the title Sharding array chunks across many directories Sharding array chunks across hashed sub-directories May 9, 2021
joshmoore (Member) commented

See the extended conversation starting at https://gitter.im/zarr-developers/community?at=601d471fc83ec358be27944f

Quick summary: @d-v-b points out that, at least for S3, the nested storage strategy that has been put in place suffices to achieve the "5500 GET requests per second per prefix in a bucket" described under https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html

I.e., my hope with dimension_separator (zarr-developers/zarr-python#715) was certainly to achieve just that (on S3), but I would definitely not be surprised to learn that this doesn't hold for all storage backends.
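For readers who haven't used it, this is roughly how the nested layout is requested in zarr-python (2.8+, where the dimension_separator keyword landed; exact keyword support may differ by version):

```python
import zarr

# Default flat layout: chunk keys look like "0.0.0", "0.0.1", ...
flat = zarr.open("flat_example.zarr", mode="w",
                 shape=(100, 100, 100), chunks=(10, 10, 10), dtype="f4")

# Nested layout: chunk keys look like "0/0/0", "0/0/1", ..., so chunks are
# spread across sub-directories (distinct key prefixes on stores like S3).
nested = zarr.open("nested_example.zarr", mode="w",
                   shape=(100, 100, 100), chunks=(10, 10, 10), dtype="f4",
                   dimension_separator="/")
```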


shoyer commented May 13, 2021

@joshmoore thanks for the pointers!

I agree, nested storage solves many of these issues. In fact, we have already been using it in some cases to solve exactly this problem.

My main concern is that it only works well if your arrays have many chunks along multiple dimensions. But it's not uncommon to have most or all of your chunks along a single dimension, e.g., a bunch of images stacked only along the "time" dimension. In these cases, you would still end up with either a very large number of sequentially named sub-directories or filenames, depending on dimension order.
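A quick illustration of that failure mode, for a hypothetical array chunked only along its first ("time") axis:

```python
# Hypothetical array of shape (t, y, x) chunked only along the "time" axis,
# so every chunk has coordinates (t, 0, 0).
for t in range(5):
    flat_key = f"{t}.0.0"    # '.' separator: one flat directory, sequential names
    nested_key = f"{t}/0/0"  # '/' separator: sequential top-level sub-directories
    print(flat_key, "->", nested_key)
# In both layouts the names are sequential, which is exactly the access
# pattern the storage providers' guidelines warn about.
```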

Hashing seems like a more comprehensive fix.


shoyer commented Jun 2, 2021

As a point of reference, it looks like Neuroglancer & TensorStore derive chunk file names from a "compressed morton code":
https://github.com/google/neuroglancer/blob/v2.22/src/neuroglancer/datasource/precomputed/volume.md#sharded-chunk-storage
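For reference, a minimal sketch of plain (uncompressed) Morton / Z-order encoding; the "compressed" variant described at that link refines this by only interleaving as many bits as each dimension actually needs:

```python
def morton_code(x, y, z, bits=10):
    """Interleave the bits of 3-D chunk grid coordinates into one integer.

    This is the plain Z-order curve; Neuroglancer's "compressed" variant
    drops bits that a small dimension can never set (see the linked spec).
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

print(hex(morton_code(3, 5, 1)))  # -> 0x8f
```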

jakirkham (Member) commented

This has some similarities to the proposal in issue #82.

@jstriebel jstriebel added the protocol-extension Protocol extension related issue label Nov 16, 2022