
Consider smaller Zarr chunks with async IO #89

Open
shoyer opened this issue Mar 31, 2021 · 1 comment

Comments


shoyer commented Mar 31, 2021

(This isn't really a Pangeo-Forge issue, but it's something I realized recently that I think is quite relevant for storing typical Pangeo datasets)

The standard practice I've seen with Pangeo is to use relatively large chunk sizes, e.g., ~100 MB. This is similar to the recommended chunk sizes for computation with dask.array.

I think this is unnecessarily large, and you could get away with much smaller chunk sizes as long as the underlying Zarr store uses concurrent/asynchronous IO and multiple Zarr chunks fit into each Dask chunk. The bigger Dask chunks keep Dask's overhead bounded (e.g., it doesn't build up unreasonably large task graphs), and the concurrent access to the Zarr store (within each Dask chunk) ensures that data from Zarr can still be read at roughly the same speed.

For example, in the NOAA OISST example, you might keep the Zarr chunks smaller (e.g., one time step per chunk, for ~4 MB per chunk), but simply open the dataset with larger Dask chunks, e.g., xr.open_zarr(path, chunks={'time': '100MB'}).
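Here's a minimal sketch of that pattern (the store path and array contents are made up for illustration; the sizes assume a 720 x 1440 float32 grid, so one time step is ~4 MB):

```python
import dask.array as da
import xarray as xr

# Write with one time step per Zarr chunk (~4 MB each for a
# 720 x 1440 float32 grid). The data here are just placeholders.
sst = da.zeros((365, 720, 1440), chunks=(1, 720, 1440), dtype="float32")
ds = xr.Dataset({"sst": (("time", "lat", "lon"), sst)})
ds.to_zarr("oisst_small_chunks.zarr", mode="w")

# Re-open with ~100 MB Dask chunks: each Dask chunk now spans many Zarr
# chunks, which a concurrent/async store can fetch in parallel.
ds_big = xr.open_zarr("oisst_small_chunks.zarr", chunks={"time": "100MB"})
```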

Why would this be a good idea?

  • Smaller chunks mean significantly reduced IO for previewing data and make a big difference for interactive visualization.
  • Smaller chunks should reduce the need for rechunking for different types of analysis.
  • File systems and object stores can typically store much smaller chunks of data (~1-10 MB is a typical range) efficiently, without the overhead that pushes dask toward larger array chunks. So there's a fair amount of additional flexibility available.
  • This mirrors the standard guidance to size thread-pools based on how they are used: executing many CPU-bound tasks should use 1 thread/core, but IO-bound tasks can make good use of many more threads (see the sketch after this list).
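To illustrate the thread-pool point in the last bullet, here's a rough, self-contained sketch (not tied to any Zarr or fsspec internals; fetch_chunk is a hypothetical stand-in for reading one small chunk from object storage):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(key: str) -> bytes:
    """Stand-in for reading one small Zarr chunk from object storage."""
    time.sleep(0.05)  # simulate network latency; the thread mostly waits
    return b"\x00" * 4_000_000  # ~4 MB payload

keys = [f"sst/{i}.0.0" for i in range(64)]

# IO-bound work can use many more threads than cores, because the threads
# spend most of their time blocked on the network. CPU-bound work would
# instead want roughly one thread per core.
with ThreadPoolExecutor(max_workers=32) as pool:
    chunks = list(pool.map(fetch_chunk, keys))
```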

rabernat commented Apr 1, 2021

This is a good suggestion, and I agree with you in theory. This was one of the big motivations for pursuing async support in the first place (see discussion in zarr-developers/zarr-python#536).

However, the tradeoff is that a huge number of chunks translates to a huge number of files / objects to manage. For 1 TB of data in 4 MB chunks, there are 250,000 chunks to keep track of! This creates a big overhead on the object store if you ever want to delete / move / rename the data.
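For concreteness, a back-of-the-envelope comparison of object counts (decimal units, matching the numbers above):

```python
total_bytes = 1_000_000_000_000        # 1 TB

print(total_bytes // 4_000_000)        # 4 MB chunks   -> 250000 objects
print(total_bytes // 100_000_000)      # 100 MB chunks -> 10000 objects
```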
