(This isn't really a Pangeo-Forge issue, but it's something I realized recently that I think is quite relevant for storing typical Pangeo datasets)
The standard practice I've seen with Pangeo is to use relatively large chunk sizes, e.g., ~100 MB. This is similar to the recommended chunk sizes for computation with dask.array.
I think this is unnecessarily large, and you could get away with much smaller chunk sizes as long as the underlying Zarr store uses concurrent/asynchronous IO and multiple Zarr chunks fit into each Dask chunk. The bigger Dask chunks keep Dask's overhead bounded (e.g., it doesn't build up unreasonably large task graphs), and the concurrent access to the Zarr store (within each Dask chunk) ensures that data can still be read from Zarr at roughly the same speed.
For example, in the NOAA OISST example, you might keep the Zarr arrays smaller (e.g., one time-step per chunk for ~4 MB per array chunk), but simply open the dataset up with different chunks, e.g., `xr.open_zarr(path, chunks={'time': '100MB'})`.
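Here's a minimal sketch of that pattern, using a toy dataset on roughly the OISST grid (720 × 1440 at float32, so ~4 MB per time step); the store name, variable name, and chunk counts are illustrative:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset shaped roughly like OISST: each time step is 720 x 1440
# float32 values, i.e. ~4 MB per time step.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"),
             np.zeros((30, 720, 1440), dtype="float32"))},
    coords={"time": pd.date_range("2020-01-01", periods=30)},
)

# Write with small Zarr chunks: one time step per chunk (~4 MB each).
ds.chunk({"time": 1}).to_zarr("oisst-small-chunks.zarr", mode="w")

# Open with much larger Dask chunks; each Dask chunk now spans many
# small Zarr chunks, which can be fetched concurrently.
ds_big = xr.open_zarr("oisst-small-chunks.zarr", chunks={"time": 25})
print(ds_big.sst.data.chunksize)  # (25, 720, 1440) -> ~100 MB per Dask chunk
```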
Why would this be a good idea?
Smaller chunks mean significantly less IO for previewing data, which makes a big difference for interactive visualization.
Smaller chunks should reduce the need for rechunking for different types of analysis.
File systems and object stores can typically store much smaller chunks of data (~1-10 MB is a typical range) than Dask prefers for array chunks without meaningful overhead, so there's a fair amount of additional flexibility available.
This mirrors the standard guidance to size thread pools based on how they are used: CPU-bound tasks should get roughly one thread per core, but IO-bound tasks can make good use of many more threads.
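As a rough illustration of that analogy (standard library only; the worker counts are illustrative, not a recommendation):

```python
import os
from concurrent.futures import ThreadPoolExecutor

n_cores = os.cpu_count() or 1

# CPU-bound work: roughly one thread per core.
cpu_pool = ThreadPoolExecutor(max_workers=n_cores)

# IO-bound work (e.g. fetching many small objects from an object store):
# far more threads than cores can be kept busy waiting on the network.
io_pool = ThreadPoolExecutor(max_workers=8 * n_cores)
```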
This is a good suggestion, and I agree with you in theory. This was one of the big motivations for pursuing async support in the first place (see discussion in zarr-developers/zarr-python#536).
However, the tradeoff is that a huge number of chunks translates to a huge number of files / objects to manage. For 1 TB of data in 4 MB chunks, there are 250,000 chunks to keep track of! This creates a big overhead on the object store if you ever want to delete / move / rename the data.