
Use automatic chunking in from_zarr #6419

Open
mrocklin wants to merge 2 commits into main

Conversation

mrocklin
Member

Previously we would use whatever chunking was present in the zarr array. However, storage formats often store smaller chunks than are ideal for compute frameworks, leading to too many tasks that are too small.

Now we just let da.from_array handle this. It already has logic to choose a good chunksize that both respects the alignment within the zarr array and the chunk size in configuration (usually 128 MiB by default).

cc @alimanfoo @sofroniewn @tlambert03
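
For illustration, a minimal sketch of the behaviour change (the mydata.zarr path is hypothetical):

import dask.array as da

# Before this PR: one dask chunk per zarr chunk, however small those were.
# After this PR: chunks defaults to "auto", so dask groups zarr chunks into
# larger blocks (bounded by the array.chunk-size config, 128 MiB by default)
# while keeping them aligned to the zarr chunk boundaries.
x = da.from_zarr("mydata.zarr")
print(x.chunks)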

@@ -2841,7 +2841,7 @@ def from_array(

 def from_zarr(
-    url, component=None, storage_options=None, chunks=None, name=None, **kwargs
+    url, component=None, storage_options=None, chunks="auto", name=None, **kwargs
Contributor

The docstring entry for chunks should probably be updated to reflect this default

Member

What would you pass to retain the previous behaviour?

Member Author

You would get the chunking from the Zarr array and pass it in explicitly:

chunks=my_zarr_array.chunks

Or, if you wanted smaller chunks, you would specify a smaller chunk size, perhaps in bytes:

chunks="1 MiB"

Member Author

This is also the same logic that we currently use for any other dataset that defines a chunks= attribute. I think the default behavior is usually optimal.

Member

Sure, but it may be worthwhile indicating in the docstring how to get the inherent chunking.

Member

Can we keep chunks=None or something similar as an easy way to get the chunks on disk? It may not be easy to construct my_zarr_array if one only has a URL, say.

Member Author

We could. Can I ask for additional motivation though? We don't currently do this for HDF5 or NetCDF or any other format. Why would we do this for Zarr? Why do we care about the old behavior? I expect that adding docs on this is just as likely to lead people astray as it is to help them.

As a reminder, the automatic chunking is decently smart and we haven't ever gotten complaints about the choices that it makes, despite pretty heavy usage. It will find a chunking that aligns with the existing chunking in storage, but is mildly larger in other dimensions if necessary.
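
As a rough sketch of that alignment behaviour (the array and chunk sizes are made up for illustration; the config value set here is just the default, made explicit):

import dask
import dask.array as da
import zarr

# Small native chunks, as a storage format might choose
z = zarr.zeros((20_000, 20_000), chunks=(500, 500), dtype="f8")

with dask.config.set({"array.chunk-size": "128MiB"}):
    x = da.from_array(z, chunks="auto")

# The heuristic aims to make each dask chunk cover a whole number of
# zarr chunks, so reads don't straddle storage chunk boundaries.
print(z.chunks, x.chunks)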

@jsignell
Member

jsignell left a comment

@dask/maintenance this looks good to me!

@TomAugspurger
Member

TomAugspurger commented Jul 20, 2020 via email

@mrocklin
Member Author

Libraries like pangeo-data/rechunker want to have exact control over the
chunk sizes so that they can reliably predict memory usage

They still have that exact control. They can specify chunks manually, just as before and just as in from_array. The only thing we've changed here is the default when the user doesn't provide any information. The result is that users will get larger chunks if their zarr arrays are chunked too finely to be optimal for Dask.

but this feels like the kind of thing that might have a negative impact.

FWIW I would be surprised to see this change have a negative impact on a typical workflow.

I haven't verified that the changes here impact rechunker or libraries like it (and won't have time to near term),

Is there someone that we can ping from that project who might be able to weigh in here?

@jsignell
Member

Maybe @rabernat could weigh in?

@rabernat
Contributor

I agree that the proposed changes would be useful in many cases. But I would prefer to see a warning + deprecation cycle, rather than a sudden change in the default behavior. I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

There may also be performance impacts. By using larger chunks than the on-disk chunks, we leave it to zarr to figure out how to read those batches of chunks. This currently occurs in a serial / blocking fashion. Switching to async in zarr (see zarr-developers/zarr-python#536) could speed that up a lot.
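
A rough sketch of what such a deprecation path could look like (illustrative only, not code from this PR; the sentinel and warning text are made up):

import warnings

_unset = object()

def from_zarr(url, component=None, storage_options=None, chunks=_unset, name=None, **kwargs):
    if chunks is _unset:
        warnings.warn(
            "from_zarr will default to chunks='auto' in a future release; "
            "pass chunks explicitly to keep the native zarr chunking",
            FutureWarning,
        )
        chunks = None  # keep the native zarr chunks during the deprecation cycle
    ...  # rest of the existing implementation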

@rabernat
Contributor

They can specify chunks manually just like before and in from_array

If you are reading an array directly from a url or path, you don't necessarily know the native chunks. In that case, you have to open up the array first with zarr, examine the metadata, and then call from_zarr. It would be nice to have some kind of option to force dask to use the native chunks without the user explicitly specifying what they are.

Downstream, it would be good to think about how xarray could use this feature effectively.

@mrocklin
Member Author

I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

If you have time I'd be curious to learn more about situations where this would break things.

There also may be performance impacts. By using larger chunks than the on-disk chunks, we leave it to zarr to figure out how to read those batches of chunks. This currently occurs in a serial / blocking fashion.

My guess is that in most of these situations there are still several chunks active. Alternatively, if we wanted to ensure some concurrency, this is something that we could push up to the more general logic in from_array: encourage dask array to use the chunksize in configuration (which we do now) unless there are very few chunks, in which case it should slim down chunk sizes (see the sketch at the end of this comment).

If you are reading an array directly from a url or path, you don't necessarily know the native chunks. In that case, you have to open up the array first with zarr, examine the metadata, and then call from_zarr. It would be nice to have some kind of option to force dask to use the native chunks without the user explicitly specifying what they are.

This is also the case with HDF/NetCDF today.

In both of these cases I would prefer that we not have special logic just for Zarr, but instead focus on the upstream from_array function so that other formats (hdf, netcdf, tiledb, ...) benefit and we have consistent behavior across formats.
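
One illustrative way that slimming-down could look (a pure sketch, not existing dask logic; the function and threshold are made up):

import math

def adjusted_target_nbytes(array_nbytes, target_nbytes, min_chunks=8):
    # If the configured target chunk size would leave fewer than min_chunks
    # chunks, shrink the target so there is still some concurrency to exploit.
    if array_nbytes < min_chunks * target_nbytes:
        return max(1, math.ceil(array_nbytes / min_chunks))
    return target_nbytes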

@jakirkham
Member

Have we tried this with a Zarr object (maybe from Pangeo)? Perhaps that would help identify relevant issues/alleviate concerns.

@mrocklin
Member Author

mrocklin commented Aug 3, 2020

I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

Checking in here. Would it be possible to learn more about these issues? Recall that the heuristics in from_array do respect native storage chunk boundaries; they just sometimes include many of them in a single dask array chunk.

I don't think that we've ever run into an issue with this policy with HDF/NetCDF.

If we're going to block progress on this PR then I think it would be good to get more information on what exactly would cause the break.

Base automatically changed from master to main March 8, 2021 20:19