
Use automatic chunking in from_zarr #6419

Open
mrocklin wants to merge 2 commits into main

Conversation

mrocklin
Member

Previously we would use whatever chunking was present in the zarr array. However, storage formats often store smaller chunks than are ideal for compute frameworks, leading to too many tasks that are too small.

Now we just let da.from_array handle this. It already has logic to choose a good chunksize that both respects the alignment within the zarr array and the chunk size in configuration (usually 128 MiB by default).

cc @alimanfoo @sofroniewn @tlambert03
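
For illustration, a minimal sketch of the behaviour change (the mydata.zarr path is hypothetical):

import dask.array as da

# Before this PR: one dask chunk per zarr chunk, however small those were.
# After this PR: chunks defaults to "auto", so dask groups zarr chunks into
# larger blocks (bounded by the array.chunk-size config, 128 MiB by default)
# while keeping them aligned to the zarr chunk boundaries.
x = da.from_zarr("mydata.zarr")
print(x.chunks)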

@@ -2841,7 +2841,7 @@ def from_array(

 def from_zarr(
-    url, component=None, storage_options=None, chunks=None, name=None, **kwargs
+    url, component=None, storage_options=None, chunks="auto", name=None, **kwargs
Contributor

The docstring entry for chunks should probably be updated to reflect this default

Member

What would you pass to retain the previous behaviour?

Member Author

You would get the chunking from the Zarr array and pass it in explicitly:

chunks=my_zarr_array.chunks

Or, if you wanted smaller chunks, you would specify a smaller chunk size, perhaps in bytes:

chunks="1 MiB"

Member Author

This is also the same logic that we currently use for any other dataset that defines a chunks= attribute. I think the default behavior is usually optimal.

Member

Sure, but it may be worthwhile indicating in the docstring how to get the inherent chunking.

Member

Can we keep chunks=None or something similar as an easy way to get the chunks on disk? It may not be easy to construct my_zarr_array if one only has a URL, say.

Member Author

We could. Can I ask for additional motivation though? We don't currently do this for HDF5 or NetCDF or any other format. Why would we do this for Zarr? Why do we care about the old behavior? I expect that adding docs on this is just as likely to lead people astray as it is to help them.

As a reminder, the automatic chunking is decently smart and we haven't ever gotten complaints about the choices that it makes, despite pretty heavy usage. It will find a chunking that aligns with the existing chunking in storage, but is mildly larger in other dimensions if necessary.
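
As a rough sketch of that alignment behaviour (the array and chunk sizes are made up for illustration; the config value set here is just the default, made explicit):

import dask
import dask.array as da
import zarr

# Small native chunks, as a storage format might choose
z = zarr.zeros((20_000, 20_000), chunks=(500, 500), dtype="f8")

with dask.config.set({"array.chunk-size": "128MiB"}):
    x = da.from_array(z, chunks="auto")

# The heuristic aims to make each dask chunk cover a whole number of
# zarr chunks, so reads don't straddle storage chunk boundaries.
print(z.chunks, x.chunks)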

@jsignell
Member

jsignell left a comment

@dask/maintenance this looks good to me!

@TomAugspurger
Member

TomAugspurger commented Jul 20, 2020 via email

@mrocklin
Member Author

Libraries like pangeo-data/rechunker want to have exact control over the
chunk sizes so that they can reliably predict memory usage

They still have that exact control. They can specify chunks manually, just as before and just as in from_array. The only thing we've changed here is the default when the user doesn't provide any information. The result is that users will get larger chunks if their zarr arrays are chunked too finely to be optimal for Dask.

but this feels like the kind of thing that might have a negative impact.

FWIW I would be surprised to see this change have a negative impact on a typical workflow.

I haven't verified that the changes here impact rechunker or libraries like it (and won't have time to near term),

Is there someone that we can ping from that project who might be able to weigh in here?

@jsignell
Member

Maybe @rabernat could weigh in?

@rabernat
Contributor

I agree that the proposed changes would be useful in many cases. But I would prefer to see a warning + deprecation cycle, rather than a sudden change in the default behavior. I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

There may also be performance impacts. By using larger chunks than the on-disk chunks, we leave it to zarr to figure out how to read those batches of chunks. This currently occurs in a serial / blocking fashion. Switching to async in zarr (see zarr-developers/zarr-python#536) could speed that up a lot.
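
A rough sketch of what such a deprecation path could look like (illustrative only, not code from this PR; the sentinel and warning text are made up):

import warnings

_unset = object()

def from_zarr(url, component=None, storage_options=None, chunks=_unset, name=None, **kwargs):
    if chunks is _unset:
        warnings.warn(
            "from_zarr will default to chunks='auto' in a future release; "
            "pass chunks explicitly to keep the native zarr chunking",
            FutureWarning,
        )
        chunks = None  # keep the native zarr chunks during the deprecation cycle
    ...  # rest of the existing implementation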

@rabernat
Contributor

They can specify chunks manually just like before and in from_array

If you are reading an array directly from a url or path, you don't necessarily know the native chunks. In that case, you have to open up the array first with zarr, examine the metadata, and then call from_zarr. It would be nice to have some kind of option to force dask to use the native chunks without the user explicitly specifying what they are.

Downstream, it would be good to think about how xarray could use this feature effectively.

@mrocklin
Member Author

I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

If you have time I'd be curious to learn more about situations where this would break things.

There also may be performance impacts. By using larger chunks than the on-disk chunks, we leave it to zarr to figure out how to read those batches of chunks. This currently occurs in a serial / blocking fashion.

My guess is that in most of these situations there are still several chunks active. Alternatively, if we wanted to ensure some concurrency, this is something that we could push up to the more general logic in from_array: encourage dask array to use the chunksize in configuration (which we do now) unless there are very few chunks, in which case it should slim down chunk sizes (see the sketch at the end of this comment).

If you are reading an array directly from a url or path, you don't necessarily know the native chunks. In that case, you have to open up the array first with zarr, examine the metadata, and then call from_zarr. It would be nice to have some kind of option to force dask to use the native chunks without the user explicitly specifying what they are.

This is also the case with HDF/NetCDF today.

In both of these cases I would prefer that we not have special logic just for Zarr, but instead focus on the upstream from_array function so that other formats (hdf, netcdf, tiledb, ...) benefit and we have consistent behavior across formats.
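
One illustrative way that slimming-down could look (a pure sketch, not existing dask logic; the function and threshold are made up):

import math

def adjusted_target_nbytes(array_nbytes, target_nbytes, min_chunks=8):
    # If the configured target chunk size would leave fewer than min_chunks
    # chunks, shrink the target so there is still some concurrency to exploit.
    if array_nbytes < min_chunks * target_nbytes:
        return max(1, math.ceil(array_nbytes / min_chunks))
    return target_nbytes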

@jakirkham
Member

Have we tried this with a Zarr object (maybe from Pangeo)? Perhaps that would help identify relevant issues/alleviate concerns.

@mrocklin
Member Author

mrocklin commented Aug 3, 2020

I have quite a bit of code that relies on the assumption that from_zarr will return the native zarr chunks.

Checking in here. Would it be possible to learn more about these issues? Recall that the heuristics in from_array do respect native storage chunk boundaries; they just sometimes include many of them in a single dask array chunk.

I don't think that we've ever run into an issue with this policy with HDF/NetCDF.

If we're going to block progress on this PR then I think it would be good to get more information on what exactly would cause the break.

Base automatically changed from master to main March 8, 2021 20:19