Use automatic chunking in from_zarr #6419
base: main
Conversation
Previously we would use whatever chunking was present in the zarr array. However, storage formats often store smaller chunks than are ideal for compute frameworks, leading to too many tasks that are each too small. Now we just let da.from_array handle this. It already has logic to choose a good chunk size that both respects the alignment within the zarr array and the chunk size in configuration (usually 128 MiB by default).
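For illustration, a minimal sketch of what the new default means in practice (the store path and component name are hypothetical; `array.chunk-size` is dask's existing configuration key):

```python
import dask
import dask.array as da

# With chunks="auto" (the new default), dask picks a chunk size close to
# the configured target while staying aligned with the zarr array's
# on-disk chunks.  "data.zarr" and "temperature" are made-up names.
with dask.config.set({"array.chunk-size": "128MiB"}):
    x = da.from_zarr("data.zarr", component="temperature")

print(x.chunksize)
```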
In dask/array/core.py:

@@ -2841,7 +2841,7 @@ def from_array(
 def from_zarr(
-    url, component=None, storage_options=None, chunks=None, name=None, **kwargs
+    url, component=None, storage_options=None, chunks="auto", name=None, **kwargs
The docstring entry for `chunks` should probably be updated to reflect this default.
What would you pass to retain the previous behaviour?
You would get the chunking from the Zarr array and pass it in explicitly:

chunks=my_zarr_array.chunks

Or, if you wanted smaller chunks, you would specify a smaller chunk size, perhaps in bytes:

chunks="1 MiB"
This is also all the same logic that we currently have for any other dataset that defines a `chunks=` attribute. I think that the default behavior is usually optimal.
Sure, but it may be worthwhile indicating in the docstring how to get the inherent chunking.
Can we keep `chunks=None` or something similar as an easy way to get the chunks on disk? It may not be easy to construct `my_zarr_array` if one only has a URL, say.
We could. Can I ask for additional motivation though? We don't currently do this for HDF5 or NetCDF or any other format. Why would we do this for Zarr? Why do we care about the old behavior? I expect that adding docs on this is just as likely to lead people astray as it is to help them.
As a reminder, the automatic chunking is decently smart and we haven't ever gotten complaints about the choices that it makes, despite pretty heavy usage. It will find a chunking that aligns with the existing chunking in storage, but is mildly larger in other dimensions if necessary.
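To make the heuristic concrete, here is a rough sketch using `normalize_chunks`, the helper behind `chunks="auto"` (the shape, chunking, and limit below are made up for illustration):

```python
import numpy as np
from dask.array.core import normalize_chunks

# Given the on-disk chunking as previous_chunks, "auto" picks larger
# chunks that stay aligned with storage while targeting the byte limit.
chunks = normalize_chunks(
    "auto",
    shape=(100_000, 10_000),
    dtype=np.float64,
    previous_chunks=(1_000, 1_000),  # e.g. the zarr array's chunks
    limit="128 MiB",
)
print(chunks)
```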
@dask/maintenance this looks good to me!
Libraries like pangeo-data/rechunker want to have exact control over the chunk sizes so that they can reliably predict memory usage. I haven't verified that the changes here impact rechunker or libraries like it (and won't have time to in the near term), but this feels like the kind of thing that might have a negative impact.
They still have that exact control. They can specify the chunks explicitly, as above.
FWIW I would be surprised to see this change have a negative impact on a typical workflow.
Is there someone that we can ping from that project who might be able to weigh in here?
Maybe @rabernat could weigh in?
I agree that the proposed changes would be useful in many cases. But I would prefer to see a warning + deprecation cycle, rather than a sudden change in the default behavior. I have quite a bit of code that relies on the assumption that the dask chunks match the zarr array's on-disk chunks.

There also may be performance impacts. By using larger chunks than the on-disk chunks, we leave it to zarr to figure out how to read those batches of chunks. This currently occurs in a serial / blocking fashion. Switching to async in zarr (see zarr-developers/zarr-python#536) could speed that up a lot.
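To illustrate the kind of assumption at stake, a hypothetical check (not code from rechunker or any real downstream project):

```python
import dask.array as da
import zarr

z = zarr.open("data.zarr", mode="r")  # hypothetical store
x = da.from_zarr("data.zarr")

# Under the old default (chunks=None) this held; with chunks="auto"
# dask may choose larger chunks and the assertion can fail.
assert x.chunksize == z.chunks
```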
If you are reading an array directly from a url or path, you don't necessarily know the native chunks. In that case, you have to open up the array first with zarr, examine the metadata, and then call from_zarr with the chunks passed explicitly.

Downstream, it would be good to think about how xarray could use this feature effectively.
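A rough sketch of that workaround, assuming zarr can open the URL directly (recent zarr versions route URLs through fsspec; the bucket path is hypothetical):

```python
import dask.array as da
import zarr

url = "s3://my-bucket/data.zarr"        # hypothetical location
z = zarr.open(url, mode="r")            # inspect the metadata first
x = da.from_zarr(url, chunks=z.chunks)  # then reuse the native chunking
```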
If you have time I'd be curious to learn more about situations where this would break things.

My guess is that in most of these situations there are several chunks still active. Alternatively, if we wanted to ensure some concurrency then this is something that we could push up to the more general logic in da.from_array.

This is also the case with HDF/NetCDF today. In both of these cases I would prefer that we not have special logic just for Zarr, but instead focus on the more general da.from_array logic.
Have we tried this with a Zarr object (maybe from Pangeo)? Perhaps that would help identify relevant issues/alleviate concerns.
Checking in here. Would it be possible to learn more about these issues? Recall that the heuristics in da.from_array choose chunks that stay aligned with the on-disk chunking. I don't think that we've ever run into an issue with this policy with HDF/NetCDF.

If we're going to block progress on this PR then I think it would be good to get more information on what exactly would cause the break.
cc @alimanfoo @sofroniewn @tlambert03