
Xarray open_mfdataset with engine Zarr #4187

Merged (105 commits, Sep 22, 2020)

Conversation

@weiji14 (Contributor) commented Jun 30, 2020:

Work on enabling `xr.open_dataset(..., engine="zarr")`, to complement `xr.open_zarr`. This also allows `xr.open_mfdataset(..., engine="zarr")` to be used.

Note: Credit should be given to @Mikejmnez, I'm just continuing this on from #4003.
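A minimal usage sketch of what this PR enables (the store paths are placeholders, and combining multiple stores assumes they share compatible coordinates):

```python
import xarray as xr

# open a single zarr store through the generic entry point
ds = xr.open_dataset("store.zarr", engine="zarr", chunks="auto")

# open and combine several zarr stores in one call, which previously
# required looping over separate xr.open_zarr calls
ds_multi = xr.open_mfdataset(
    ["store1.zarr", "store2.zarr"], engine="zarr", combine="by_coords"
)
```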

@dcherian requested a review from @rabernat on August 21, 2020
@shoyer (Member) commented Sep 11, 2020:

> Just to note down a few things:
>
>   1. The deprecated "auto_chunk" kwarg was removed
>   2. open_zarr uses chunks="auto" by default, whereas open_dataset uses chunks=None (see my comment inline)
>
> The different default chunk behaviour (point 2) is worth raising, and it might be best to postpone the deprecation of open_zarr until the next release, so that there's time to discuss what we want the default setting to be (None or auto).

These all sound good to me!

I agree that we shouldn't change the default behavior for open_dataset, and should keep open_zarr around for now -- there is no urgent need to deprecate it.
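To illustrate the two defaults under discussion (a sketch; `store.zarr` is a placeholder path):

```python
import xarray as xr

# open_zarr defaults to chunks="auto": variables come back as dask
# arrays chunked along the on-disk zarr chunk boundaries
ds_dask = xr.open_zarr("store.zarr")

# open_dataset defaults to chunks=None: variables are lazily loaded
# and convert to plain numpy arrays on access, with no dask involved
ds_lazy = xr.open_dataset("store.zarr", engine="zarr")

# passing chunks="auto" explicitly recovers the open_zarr behaviour
ds_auto = xr.open_dataset("store.zarr", engine="zarr", chunks="auto")
```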

Comment on lines 594 to 615
if "auto_chunk" in kwargs:
auto_chunk = kwargs.pop("auto_chunk")
if auto_chunk:
chunks = "auto" # maintain backwards compatibility
else:
chunks = None

warnings.warn(
"auto_chunk is deprecated. Use chunks='auto' instead.",
FutureWarning,
stacklevel=2,
)
Member:

For now I would suggest keeping this compatibility code and not (yet) marking open_zarr as deprecated.

@weiji14 (Contributor, Author) replied Sep 17, 2020:

Ok I can revert the open_zarr deprecation warning. Edit: Done at 40c4d46.

Should I also keep the "auto_chunk" compatibility here?

doc/io.rst — outdated review thread, resolved
```diff
@@ -487,10 +489,40 @@ def maybe_decode_store(store, lock=False):
         )
         name_prefix = "open_dataset-%s" % token
         ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
         ds2._file_obj = ds._file_obj

+    elif engine == "zarr":
```
Member:

My main concern with this code is that introducing an entirely separate code-path inside open_dataset() for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increases greatly.

I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a get_chunk method on all xarray backends classes, even if it currently only returns a filler value and/or raises an error for chunks='auto'? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.

I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.
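A rough sketch of the idea being proposed (all names here are hypothetical, not the actual xarray backend API):

```python
class BackendStore:
    """Hypothetical backend base class, for illustration only."""

    def get_chunks(self, var):
        # default: the backend exposes no on-disk chunking information,
        # so chunks="auto" has nothing to align to
        raise NotImplementedError(
            f"{type(self).__name__} does not expose on-disk chunks"
        )


class ZarrBackendStore(BackendStore):
    def get_chunks(self, var):
        # zarr variables record their on-disk chunk shape in the
        # "chunks" encoding, which chunks="auto" can align to
        return dict(zip(var.dims, var.encoding["chunks"]))
```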

@weiji14 (Contributor, Author) replied Sep 17, 2020:

> My main concern with this code is that introducing an entirely separate code-path inside open_dataset() for chunking zarr in particular feels strange and a little unexpected. Any time we use totally separate code branches for some logic, the odds of introducing inconsistencies/bugs increases greatly.
>
> I wonder if we could consolidate this logic somehow in order to avoid adding a separate branch for the code here? For example, we could put a get_chunk method on all xarray backends classes, even if it currently only returns a filler value and/or raises an error for chunks='auto'? Chunking is not unique to zarr, e.g., netCDF4 files also have chunks, although the default "auto" chunking logic should probably be different.

Thanks for pointing this out! I agree completely that the open_dataset() function is overdue for a refactor; it was a nightmare to go through all the if-then branches. That said, the comprehensive test suite helped to catch most of the bugs, and I've tested this on my own real-world dataset, so the logic should be OK for now 🤞.

> I would be OK holding this off for a later clean-up, but this really would be worth doing eventually. CC @alexamici RE: the backends refactor.

Personally I would prefer to hold off, since this open_mfdataset PR (and its predecessor at #4003) has been sitting around for months, and I've had to resolve quite a few merge conflicts to keep it up to date. No point in contaminating this already complex PR with a refactor of the NetCDF logic either.

doc/whats-new.rst — outdated review thread, resolved
```python
# auto chunking needs to be here and not in ZarrStore because
# the variable chunks does not survive decode_cf
# return trivial case
if not chunks:  # e.g. chunks is 0, None or {}
```
Member:

Can we make this `if chunks is None` instead?

I know this is a discrepancy from how open_zarr() works today, but currently open_dataset(..., chunks={}) is a way to open a dataset with dask chunks equal to the full size of any arrays.

I doubt the (different) behavior of open_zarr in this case was intentional....
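The ambiguity comes from truthiness: `not chunks` conflates several falsy values that are documented to mean different things (a quick illustration):

```python
for chunks in (None, 0, {}):
    # all three are falsy, so `if not chunks` treats them identically,
    # even though chunks={} should mean "one dask chunk per full array"
    print(repr(chunks), "->", not chunks)
# None -> True
# 0 -> True
# {} -> True
```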

@weiji14 (Contributor, Author):

I'll try it locally and see if the tests break; I felt like there was a reason it had to be `not chunks`, but I can't remember the context now.

Member:

Thanks! Not a big deal if we have to push off this clean-up.

@weiji14 (Contributor, Author):

Yep, two tests would break with a ZeroDivisionError if we switch to `if chunks is None`. Specifically:

  • TestZarrDictStore.test_manual_chunk
  • TestZarrDirectoryStore.test_manual_chunk

Related to my comment hidden in the mess above at #4187 (comment) 😄

Member:

Well, I would expect test_manual_chunk to fail here: it is explicitly verifying that chunks=0 and chunks={} result in in-memory numpy arrays. Does it work if you remove those cases, e.g., by setting NO_CHUNKS = (None,)?

@weiji14 (Contributor, Author):

Yes, setting `NO_CHUNKS = (None,)` works with `if chunks is None`. Shall I make the change?

@weiji14 (Contributor, Author):

If it helps by the way, test_manual_chunk with NO_CHUNKS = (None, 0, {}) was added in #2530.

Member:

Yes, I think we can go ahead and change that. It doesn't look like that was carefully evaluated in #2530.

```diff
-    chunk for all arrays.
+    chunk for all arrays. When using ``engine="zarr"``, if ``chunks='auto'``,
+    dask chunks are created based on the variable's zarr chunks, and if
+    ``chunks=None``, zarr array data will lazily convert to numpy arrays upon
```
Member:

This behavior for chunks=None is the same for all backends. The only special behavior for zarr is chunks='auto'.

@weiji14 (Contributor, Author):

Ok, will update the docs.

Comment on lines 1618 to 1619
```python
NO_CHUNKS = (None,)
for no_chunk in NO_CHUNKS:
```
Member:

could you remove the loop here now?

@weiji14 (Contributor, Author):

Sure, I did think of that actually!
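A sketch of the simplification being suggested (the `open_kwargs` usage is hypothetical, standing in for whatever the test body does with `no_chunk`):

```python
# before: a one-element loop left over from NO_CHUNKS = (None, 0, {})
NO_CHUNKS = (None,)
for no_chunk in NO_CHUNKS:
    open_kwargs = {"chunks": no_chunk}

# after: drop the loop and pass chunks=None directly
open_kwargs = {"chunks": None}
```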

@shoyer merged commit 5654aee into pydata:master on Sep 22, 2020
@shoyer (Member) commented Sep 22, 2020:

Thanks @weiji14 and @Mikejmnez for your contribution!

@martindurant (Contributor) commented:
Note that zarr.open* now works with fsspec URLs (in master)
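For example, combined with this PR, a remote store can be opened via an fsspec mapper (a sketch; the bucket path is a placeholder and reading from S3 requires s3fs):

```python
import fsspec
import xarray as xr

# map a remote zarr store to a key-value interface zarr understands
mapper = fsspec.get_mapper("s3://my-bucket/data.zarr", anon=True)
ds = xr.open_dataset(mapper, engine="zarr")
```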

@dcherian (Contributor) commented:
Thanks @weiji14 and @Mikejmnez . This is a great contribution.

dcherian added a commit to dcherian/xarray that referenced this pull request Oct 9, 2020
…pagate-attrs

* 'propagate-attrs' of github.com:dcherian/xarray: (22 commits)
  silence sphinx warnings about broken rst (pydata#4448)
  Xarray open_mfdataset with engine Zarr (pydata#4187)
  Fix release notes formatting (pydata#4443)
  fix typo in io.rst (pydata#4250)
  Fix typo (pydata#4181)
  Fix release notes typo
  New whatsnew section
  Add notes re doctests (pydata#4440)
  Fixed dask.optimize on datasets (pydata#4438)
  Release notes for 0.16.1 (pydata#4435)
  Small updates to How-to-release + lint (pydata#4436)
  Fix doctests (pydata#4439)
  add a ci for doctests (pydata#4437)
  preserve original dimension, coordinate and variable order in ``concat`` (pydata#4419)
  Fix for h5py deepcopy issues (pydata#4426)
  Keep the original ordering of the coordinates (pydata#4409)
  Clearer Vectorized Indexing example (pydata#4433)
  Revert "Fix optimize for chunked DataArray (pydata#4432)" (pydata#4434)
  Fix optimize for chunked DataArray (pydata#4432)
  fix doc dataarray to netcdf (pydata#4424)
  ...
@aurghs mentioned this pull request on Oct 29, 2020 and Nov 19, 2020.
@weiji14 mentioned this pull request on Feb 19, 2023.
Labels: topic-backends, topic-zarr (related to zarr storage library)
Successfully merging this pull request may close these issues.

open_mfdataset: support for multiple zarr datasets