Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray.open_zarr to be deprecated #70

Closed
Mikejmnez opened this issue Apr 29, 2020 · 9 comments
Closed

xarray.open_zarr to be deprecated #70

Mikejmnez opened this issue Apr 29, 2020 · 9 comments

Comments

@Mikejmnez
Copy link

In the future, zarr files/stores will be opened with xarray.open_dataset as follows

ds = xarray.open_dataset(store, engine="zarr", ...)

Thus, (eventually) there needs to be a change on how intake-xarray will open zarr stores. Naively, this can be done by specifying open_dataset rather than open_zarr. This is, something like:

store = 'directoryA/subdirectory/zarr_store
self._mapper = fsspec.get_mapper(store, ...) 
_open = xr.open_dataset
self._ds = open(self._mapper, engine="zarr", ...)

However, xarray.open_dataset does not recognize output from fsspec.get_mapper. It works if store (as defined above) is passed.

On xarray.open_zarr, _mapper gets transformed into a ZarrStore and later decoded. This is, given the _mapper, the following will open the zarr store:

from xarray.backends.zarr import ZarrStore
from xarray import conventions
zarr_store = ZarrStore.open_group(_mapper, ...)
ds = conventions.decode_cf(zarr_store,...)

This brings two options IMHO:

  1. Drop using fsspec.get_mapper (not likely) and just pass the url/path as argument to xarray.open_dataset. (very unlikely), or
  2. Follow along the lines of the pseudo code above, and rather than import/use xarray.open_dataset directly, import ZarrStore.open_group and convenctions.decode_cf to open zarr stores.

It is my understanding, that zarr will potentially depend more on fsspec as it gets more developed, and thus No. 2 seems more likely.

Or, is there another secret option number 3 I fail to see?

pydata/xarray#4003
fsspec/filesystem_spec#286
zarr-developers/zarr-python#546

@martindurant
Copy link
Member

zarr-developers/zarr-python#546 will add the get_mapper route into zarr itself, so that you can pass URL and storage_options. This may take a while to merge and release, though.

However, I am surprised: I would think that with engine=zarr, xarray would pass through the mapper object to zarr, which knows of course what to do with it.

Is the CF decode the only step that xarray is adding?

@Mikejmnez
Copy link
Author

Good to know about the zarr development route. I can't wait to see the new behavior, although I understand the timeline...

Here is a better description of the problem with fs.get_mapper on intake-xarray

store = 'directoryA/subdirB/zarr_store'
_mapper = fsspec.get_mapper(store)
ds = xarray.open_dataset(_mapper, chunks="auto",...)

raises the error:

ValueError: can only read bytes of file-like objects with engine="scipy" or "h5netcdf"

Now, engine="zarr" is in the list of engines, the problem again is with _mapper within xarray.open_dataset. In xarray, the instance file_or_obj=_mapper needs to be recognized as either str, Path or AbstractDatastore in order for the open_dataset arguments to be passed to ZarrStore.open_group. This is where it fails. That is, all of

In []: isinstance(_mapper, str)
out []: False
In []: isinstance(_mapper, Path)
Out []: False
In []: isinstance(_mapper, AbstractDatastore)
Out []: False

At least one of the above needs to be True, or the ValueError above get triggered.

@martindurant
Copy link
Member

Right - I would think that when engine="zarr", there should be an additional case isinstance(obj, MutableMapping) after the datastore check (which is also, presumably, dict-like).

@Mikejmnez
Copy link
Author

Yes, that does it. That is

In []: from collections.abc import MutableMapping
In []: isinstance(mapper, MutableMapping)
Out []: True

Thanks @martindurant !

@Mikejmnez
Copy link
Author

With this sorted (and working), should I submit a PR to set Intake-xarray to use open_dataset when opening zarr stores (i have a working fork)?

@martindurant
Copy link
Member

Presumably yes, after xarray is released

@weiji14
Copy link
Collaborator

weiji14 commented Sep 22, 2020

FYI, pydata/xarray#4187 was merged (a continuation of pydata/xarray#4003) and should be available for the next release of xarray (0.16.2?). Note that xr.open_zarr function was not deprecated, but is kept around as a short alias for xr.open_dataset(store, engine="zarr").

Still, it might be preferable to switch intake-xarray to the new syntax, as multiple Zarr files can now be opened with xarray using xr.open_mfdataset(stores, engine="zarr"). I see that zarr-developers/zarr-python#546 was merged recently too, so someone with time might want to try and see how these different pieces can be fit together to improve intake-xarray 😉.

@martindurant
Copy link
Member

Right, so I believe xr.open_dataset(URL, engine="zarr") should now work, and we should make xr.open_mfdataset(URLs, engine="zarr") work (where URLs can be a list, or a glob-string). Then theres no reason to be creating mappers in intake-xarray any more; as you say, it just needs a little effort to plumb through things like storage_options.

@Mikejmnez
Copy link
Author

Awesome, I can take a look in the next couple of days and see what I can do. It shouldn't be much trouble, I hope. Thanks a lot @weiji14!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants