
Zarr consolidated #2559

Merged: 16 commits into pydata:master on Dec 4, 2018
Conversation

@rabernat (Contributor) commented Nov 20, 2018

This PR adds support for reading and writing of consolidated metadata in zarr stores.

  • Closes #2558 ("how to incorporate zarr's new open_consolidated method?")
  • Tests added
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
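For context, a minimal sketch of the usage this PR enables, using the keyword spelling the discussion below settles on; the path and data are hypothetical:

import numpy as np
import xarray as xr

ds = xr.Dataset({"foo": ("x", np.arange(5))})

# Write the store and consolidate all of its metadata into one key,
# so that opening it later needs one metadata read instead of many.
ds.to_zarr("example.zarr", consolidated=True)

# Read it back using the consolidated metadata.
ds2 = xr.open_zarr("example.zarr", consolidated=True)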

@pep8speaks commented Nov 20, 2018

Hello @rabernat! Thanks for updating the PR.

Line 240:80: E501 line too long (82 > 79 characters)

Comment last updated on December 4, 2018 at 19:34 UTC

Review thread on xarray/backends/api.py (outdated, resolved).

@rabernat requested a review from @jhamman on November 20, 2018.
@rabernat (Contributor, Author)

Ping @lilyminium for a review.

if consolidate:
    import zarr
    zarr.consolidate_metadata(store)
    # do we need to reload zstore now that we have consolidated?
Member:

would it make sense for zarr to handle this?

Contributor (Author):

What do you mean?

Member:

I meant reloading the zarr store automatically.

Contributor (Author):

I think that would be hard to achieve, and I'm not sure it's necessary. Frankly, I don't know why we return a store object from to_zarr at all.

Contributor:

zarr.consolidate_metadata returns the output of open_consolidated on the same store, so this is already happening.
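For reference, a sketch of that zarr-level behavior; the store path is hypothetical:

import zarr

store = zarr.DirectoryStore("example.zarr")  # hypothetical existing store

# Writes all group/array metadata into a single '.zmetadata' key and
# returns the result of zarr.open_consolidated() on the same store.
zgroup = zarr.consolidate_metadata(store)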

@rabernat (Contributor, Author)

Also need to add some version checks... this will only work with zarr > 2.2.

Review thread on doc/io.rst (resolved).

def __init__(self, zarr_group):

if consolidated or consolidate_on_close:
    if LooseVersion(zarr.__version__) <= '2.2':  # pragma: no cover
Member:

Reminder to update this version check too.

Contributor (Author):

Being more explicit about the version seems to fix the issue here. In the tests I have used the importorskip approach.
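For illustration, the importorskip pattern looks roughly like this in a test module; the exact minversion string is an assumption:

import pytest

# Skip the consolidated-metadata tests entirely when the installed zarr
# is too old to provide consolidate_metadata/open_consolidated.
zarr = pytest.importorskip("zarr", minversion="2.2.1")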

@rabernat (Contributor, Author)

Not sure I understand why there are tests failing now. The failing function is test_basic_compute.

https://travis-ci.org/pydata/xarray/jobs/460873430#L7489

At first glance, this does not appear to have anything to do with my PR. The relevant error is:


______________________________ test_basic_compute ______________________________
    def test_basic_compute():
        ds = Dataset({'foo': ('x', range(5)),
                      'bar': ('x', range(5))}).chunk({'x': 2})
        for get in [dask.threaded.get,
                    dask.multiprocessing.get,
                    dask.local.get_sync,
                    None]:
            with (dask.config.set(scheduler=get)
                  if LooseVersion(dask.__version__) >= LooseVersion('0.19.4')
                  else dask.config.set(get=get)
                  if LooseVersion(dask.__version__) >= LooseVersion('0.18.0')
                  else dask.set_options(get=get)):
>               ds.compute()
xarray/tests/test_dask.py:843: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xarray/core/dataset.py:597: in compute
    return new.load(**kwargs)
xarray/core/dataset.py:494: in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:390: in compute
    collections=collections)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:865: in get_scheduler
    return get_scheduler(scheduler=config.get('scheduler', None))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
get = None, scheduler = <function get at 0x7fc31d9ae048>, collections = None
cls = None
    def get_scheduler(get=None, scheduler=None, collections=None, cls=None):
        """ Get scheduler function
    
        There are various ways to specify the scheduler to use:
    
        1.  Passing in get= parameters (deprecated)
        2.  Passing in scheduler= parameters
        3.  Passing these into global configuration
        4.  Using defaults of a dask collection
    
        This function centralizes the logic to determine the right scheduler to use
        from those many options
        """
        if get is not None:
            if scheduler is not None:
                raise ValueError("Both get= and scheduler= provided.  Choose one")
            warn_on_get(get)
            return get
    
        if scheduler is not None:
>           if scheduler.lower() in named_schedulers:
E           AttributeError: 'function' object has no attribute 'lower'
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:854: AttributeError

@shoyer (Member) commented Nov 28, 2018

I bet this is due to the latest dask release (1.0). We can fix this in another PR.

@lilyminium (Contributor)

I remember dealing with this in my pull request -- if I recall correctly, scheduler was pointing to the scheduler.get function instead. It was a minor bug that was fixed in either the next xarray release (0.11.0) or the next Dask release (0.20.1).
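For anyone hitting the same traceback: in recent dask, the non-deprecated spelling passes a scheduler name through dask.config.set rather than a get function. A minimal sketch (scheduler name illustrative):

import dask
import dask.array as da

x = da.ones(10, chunks=5)

# Select a scheduler by name; the old get= keyword is deprecated, and
# passing a get function where a name is expected triggers errors like
# the AttributeError above.
with dask.config.set(scheduler="synchronous"):
    x.compute()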

@rabernat (Contributor, Author)

So if the test issues can be considered resolved, the only decision we need to make is about the API.

Do we prefer the current way:

ds.to_zarr(fname, consolidate=True)
xr.open_zarr(fname, consolidated=True)

or @shoyer's suggestion:

ds.to_zarr(fname, consolidated=True)
xr.open_zarr(fname, consolidated=True)

?

@martindurant (Contributor)

Will the default for both options be False for the time being?

@rabernat (Contributor, Author)

> Will the default for both options be False for the time being?

Yes.

@martindurant (Contributor)

Glad to see this happening, by the way. Once in, catalogs using intake-xarray can be updated, and I don't think the code will need to change.

@alimanfoo (Contributor)

Great to see this. On the API, FWIW I'd vote for using the same keyword (consolidated) in both; less burden on the user to remember what to use.

@rabernat (Contributor, Author)

Keywords are now all consolidated and all tests are go.

Ready to merge?

@jhamman (Member) left a review:

I think this is basically ready. I had a few small questions/comments, but this looks safe to merge soon.

@@ -36,6 +36,8 @@ Breaking changes
Enhancements
~~~~~~~~~~~~

- Ability to read and write consolidated metadata in zarr stores.
  By `Ryan Abernathey <https://github.com/rabernat>`_.
Member:

Can you reference the issue this is attached to: (:issue:`2558`)?


open_kwargs = dict(mode=mode, synchronizer=synchronizer, path=group)
if consolidated:
    # TODO: an option to pass the metadata_key keyword
Member:

Do we need to consider this TODO here?


open_kwargs = dict(mode=mode, synchronizer=synchronizer, path=group)
if consolidated:
    # TODO: an option to pass the metadata_key keyword
Member:

Anything to do here now?

Contributor (Author):

Do we feel that it's important to expose this functionality from within xarray? I don't.

Contributor:

I also don't.
I think it's ok for xarray to have an opinion on what the special key is called.

Contributor (Author):

I propose we just leave these TODOs here as is. If anyone ever needs this feature from the xarray side, this will help guide them on how to implement it.
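For reference, the feature behind these TODOs is zarr's metadata_key keyword. If it were ever exposed from xarray, the underlying zarr calls would look like this ('.zmetadata' is zarr's default, and the key xarray implicitly standardizes on; the store path is hypothetical):

import zarr

store = zarr.DirectoryStore("example.zarr")  # hypothetical store

# Both functions accept metadata_key; xarray currently uses the default.
zarr.consolidate_metadata(store, metadata_key=".zmetadata")
zgroup = zarr.open_consolidated(store, metadata_key=".zmetadata")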

@martindurant (Contributor)

LGTM.

Do you think there should be more explicit text on how to add consolidation to existing zarr/xarray datasets, rather than creating them with consolidation turned on?

We may also need some text around updating consolidated datasets, but that can maybe wait until we see what kind of usage people try.

@rabernat (Contributor, Author) commented Dec 4, 2018

> We may also need some text around updating consolidated datasets, but that can maybe wait until we see what kind of usage people try.

Since xarray cannot append to or modify existing zarr stores in place, this seems outside the scope of xarray for now. But maybe it is worth mentioning in the docs.
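As a sketch of what such a docs note might show, consolidation can be added to an already-written store with a one-off zarr call, after which xarray can open it the fast way (path hypothetical):

import xarray as xr
import zarr

# Consolidate the metadata of a store written without consolidation;
# the array data itself is not rewritten.
zarr.consolidate_metadata(zarr.DirectoryStore("old_store.zarr"))

ds = xr.open_zarr("old_store.zarr", consolidated=True)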

@jhamman (Member) commented Dec 4, 2018

I'm happy here... but Appveyor is not.

@shoyer (Member) commented Dec 4, 2018

@rabernat if you're ready, let's merge this.

The failures on Appveyor are unrelated (an issue with int32 and cftime).

@rabernat (Contributor, Author) commented Dec 4, 2018 via email

@shoyer shoyer merged commit 3ae93ac into pydata:master Dec 4, 2018
@rabernat (Contributor, Author) commented Dec 5, 2018

If anyone wants to see how awesome consolidated metadata is, you can try it in this binder:
https://github.com/rabernat/pangeo_ecco_examples/

I did a bit of lazy profiling here:
https://gist.github.com/rabernat/ce1fb414cf53541afe2245363b06c49d

Things that used to take ~40s now take ~1s. Since loading the data is one of the first steps in any Pangeo notebook, this is a huge improvement in usability.

Thanks to everyone who helped make it happen!

@martindurant (Contributor)

I like those timings.

dcherian pushed a commit to yohai/xarray that referenced this pull request on Dec 16, 2018.