
Zarr chunking fixes #5065

Merged
merged 16 commits on Apr 26, 2021
Conversation

rabernat
Contributor

@rabernat rabernat commented Mar 22, 2021

This PR contains two small, related updates to how Zarr chunks are handled.

  1. We now delete the encoding attribute at the Variable level whenever chunk is called. The persistence of chunk encoding has been the source of lots of confusion (see #2300 "zarr and xarray chunking compatibility and to_zarr performance", #4046 "automatic chunking of zarr archive", #4380 "Error when rechunking from Zarr store", and xcube-dev/xcube#347 "Writing to zarr fails with message 'specified zarr chunks would overlap multiple dask chunks'").
  2. Added a new option called safe_chunks in to_zarr which allows for bypassing the requirement of a many-to-one relationship between Zarr chunks and Dask chunks (see #5056 "Allow 'unsafe' mode for zarr writing").

Both these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.
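The many-to-one requirement that safe_chunks=False bypasses can be illustrated with a small stand-alone check. This is a simplified sketch, not xarray's actual implementation: along one dimension, every Dask chunk except the last must be an exact multiple of the Zarr chunk size, so that no two Dask tasks write into the same Zarr chunk.

```python
def dask_chunks_are_safe(dask_chunks, zarr_chunk):
    """Simplified sketch of the many-to-one rule along one dimension.

    Every Dask chunk except the last must be an exact multiple of the
    Zarr chunk size; otherwise two Dask tasks could write into the same
    Zarr chunk concurrently, corrupting data in the absence of locks.
    """
    return all(size % zarr_chunk == 0 for size in dask_chunks[:-1])

print(dask_chunks_are_safe((10, 10, 5), 5))  # True: each 10 spans two whole Zarr chunks
print(dask_chunks_are_safe((10, 7, 5), 5))   # False: the 7 straddles a Zarr chunk boundary
```

With safe_chunks=True (the default) a write failing this rule raises; safe_chunks=False skips the check and leaves correctness to the user.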

@pep8speaks

pep8speaks commented Mar 22, 2021

Hello @rabernat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-26 14:38:56 UTC

@rabernat
Contributor Author

rabernat commented Mar 22, 2021

Confused about the test error. It seems unrelated. In test_sparse.py:test_variable_method

E   TypeError: no implementation found for 'numpy.allclose' on types that implement __array_function__: [<class 'numpy.ndarray'>, <class 'sparse._coo.core.COO'>]

@andersy005
Member

Confused about the test error. It seems unrelated. In test_sparse.py:test_variable_method

E   TypeError: no implementation found for 'numpy.allclose' on types that implement __array_function__: [<class 'numpy.ndarray'>, <class 'sparse._coo.core.COO'>]

Related to #5059, and it appears that @keewis came up with a fix for it in #5059 (comment)

@rabernat
Contributor Author

Thanks, Anderson. Fixed by rebasing. Now the RTD build is failing, but there is no obvious error in the logs...

xarray/backends/zarr.py

Contributor

@dcherian dcherian left a comment

Thanks @rabernat, I only have some docstring suggestions.

xarray/core/dataset.py
xarray/core/dataset.py
    )
    if safe_chunks:
        raise ValueError(

Contributor

really minor comment: Shouldn't this still be NotImplementedError since we could technically support this by implementing locks?

Contributor Author

It used to be, but I changed it! Do we ever plan to implement locks?

doc/whats-new.rst
doc/whats-new.rst
@@ -1091,6 +1092,10 @@ def chunk(self, chunks={}, name=None, lock=False):

data = da.from_array(data, chunks, name=name, lock=lock, **kwargs)

# rechunking erases encoding
if self._encoding and "chunks" in self._encoding:
del self._encoding["chunks"]
Contributor

We should mention this in the docstrings for DataArray.chunk, Dataset.chunk, and Variable.chunk.

@shoyer
Member

shoyer commented Mar 24, 2021

I'm a little conflicted about dealing with encoding['chunks'] specifically in chunk():

  • On one hand, it feels inconsistent for only this single method in xarray to modify part of encoding. Nothing else in xarray (after CF decoding) does this. Effectively, encoding['chunks'] is now becoming a part of xarray's data model.
  • On the other hand, this would absolutely fix a recurrent pain-point for users, and in that sense it's worth doing.

Maybe this isn't such a big deal in this particular case, especially if we don't think we would need to add such encoding specific logic to any other methods. But are we really sure about that -- what about cases like indexing?

I guess the other alternative to make chunk() and various other methods that would change chunking drop encoding entirely. I don't know if this would really be a better comprehensive solution (I know dropping attrs is much hated), but at least it's an easier mental model.

@rabernat
Contributor Author

rabernat commented Mar 25, 2021

I see your point. I guess I don't fully understand where else in the code path encoding gets dropped. Consider this example

import xarray as xr
ds = xr.Dataset({'foo': ('time', [1, 1], {'dtype': 'int16'})})
ds = xr.decode_cf(ds).compute()
assert "dtype" in ds.foo.encoding
assert "dtype" not in (0.5 * ds.foo).encoding

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work? To me, .chunk feels like a similar case: an operation that invalidates any existing encoding.

@shoyer
Member

shoyer commented Mar 25, 2021

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work? To me, .chunk feels like a similar case: an operation that invalidates any existing encoding.

To be honest, the existing convention is quite adhoc, just based on what seemed most appropriate at the time.

#1614 is the most comprehensive description of the current state of things.

We were considering saying that attrs and encoding should always use the same rules, but perhaps we should be more aggressive about dropping encoding.

@dcherian
Contributor

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work?

There's a subtle difference: it drops all of .encoding, not dtype specifically.

@shoyer's point about indexing changing chunking is a good one too. Perhaps a kwarg in to_zarr like ignore_encoding_chunks?

@rabernat
Contributor Author

Perhaps a kwarg in to_zarr like ignore_encoding_chunks?

I would argue that this is unnecessary. If you want to explicitly drop encoding, just del da.encoding['chunks'] before writing. But most users don't figure out that they should do this, because the default behavior is counterintuitive.

The problem here is with the default behavior of propagating chunk encoding through computations when it no longer makes sense. My example with the dtype encoding illustrates that we already drop encoding on certain operations, so it's not unprecedented. It's more of an implementation question: where and how to do the dropping.

FWIW, I would also favor dropping encoding['chunks'] after indexing, coarsening, interpolating, etc. Basically anything that changes the array shape or chunk structure.
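The manual workaround described above can be sketched as follows. The Dataset and store calls appear only in comments (they need a real xarray object and Zarr store); the essential step is plain dict handling on a hypothetical per-variable encoding:

```python
# In practice the workaround looks roughly like this (sketch, assuming
# ds is an xarray.Dataset loaded from a Zarr store):
#     for var in ds.variables.values():
#         var.encoding.pop("chunks", None)
#     ds.to_zarr("out.zarr", mode="w")

# The core of the fix is removing the stale key from the per-variable
# encoding dict (hypothetical values shown):
encoding = {"chunks": (100,), "dtype": "int16"}
encoding.pop("chunks", None)  # safe even when "chunks" is absent
print(encoding)  # {'dtype': 'int16'}
```

Using pop with a default rather than del avoids a KeyError on variables that never had chunk encoding.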

@shoyer
Member

shoyer commented Mar 25, 2021

FWIW, I would also favor dropping encoding['chunks'] after indexing, coarsening, interpolating, etc. Basically anything that changes the array shape or chunk structure.

We already drop all of encoding after indexing. My guess is that we do the same for coarsening and interpolations as well (though I haven't checked).

@aurghs
Collaborator

aurghs commented Mar 26, 2021

Perhaps we could also remove overwrite_encoded_chunks; it shouldn't be necessary anymore.

@rabernat
Contributor Author

I appreciate the discussion on this PR. Does anyone have a concrete suggestion of what to do?

If we are not in agreement about the encoding stuff, perhaps I should remove that and just move forward with the safe_chunks part of this PR?

@rabernat
Contributor Author

rabernat commented Mar 31, 2021

In today's dev call, we proposed to handle encoding in chunk the same way we handle it in indexing: by deleting all encoding.

The problem is, I can't figure out where this happens. Can someone point me to the place in the code where indexing operations delete encoding?

A related question: I discovered this encoding option preferred_chunks, which is treated specially:

preferred_chunks = var.encoding.get("preferred_chunks", {})

Should the Zarr backend be setting this?

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Should the Zarr backend be setting this?

Yes, they are already defined in zarr: preferred_chunks=chunks. We decided to separate the chunks and the preferred_chunks:

  • The preferred_chunks is used by the backend to define the default chunks to be used by xarray.
  • The chunks are the on-disk chunks.

They are not necessarily the same.
Maybe we can drop the preferred_chunks after they are used.
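The split described above can be sketched with plain dicts (hypothetical helper, names, and values; not xarray's actual resolution code): the backend records both the on-disk chunks and a preferred_chunks hint, and default chunking falls back to the hint only where the user doesn't specify chunks.

```python
def resolve_chunks(user_chunks, encoding):
    """Hypothetical sketch of how preferred_chunks could feed defaults.

    An empty user request means "use the backend's hint"; explicit
    user chunks override the hint per dimension.
    """
    preferred = encoding.get("preferred_chunks", {})
    return {**preferred, **user_chunks}

# Backend-side encoding for one variable (hypothetical values):
encoding = {
    "chunks": (100, 50),                         # on-disk Zarr chunks
    "preferred_chunks": {"time": 100, "x": 50},  # hint for opening with dask
}

print(resolve_chunks({}, encoding))             # {'time': 100, 'x': 50}
print(resolve_chunks({"time": 200}, encoding))  # {'time': 200, 'x': 50}
```

The two keys can legitimately differ, e.g. when a backend wants xarray to open with larger dask chunks than the on-disk layout.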

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Variable.chunk is always called when you open data with dask, even if you are using the default chunking. So this way, the encoding would always be dropped whenever dask is used (≈ always).

@dcherian
Contributor

The problem is, I can't figure out where this happens.

Replace self._encoding with None here?

return type(self)(self.dims, data, self._attrs, self._encoding, fastpath=True)

@rabernat
Contributor Author

Replace self._encoding with None here?

Thanks! Yeah, that's what I had in mind. But I was wondering if there was an existing example of this elsewhere that I could copy.

In any case, I'll give it a try now.

@rabernat
Contributor Author

rabernat commented Mar 31, 2021

I just pushed a new commit which deletes all encoding inside variable.chunk(). But as you will see when the CI finishes, this leads to a lot of test failures. For example:

=============================================================================== FAILURES ================================================================================
____________________________________________________ TestNetCDF4ViaDaskData.test_roundtrip_string_encoded_characters ____________________________________________________

self = <xarray.tests.test_backends.TestNetCDF4ViaDaskData object at 0x18cba4c40>

    def test_roundtrip_string_encoded_characters(self):
        expected = Dataset({"x": ("t", ["ab", "cdef"])})
        expected["x"].encoding["dtype"] = "S1"
        with self.roundtrip(expected) as actual:
            assert_identical(expected, actual)
>           assert actual["x"].encoding["_Encoding"] == "utf-8"
E           KeyError: '_Encoding'

/Users/rpa/Code/xarray/xarray/tests/test_backends.py:485: KeyError

Why is chunk getting called here? Does it actually get called every time we load a dataset with chunks? If so, we will need a more sophisticated solution.

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Does it actually get called every time we load a dataset with chunks?

Yes

@rabernat
Contributor Author

So any ideas how to proceed? 🧐

@shoyer
Member

shoyer commented Mar 31, 2021

Hmm. I would also be happy with explicitly deleting chunks from encoding for now. It's not adding a lot of technical debt.

In the long term, the whole handling of encoding should be revisited, e.g., see #5082

Comment on lines 1095 to 1096
new_encoding = None # rechunking removes all encoding
return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
Member

a simpler way to achieve the same thing is just to omit the argument:

Suggested change:

-            new_encoding = None  # rechunking removes all encoding
-            return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
+            return type(self)(self.dims, data, self._attrs, fastpath=True)

@shoyer
Member

shoyer commented Mar 31, 2021

Why is chunk getting called here? Does it actually get called every time we load a dataset with chunks? If so, we will need a more sophisticated solution.

This happens specifically on this line:

var = var.chunk(chunks, name=name2, lock=lock)

So perhaps it would make sense to copy encoding specifically in this case, e.g.,

        new_var = var.chunk(chunks, name=name2, lock=lock)
        new_var.encoding = var.encoding
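This suggestion — let chunk() drop encoding, and have the backend open path re-attach it — can be illustrated with a minimal stand-in class (hypothetical, purely to show the restore pattern, not xarray's Variable):

```python
class Var:
    """Minimal stand-in for xarray's Variable (illustration only)."""

    def __init__(self, data, encoding=None):
        self.data = data
        self.encoding = encoding or {}

    def chunk(self):
        # rechunking returns a new object with encoding dropped
        return Var(self.data, encoding={})

var = Var([1, 2, 3], encoding={"chunks": (100,), "dtype": "int16"})
new_var = var.chunk()            # encoding is gone on the new object
new_var.encoding = var.encoding  # the backend re-attaches it on open
print(new_var.encoding["dtype"])  # int16
```

This keeps the general rule simple (chunking erases encoding) while preserving encoding in the one place tests rely on it: the dataset-opening code path.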

@rabernat
Contributor Author

rabernat commented Apr 7, 2021

I have removed the controversial encoding['chunks'] stuff from the PR. Now it only contains the safe_chunks option in to_zarr.

If there are no further comments on this, I think this is good to go.

@rabernat
Contributor Author

Any further feedback on this now reduced-scope PR? Merging this would be helpful for moving forward Pangeo forge.

Collaborator

@keewis keewis left a comment

A few documentation issues, but otherwise looks good to me. I don't know a lot about chunking and zarr, though.

xarray/backends/zarr.py
xarray/backends/zarr.py
xarray/core/dataset.py
xarray/backends/zarr.py
xarray/backends/zarr.py
Contributor

@dcherian dcherian left a comment

Thanks @rabernat

xarray/backends/zarr.py
xarray/backends/zarr.py
@rabernat
Contributor Author

The pre-commit workflow is raising a blackdoc error that I am not seeing in my local env:

diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst
index eb5c4d8..2bc3c1f 100644
--- a/doc/internals/duck-arrays-integration.rst
+++ b/doc/internals/duck-arrays-integration.rst
@@ -25,7 +25,7 @@ argument:
         ...
 
         def _repr_inline_(self, max_width):
-            """ format to a single line with at most max_width characters """
+            """format to a single line with at most max_width characters"""
             ...

@keewis
Collaborator

keewis commented Apr 26, 2021

The reason is that black released a new version yesterday, and since we don't pin black for the blackdoc entry, we get the new version. If you run pre-commit clean before pre-commit run --all-files, you should see this change locally, too. To avoid situations like these, we could start pinning black in the blackdoc entry (and run a script to synchronize it with the black entry on autoupdate).

@rabernat
Contributor Author

I think this PR has received a very thorough review. I would be pleased if someone from @pydata/xarray would merge it soon.

@dcherian
Contributor

Thanks @rabernat

Successfully merging this pull request may close these issues.

  • Allow "unsafe" mode for zarr writing
  • zarr and xarray chunking compatibility and to_zarr performance
7 participants