
Zarr chunking fixes #5065

Merged
merged 16 commits on Apr 26, 2021
Conversation

rabernat
Contributor

@rabernat rabernat commented Mar 22, 2021

This PR contains two small, related updates to how Zarr chunks are handled.

  1. We now delete the encoding attribute at the Variable level whenever chunk is called. The persistence of chunk encoding has been the source of lots of confusion (see #2300 "zarr and xarray chunking compatibility and to_zarr performance", #4046 "automatic chunking of zarr archive", #4380 "Error when rechunking from Zarr store", and xcube-dev/xcube#347 "Writing to zarr fails with message 'specified zarr chunks would overlap multiple dask chunks'").
  2. Added a new option called safe_chunks in to_zarr which allows for bypassing the requirement of a many-to-one relationship between Zarr chunks and Dask chunks (see #5056 "Allow 'unsafe' mode for zarr writing").

Both these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.
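The many-to-one requirement that safe_chunks=False bypasses can be illustrated with a small stand-alone check. This is a simplified sketch, not xarray's actual implementation: along one dimension, every Dask chunk except the last must be an exact multiple of the Zarr chunk size, so that no two Dask tasks write into the same Zarr chunk.

```python
def dask_chunks_are_safe(dask_chunks, zarr_chunk):
    """Simplified sketch of the many-to-one rule along one dimension.

    Every Dask chunk except the last must be an exact multiple of the
    Zarr chunk size; otherwise two Dask tasks could write into the same
    Zarr chunk concurrently, corrupting data in the absence of locks.
    """
    return all(size % zarr_chunk == 0 for size in dask_chunks[:-1])

print(dask_chunks_are_safe((10, 10, 5), 5))  # True: each 10 spans two whole Zarr chunks
print(dask_chunks_are_safe((10, 7, 5), 5))   # False: the 7 straddles a Zarr chunk boundary
```

With safe_chunks=True (the default) a write failing this rule raises; safe_chunks=False skips the check and leaves correctness to the user.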

@pep8speaks

pep8speaks commented Mar 22, 2021

Hello @rabernat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-26 14:38:56 UTC

@rabernat
Contributor Author

rabernat commented Mar 22, 2021

Confused about the test error. It seems unrelated. In test_sparse.py:test_variable_method

E   TypeError: no implementation found for 'numpy.allclose' on types that implement __array_function__: [<class 'numpy.ndarray'>, <class 'sparse._coo.core.COO'>]

@andersy005
Member

Confused about the test error. It seems unrelated. In test_sparse.py:test_variable_method

E   TypeError: no implementation found for 'numpy.allclose' on types that implement __array_function__: [<class 'numpy.ndarray'>, <class 'sparse._coo.core.COO'>]

Related to #5059, and it appears that @keewis came up with a fix for it in #5059 (comment)

@rabernat
Contributor Author

Thanks, Anderson. Fixed by rebasing. Now the RTD build is failing, but there is no obvious error in the logs...

xarray/backends/zarr.py

Contributor

@dcherian dcherian left a comment

Thanks @rabernat, I only have some docstring suggestions.

xarray/core/dataset.py
xarray/core/dataset.py
    )
    if safe_chunks:
        raise ValueError(

Contributor

really minor comment: Shouldn't this still be NotImplementedError since we could technically support this by implementing locks?

Contributor Author

It used to be, but I changed it! Do we ever plan to implement locks?

doc/whats-new.rst
doc/whats-new.rst
@@ -1091,6 +1092,10 @@ def chunk(self, chunks={}, name=None, lock=False):

data = da.from_array(data, chunks, name=name, lock=lock, **kwargs)

# rechunking erases encoding
if self._encoding and "chunks" in self._encoding:
del self._encoding["chunks"]
Contributor

We should mention this in the docstrings for DataArray.chunk, Dataset.chunk, and Variable.chunk.

@shoyer
Member

shoyer commented Mar 24, 2021

I'm a little conflicted about dealing with encoding['chunks'] specifically in chunk():

  • On one hand, it feels inconsistent for only this single method in xarray to modify part of encoding. Nothing else in xarray (after CF decoding) does this. Effectively, encoding['chunks'] is now becoming a part of xarray's data model.
  • On the other hand, this would absolutely fix a recurrent pain-point for users, and in that sense it's worth doing.

Maybe this isn't such a big deal in this particular case, especially if we don't think we would need to add such encoding specific logic to any other methods. But are we really sure about that -- what about cases like indexing?

I guess the other alternative to make chunk() and various other methods that would change chunking drop encoding entirely. I don't know if this would really be a better comprehensive solution (I know dropping attrs is much hated), but at least it's an easier mental model.

@rabernat
Contributor Author

rabernat commented Mar 25, 2021

I see your point. I guess I don't fully understand where else in the code path encoding gets dropped. Consider this example

import xarray as xr
ds = xr.Dataset({'foo': ('time', [1, 1], {'dtype': 'int16'})})
ds = xr.decode_cf(ds).compute()
assert "dtype" in ds.foo.encoding
assert "dtype" not in (0.5 * ds.foo).encoding

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work? To me, .chunk feels like a similar case: an operation that invalidates any existing encoding.

@shoyer
Member

shoyer commented Mar 25, 2021

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work? To me, .chunk feels like a similar case: an operation that invalidates any existing encoding.

To be honest, the existing convention is quite adhoc, just based on what seemed most appropriate at the time.

#1614 is the most comprehensive description of the current state of things.

We were considering saying that attrs and encoding should always use the same rules, but perhaps we should be more aggressive about dropping encoding.

@dcherian
Contributor

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work?

There's a subtle difference: it drops all of .encoding, not dtype specifically.

@shoyer's point about indexing changing chunking is a good one too. Perhaps a kwarg in to_zarr like ignore_encoding_chunks?

@rabernat
Contributor Author

Perhaps a kwarg in to_zarr like ignore_encoding_chunks?

I would argue that this is unnecessary. If you want to explicitly drop encoding, just del da.encoding['chunks'] before writing. But most users don't figure out that they should do this, because the default behavior is counterintuitive.

The problem here is with the default behavior of propagating chunk encoding through computations when it no longer makes sense. My example with the dtype encoding illustrates that we already drop encoding on certain operations, so it's not unprecedented. It's more of an implementation question: where and how to do the dropping.

FWIW, I would also favor dropping encoding['chunks'] after indexing, coarsening, interpolating, etc. Basically anything that changes the array shape or chunk structure.
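The manual workaround described above can be sketched as follows. The Dataset and store calls appear only in comments (they need a real xarray object and Zarr store); the essential step is plain dict handling on a hypothetical per-variable encoding:

```python
# In practice the workaround looks roughly like this (sketch, assuming
# ds is an xarray.Dataset loaded from a Zarr store):
#     for var in ds.variables.values():
#         var.encoding.pop("chunks", None)
#     ds.to_zarr("out.zarr", mode="w")

# The core of the fix is removing the stale key from the per-variable
# encoding dict (hypothetical values shown):
encoding = {"chunks": (100,), "dtype": "int16"}
encoding.pop("chunks", None)  # safe even when "chunks" is absent
print(encoding)  # {'dtype': 'int16'}
```

Using pop with a default rather than del avoids a KeyError on variables that never had chunk encoding.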

@shoyer
Member

shoyer commented Mar 25, 2021

FWIW, I would also favor dropping encoding['chunks'] after indexing, coarsening, interpolating, etc. Basically anything that changes the array shape or chunk structure.

We already drop all of encoding after indexing. My guess is that we do the same for coarsening and interpolations as well (though I haven't checked).

@aurghs
Collaborator

aurghs commented Mar 26, 2021

Perhaps we could also remove overwrite_encoded_chunks; it shouldn't be necessary anymore.

@rabernat
Contributor Author

I appreciate the discussion on this PR. Does anyone have a concrete suggestion of what to do?

If we are not in agreement about the encoding stuff, perhaps I should remove that and just move forward with the safe_chunks part of this PR?

@rabernat
Contributor Author

rabernat commented Mar 31, 2021

In today's dev call, we proposed to handle encoding in chunk the same way we handle it in indexing: by deleting all encoding.

The problem is, I can't figure out where this happens. Can someone point me to the place in the code where indexing operations delete encoding?

A related question: I discovered this encoding option preferred_chunks, which is treated specially:

preferred_chunks = var.encoding.get("preferred_chunks", {})

Should the Zarr backend be setting this?

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Should the Zarr backend be setting this?

Yes, they are already defined in zarr: preferred_chunks=chunks. We decided to separate the chunks and the preferred_chunks:

  • The preferred_chunks is used by the backend to define the default chunks to be used by xarray.
  • The chunks are the on-disk chunks.

They are not necessarily the same.
Maybe we can drop the preferred_chunks after they are used.
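The split described above can be sketched with plain dicts (hypothetical helper, names, and values; not xarray's actual resolution code): the backend records both the on-disk chunks and a preferred_chunks hint, and default chunking falls back to the hint only where the user doesn't specify chunks.

```python
def resolve_chunks(user_chunks, encoding):
    """Hypothetical sketch of how preferred_chunks could feed defaults.

    An empty user request means "use the backend's hint"; explicit
    user chunks override the hint per dimension.
    """
    preferred = encoding.get("preferred_chunks", {})
    return {**preferred, **user_chunks}

# Backend-side encoding for one variable (hypothetical values):
encoding = {
    "chunks": (100, 50),                         # on-disk Zarr chunks
    "preferred_chunks": {"time": 100, "x": 50},  # hint for opening with dask
}

print(resolve_chunks({}, encoding))             # {'time': 100, 'x': 50}
print(resolve_chunks({"time": 200}, encoding))  # {'time': 200, 'x': 50}
```

The two keys can legitimately differ, e.g. when a backend wants xarray to open with larger dask chunks than the on-disk layout.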

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Variable.chunk is always called when you open data with dask, even if you are using the default chunking. So this way, the encoding would always be dropped whenever dask is used (≈ always).

@dcherian
Contributor

The problem is, I can't figure out where this happens.

Replace self._encoding with None here?

return type(self)(self.dims, data, self._attrs, self._encoding, fastpath=True)

@rabernat
Contributor Author

Replace self._encoding with None here?

Thanks! Yeah, that's what I had in mind. But I was wondering if there was an existing example of this elsewhere that I could copy.

In any case, I'll give it a try now.

@rabernat
Contributor Author

rabernat commented Mar 31, 2021

I just pushed a new commit which deletes all encoding inside variable.chunk(). But as you will see when the CI finishes, this leads to a lot of test failures. For example:

=============================================================================== FAILURES ================================================================================
____________________________________________________ TestNetCDF4ViaDaskData.test_roundtrip_string_encoded_characters ____________________________________________________

self = <xarray.tests.test_backends.TestNetCDF4ViaDaskData object at 0x18cba4c40>

    def test_roundtrip_string_encoded_characters(self):
        expected = Dataset({"x": ("t", ["ab", "cdef"])})
        expected["x"].encoding["dtype"] = "S1"
        with self.roundtrip(expected) as actual:
            assert_identical(expected, actual)
>           assert actual["x"].encoding["_Encoding"] == "utf-8"
E           KeyError: '_Encoding'

/Users/rpa/Code/xarray/xarray/tests/test_backends.py:485: KeyError

Why is chunk getting called here? Does it actually get called every time we load a dataset with chunks? If so, we will need a more sophisticated solution.

@aurghs
Collaborator

aurghs commented Mar 31, 2021

Does it actually get called every time we load a dataset with chunks?

Yes

@rabernat
Contributor Author

So any ideas how to proceed? 🧐

@shoyer
Member

shoyer commented Mar 31, 2021

Hmm. I would also be happy with explicitly deleting chunks from encoding for now. It's not adding a lot of technical debt.

In the long term, the whole handling of encoding should be revisited, e.g., see #5082

Comment on lines 1095 to 1096
new_encoding = None # rechunking removes all encoding
return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
Member

a simpler way to achieve the same thing is just to omit the argument:

Suggested change:

-            new_encoding = None  # rechunking removes all encoding
-            return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
+            return type(self)(self.dims, data, self._attrs, fastpath=True)

@shoyer
Member

shoyer commented Mar 31, 2021

Why is chunk getting called here? Does it actually get called every time we load a dataset with chunks? If so, we will need a more sophisticated solution.

This happens specifically on this line:

var = var.chunk(chunks, name=name2, lock=lock)

So perhaps it would make sense to copy encoding specifically in this case, e.g.,

        new_var = var.chunk(chunks, name=name2, lock=lock)
        new_var.encoding = var.encoding
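This suggestion — let chunk() drop encoding, and have the backend open path re-attach it — can be illustrated with a minimal stand-in class (hypothetical, purely to show the restore pattern, not xarray's Variable):

```python
class Var:
    """Minimal stand-in for xarray's Variable (illustration only)."""

    def __init__(self, data, encoding=None):
        self.data = data
        self.encoding = encoding or {}

    def chunk(self):
        # rechunking returns a new object with encoding dropped
        return Var(self.data, encoding={})

var = Var([1, 2, 3], encoding={"chunks": (100,), "dtype": "int16"})
new_var = var.chunk()            # encoding is gone on the new object
new_var.encoding = var.encoding  # the backend re-attaches it on open
print(new_var.encoding["dtype"])  # int16
```

This keeps the general rule simple (chunking erases encoding) while preserving encoding in the one place tests rely on it: the dataset-opening code path.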

@rabernat
Contributor Author

rabernat commented Apr 7, 2021

I have removed the controversial encoding['chunks'] stuff from the PR. Now it only contains the safe_chunks option in to_zarr.

If there are no further comments on this, I think this is good to go.

@rabernat
Contributor Author

Any further feedback on this now reduced-scope PR? Merging this would be helpful for moving forward Pangeo forge.

Collaborator

@keewis keewis left a comment

A few documentation issues, but otherwise looks good to me. I don't know a lot about chunking and zarr, though.

xarray/backends/zarr.py
xarray/backends/zarr.py
xarray/core/dataset.py
xarray/backends/zarr.py
xarray/backends/zarr.py
Contributor

@dcherian dcherian left a comment

Thanks @rabernat

xarray/backends/zarr.py
xarray/backends/zarr.py
@rabernat
Contributor Author

The pre-commit workflow is raising a blackdoc error that I am not seeing in my local env:

diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst
index eb5c4d8..2bc3c1f 100644
--- a/doc/internals/duck-arrays-integration.rst
+++ b/doc/internals/duck-arrays-integration.rst
@@ -25,7 +25,7 @@ argument:
         ...
 
         def _repr_inline_(self, max_width):
-            """ format to a single line with at most max_width characters """
+            """format to a single line with at most max_width characters"""
             ...

@keewis
Collaborator

keewis commented Apr 26, 2021

The reason is that black released a new version yesterday, and since we don't pin black for the blackdoc entry, we get the new version. If you run pre-commit clean before pre-commit run --all-files, you should see this change locally, too. To avoid situations like these, we could start pinning black in the blackdoc entry (and run a script to synchronize it with the black entry on autoupdate).

@rabernat
Contributor Author

I think this PR has received a very thorough review. I would be pleased if someone from @pydata/xarray would merge it soon.

@dcherian
Contributor

Thanks @rabernat

Successfully merging this pull request may close these issues.

  • Allow "unsafe" mode for zarr writing
  • zarr and xarray chunking compatibility and to_zarr performance
7 participants