Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Metamess · 2023-08-14T09:45:19Z

Closes Dataset.chunk() does not overwrite encoding["chunks"] #8062
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR aims to fix the bug identified in Issue #8062 , where calling the .chunk() function on a DataArray or Dataset would change the chunking of the underlying data, but not change the "chunks" entry in the encoding attribute. This would lead to Datasets that cannot be written to file (zarr) without raising an error. The setting of the encoding attribute via the .chunk() function is part of the expected behavior based on the docstring of Dataset.to_zarr:

Zarr chunks are determined in the following way:
- From the chunks attribute in each variable's encoding (can be set via Dataset.chunk).

…ncoding attribute

welcome · 2023-08-14T09:45:22Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

Metamess · 2023-08-14T11:44:53Z

Through the tests in test_backend.py I encountered an error in the zarr library when setting attempting to write a Dataset with a dimension of length 0. While the size of the Dask chunk can be 0, the value in the encoding attribute can not. In the zarr code, it can be found that they default to a chunksize of 1 when the dimension has length 0.

It can be debated whether encoding = {"chunks": (0, )} should be a valid value and Zarr should support this, or if 0 is not a sensible value and xarray should set the value to 1 in these cases. For now I have chosen the latter, and edited the code in _maybe_await() to set the value in encoding to 1 if the given chunksize is 0. Please let me know if this would be considered the desired behavior!

Metamess · 2023-08-14T12:00:37Z

There is still one test that fails, and which actually challenges the current state of the proposed fix:
The way the test test_backends.py::TestZarrDictStore::test_chunk_encoding_with_dask is written, it is clear that here it is assumed that calling .chunk() does not set any encoding. I did not want to overrule this by rewriting this test without further input.

From my current understanding, I see three ways forward:

chunk() should always set encoding, as implied by the docstring of to_zarr. The test should be rewritten with the assumption that encoding is set by chunk(). This option could be acceptable, unless use cases are identified where setting the encoding when calling chunks() is undesirable.
chunk() should not set encoding. The to_zarr docstrings should be updated to no longer suggest this. Ideally, a new function is added to Dataset and DataArray to set the "chunks" in the encoding. This sounds like the least desirable option to me, as it actively creates situations where changing the chunksizes on a Dataset leaves the change only half done.
chunk() should get another optional keyword argument, equal to or similar to overwrite_encoded_chunks of _maybe_await, that is passed on to _maybe_await and influences whether or not the encoding attribute is changed when calling chunk(). I would suggest the default value of this argument to be True, since this seems to be the commonly desirable case. This option would keep supporting the assumptions present in the test, and keep support for use cases where changing the encoding attribute is considered undesirable. Would change the signature of the chunk() function.

In all cases, I would suggest to add a line to the docstring of the chunk() function to explicitly specify what happens to the encoding attribute.

max-sixty · 2023-09-30T19:11:09Z

@pydata/xarray I don't know encoding well, but what should we do next on this?

max-sixty · 2023-10-14T20:18:51Z

Does someone have a clear vision of how we're evolving the encoding attribute? I'm regretfully out of the loop here, and I worry we lose the energy of folks such as @Metamess without offering direction...

dcherian · 2023-10-14T20:39:05Z

Yeah unfortunately we are all agreed on deleting encoding ASAP. So this should be closed.

Thanks for trying @Metamess I'm sorry that your first PR can't make it through. We hope to see you back soon!

pydata#8062 Dataset.chunk() and DataArray.chunk() now correctly set e…

abbbdf8

…ncoding attribute

Metamess mentioned this pull request Aug 14, 2023

Dataset.chunk() does not overwrite encoding["chunks"] #8062

Closed

4 tasks

fixed test and mypy complaint

2ef1207

Metamess force-pushed the issue-8062-chunk-encoding branch from fc288f1 to 2ef1207 Compare August 14, 2023 11:38

TomNicholas added topic-zarr Related to zarr storage library topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) labels Aug 14, 2023

This comment was marked as outdated.

Sign in to view

Merge branch 'main' into issue-8062-chunk-encoding

1778076

github-actions bot removed the topic-zarr Related to zarr storage library label Sep 30, 2023

max-sixty added the needs review label Oct 14, 2023

dcherian closed this Oct 14, 2023

max-sixty mentioned this pull request Oct 25, 2023

propagation of encoding #6323

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Metamess commented Aug 14, 2023 •

edited

Loading

welcome bot commented Aug 14, 2023

Metamess commented Aug 14, 2023

Metamess commented Aug 14, 2023

This comment was marked as outdated.

max-sixty commented Sep 30, 2023

max-sixty commented Oct 14, 2023

dcherian commented Oct 14, 2023

Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Conversation

Metamess commented Aug 14, 2023 • edited Loading

welcome bot commented Aug 14, 2023

Metamess commented Aug 14, 2023

Metamess commented Aug 14, 2023

This comment was marked as outdated.

max-sixty commented Sep 30, 2023

max-sixty commented Oct 14, 2023

dcherian commented Oct 14, 2023

Metamess commented Aug 14, 2023 •

edited

Loading