Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset.chunk() and DataArray.chunk() now set encoding attribute #8069

Closed
wants to merge 3 commits into from

Conversation

Metamess
Copy link
Contributor

@Metamess Metamess commented Aug 14, 2023

This PR aims to fix the bug identified in Issue #8062 , where calling the .chunk() function on a DataArray or Dataset would change the chunking of the underlying data, but not change the "chunks" entry in the encoding attribute. This would lead to Datasets that cannot be written to file (zarr) without raising an error. The setting of the encoding attribute via the .chunk() function is part of the expected behavior based on the docstring of Dataset.to_zarr:

Zarr chunks are determined in the following way:
- From the chunks attribute in each variable's encoding (can be set via Dataset.chunk).

@welcome
Copy link

welcome bot commented Aug 14, 2023

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@Metamess
Copy link
Contributor Author

Through the tests in test_backend.py I encountered an error in the zarr library when setting attempting to write a Dataset with a dimension of length 0. While the size of the Dask chunk can be 0, the value in the encoding attribute can not. In the zarr code, it can be found that they default to a chunksize of 1 when the dimension has length 0.

It can be debated whether encoding = {"chunks": (0, )} should be a valid value and Zarr should support this, or if 0 is not a sensible value and xarray should set the value to 1 in these cases. For now I have chosen the latter, and edited the code in _maybe_await() to set the value in encoding to 1 if the given chunksize is 0. Please let me know if this would be considered the desired behavior!

@Metamess
Copy link
Contributor Author

There is still one test that fails, and which actually challenges the current state of the proposed fix:
The way the test test_backends.py::TestZarrDictStore::test_chunk_encoding_with_dask is written, it is clear that here it is assumed that calling .chunk() does not set any encoding. I did not want to overrule this by rewriting this test without further input.

From my current understanding, I see three ways forward:

  • chunk() should always set encoding, as implied by the docstring of to_zarr. The test should be rewritten with the assumption that encoding is set by chunk(). This option could be acceptable, unless use cases are identified where setting the encoding when calling chunks() is undesirable.
  • chunk() should not set encoding. The to_zarr docstrings should be updated to no longer suggest this. Ideally, a new function is added to Dataset and DataArray to set the "chunks" in the encoding. This sounds like the least desirable option to me, as it actively creates situations where changing the chunksizes on a Dataset leaves the change only half done.
  • chunk() should get another optional keyword argument, equal to or similar to overwrite_encoded_chunks of _maybe_await, that is passed on to _maybe_await and influences whether or not the encoding attribute is changed when calling chunk(). I would suggest the default value of this argument to be True, since this seems to be the commonly desirable case. This option would keep supporting the assumptions present in the test, and keep support for use cases where changing the encoding attribute is considered undesirable. Would change the signature of the chunk() function.

In all cases, I would suggest to add a line to the docstring of the chunk() function to explicitly specify what happens to the encoding attribute.

@TomNicholas TomNicholas added topic-zarr Related to zarr storage library topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) labels Aug 14, 2023
@TomNicholas

This comment was marked as outdated.

@github-actions github-actions bot removed the topic-zarr Related to zarr storage library label Sep 30, 2023
@max-sixty
Copy link
Collaborator

@pydata/xarray I don't know encoding well, but what should we do next on this?

@max-sixty
Copy link
Collaborator

Does someone have a clear vision of how we're evolving the encoding attribute? I'm regretfully out of the loop here, and I worry we lose the energy of folks such as @Metamess without offering direction...

@dcherian
Copy link
Contributor

Yeah unfortunately we are all agreed on deleting encoding ASAP. So this should be closed.

Thanks for trying @Metamess I'm sorry that your first PR can't make it through. We hope to see you back soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs review topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset.chunk() does not overwrite encoding["chunks"]
4 participants