Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

propagation of encoding #6323

Open
keewis opened this issue Mar 3, 2022 · 8 comments
Open

propagation of encoding #6323

keewis opened this issue Mar 3, 2022 · 8 comments

Comments

@keewis
Copy link
Collaborator

keewis commented Mar 3, 2022

What is your issue?

We frequently get bug reports related to encoding that can usually be fixed by clearing it or by overriding it using the encoding parameter of the to_* methods, e.g.

There are also a few discussions with more background:

We discussed this in the meeting yesterday and as far as I can remember agreed that the current default behavior is not ideal and decided to investigate #5336: a keep_encoding option, similar to keep_attrs, that would be True (propagate encoding) by default but will be changed to False (drop encoding on any operation) in the future.

cc @rabernat, @shoyer

@jhamman
Copy link
Member

jhamman commented Mar 27, 2023

See also #7686. The ideas presented here are also great!

@jhamman
Copy link
Member

jhamman commented Mar 31, 2023

This issue was discussed at this week's dev meeting. I will summarize what we discussed:

  1. General agreement that propagating encoding through arbitrary operations (e.g. slice, chunk, computation) leads to inconsistent states that are hard to protect against. This often leads to problems when serializing datasets in our backends.
  2. The primary benefit of keeping encoding on Xarray objects is the ability to exactly roundtrip datasets. However, this benefit is less obvious after a dataset has been modified.
  3. We currently have two APIs for setting encoding (e.g. to_netcdf(..., encoding={...}) and ds.encoding = {...}). We should change this by deprecating setting encoding on Xarray objects using the .encoding property.
  4. We can move towards providing utilities that expose a dataset's source encoding (e.g. open_dataset(..., return_encoding=True).

Specific action items that can happen now:

Longer term action items:

  • add option to backend readers to keep / discard interpreted encoding attributes
  • disable all encoding propagation by discarding encoding attributes once a Dataset has been modified.

@rabernat
Copy link
Contributor

We should also consider a configuration option to automatically drop encoding.

@klindsay28
Copy link

In the hypothetical invocation open_dataset(..., return_encoding=True), do you envision the returned encoding as being a separate returned object, or would it still be an attribute on the Dataset object?
I'm guessing the latter, because the subsequent statement 'disable all encoding propagation by discarding encoding attributes once a Dataset has been modified' doesn't make much sense to me for the former.
If so, after encoding attributes are discarded, would there still be an encoding attribute on the Dataset object that the user could reset to the values prior to the Dataset modification? This would enable the user to propagate encoding values through their workflow.

@shoyer
Copy link
Member

shoyer commented Apr 5, 2023

In the hypothetical invocation open_dataset(..., return_encoding=True), do you envision the returned encoding as being a separate returned object, or would it still be an attribute on the Dataset object?

My expectation was that this would be a separate object, e.g., dataset, encoding = xarray.open_dataset(..., return_encoding=True), where encoding is a dict providing the encoding on each variable, and which could be passed as the encoding argument into to_netcdf(). That said, I can see how keeping encoding as variable attributes could also be convenient.

"disable all encoding propagation by discarding encoding attributes once a Dataset has been modified" would be an intermediate step, on the route to removing encoding from Xarray's data model entirely entirely.

(As a side note, I would probably spell this as open_dataset_with_encoding rather than having a function with a variable return signature.)

@klindsay28
Copy link

In a future where encoding has been removed from Xarray's data model entirely, would open_dataset_with_encoding, or whatever name gets settled on, still exist? It's not clear to me if removal from the data model means just removing it from Xarray's data structures, or if it also means removing it from Xarray's APIs.

@Metamess
Copy link
Contributor

My expectation was that this would be a separate object, e.g., dataset, encoding = xarray.open_dataset(..., return_encoding=True), where encoding is a dict providing the encoding on each variable, and which could be passed as the encoding argument into to_netcdf(). That said, I can see how keeping encoding as variable attributes could also be convenient.

For your consideration, I would like to posit the following use case:
In some part of a larger application, new datasets are created through various means. Such a Dataset might move through any number of functions within the application, being passed either as a source of data, or with the intention to be transformed or appended to in some way. Finally, this Dataset reaches a function where it is written to a file. In any of the steps, from the creation to the intermediate processing to the final write stage, some encoding properties might be determined and added to the Dataset.

From this point of view, the encoding settings of the Dataset is logically an attribute of the Dataset and its elements. It would also be a pain (and lead to a degradation in code quality) to have to add a dataset_encoding parameter to all of these functions, and to modify all their return type signatures to a tuple of Dataset, dict, just to make sure the encoding gets propagated alongside the Dataset.

"disable all encoding propagation by discarding encoding attributes once a Dataset has been modified" would be an intermediate step, on the route to removing encoding from Xarray's data model entirely entirely.

My two cents: As a user, I would not expect arbitrary functions applied to a Dataset to also remove all encoding attributes. In fact, it would probably send me on a debug journey to figure out how, why and when my Dataset suddenly lost all the encoding settings I had added to it. Arguably, the clearing of encoding would be a side-effect, and one that most operations should not have.

If I understand correctly, in the end the properties stored in the encoding attribute are meant for a backend function/library that will write the Dataset to a file (like Zarr, or NetCDF, or even some custom format through a self defined function). The actual effect of these properties come from the meaning that these backends assign to them. Therefore I would not, as Xarray, make assumptions about what functions invalidate what properties of the encoding attribute, but leave this to the user. So perhaps a reasonable approach could be to let the encoding attribute exist, but to not have any Xarray functions add, delete or modify them. If a user performs a function that impacts the encoding, they should fix those values before attempting to write to a file. (For these purposes, I would consider functions like open_zarr to be 'backend' functions, which can add properties to the encoding attribute matching those that their to_file counterpart would ingest)

As long as the documentation is clear on this behavior, I believe anyone encountering encoding related issues should be able to figure out that they have to fix the encoding attributes causing the issue.

I hope this is a helpful contribution to the discussion :)

@max-sixty
Copy link
Collaborator

max-sixty commented Oct 25, 2023

Even before going through the items in #6323 (comment) — would it make sense to at least remove the old encoding attribute on .chunk? Or at least encoding["chunks"]?

(Possible we could have pushed #8069 towards this, rather than setting a new encoding attribute? Thanks again to @Metamess for starting that PR...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants