propagation of `encoding` #6323

keewis · 2022-03-03T12:57:29Z

What is your issue?

We frequently get bug reports related to encoding that can usually be fixed by clearing it or by overriding it using the encoding parameter of the to_* methods, e.g.

There are also a few discussions with more background:

We discussed this in the meeting yesterday and as far as I can remember agreed that the current default behavior is not ideal and decided to investigate #5336: a keep_encoding option, similar to keep_attrs, that would be True (propagate encoding) by default but will be changed to False (drop encoding on any operation) in the future.
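To make the proposed flag concrete, here is a minimal sketch using a toy container rather than xarray's own classes. `ToyVariable` and `apply` are hypothetical names invented for illustration, and `keep_encoding` is the option proposed in this thread, not an existing xarray setting.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVariable:
    """Hypothetical stand-in for an xarray variable: values plus an encoding dict."""

    values: list
    encoding: dict = field(default_factory=dict)


def apply(var, func, keep_encoding=True):
    """Apply func to every value; copy or drop the encoding depending on the flag."""
    new_values = [func(v) for v in var.values]
    new_encoding = dict(var.encoding) if keep_encoding else {}
    return ToyVariable(new_values, new_encoding)


temps = ToyVariable([280.5, 281.0], encoding={"dtype": "int16", "scale_factor": 0.1})

# Propagate encoding (the proposed initial default).
kept = apply(temps, lambda v: v - 273.15)
# Drop encoding on any operation (the proposed future default).
dropped = apply(temps, lambda v: v - 273.15, keep_encoding=False)

print(kept.encoding)     # {'dtype': 'int16', 'scale_factor': 0.1}
print(dropped.encoding)  # {}
```

The sketch also hints at why propagation is fragile: after the unit conversion, the copied `scale_factor` may no longer suit the new value range.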

cc @rabernat, @shoyer


jhamman · 2023-03-27T20:06:07Z

See also #7686. The ideas presented here are also great!

jhamman · 2023-03-31T15:14:59Z

This issue was discussed at this week's dev meeting. I will summarize what we discussed:

General agreement that propagating encoding through arbitrary operations (e.g. slice, chunk, computation) leads to inconsistent states that are hard to protect against. This often leads to problems when serializing datasets in our backends.
The primary benefit of keeping encoding on Xarray objects is the ability to exactly roundtrip datasets. However, this benefit is less obvious after a dataset has been modified.
We currently have two APIs for setting encoding (e.g. to_netcdf(..., encoding={...}) and ds.encoding = {...}). We should change this by deprecating setting encoding on Xarray objects using the .encoding property.
We can move towards providing utilities that expose a dataset's source encoding (e.g. open_dataset(..., return_encoding=True)).

Specific action items that can happen now:

add reset_encoding to the Dataset/DataArray API (Add reset_encoding to Dataset and DataArray objects #7686)
add a DeprecationWarning to the @property.setter for encoding on Dataset/DataArray/Variable
document the change in a callout in the Xarray user guide.
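The second action item can be sketched with the standard `warnings` module. This is only an illustration of the pattern: `ToyDataset` is a hypothetical stand-in, and the warning text is made up here rather than taken from xarray.

```python
import warnings


class ToyDataset:
    """Hypothetical stand-in for Dataset; only the encoding property is modeled."""

    def __init__(self):
        self._encoding = {}

    @property
    def encoding(self):
        return self._encoding

    @encoding.setter
    def encoding(self, value):
        # stacklevel=2 attributes the warning to the caller's line, not this one.
        warnings.warn(
            "Setting .encoding directly is deprecated; pass encoding=... "
            "to the to_* method instead.",  # illustrative wording only
            DeprecationWarning,
            stacklevel=2,
        )
        self._encoding = dict(value)


ds = ToyDataset()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ds.encoding = {"unlimited_dims": ["time"]}

print(caught[0].category.__name__)  # DeprecationWarning
```

Note that a setter-level warning only catches assignment. In-place mutation such as `ds.encoding["key"] = value` goes through the getter and would not warn under this pattern.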

Longer term action items:

add option to backend readers to keep / discard interpreted encoding attributes
disable all encoding propagation by discarding encoding attributes once a Dataset has been modified.

rabernat · 2023-03-31T15:31:55Z

We should also consider a configuration option to automatically drop encoding.

klindsay28 · 2023-04-05T02:46:02Z

In the hypothetical invocation open_dataset(..., return_encoding=True), do you envision the returned encoding as being a separate returned object, or would it still be an attribute on the Dataset object?
I'm guessing the latter, because the subsequent statement 'disable all encoding propagation by discarding encoding attributes once a Dataset has been modified' doesn't make much sense to me for the former.
If so, after encoding attributes are discarded, would there still be an encoding attribute on the Dataset object that the user could reset to the values prior to the Dataset modification? This would enable the user to propagate encoding values through their workflow.

shoyer · 2023-04-05T04:49:34Z

> In the hypothetical invocation open_dataset(..., return_encoding=True), do you envision the returned encoding as being a separate returned object, or would it still be an attribute on the Dataset object?

My expectation was that this would be a separate object, e.g., dataset, encoding = xarray.open_dataset(..., return_encoding=True), where encoding is a dict providing the encoding on each variable, and which could be passed as the encoding argument into to_netcdf(). That said, I can see how keeping encoding as variable attributes could also be convenient.

"disable all encoding propagation by discarding encoding attributes once a Dataset has been modified" would be an intermediate step, on the route to removing encoding from Xarray's data model entirely.

(As a side note, I would probably spell this as open_dataset_with_encoding rather than having a function with a variable return signature.)
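The separate-object approach described above can be sketched with plain dictionaries. Everything here is a toy: `split_encoding` and `toy_write` are hypothetical helpers, and the nested-dict layout is an assumption made for illustration, not how xarray stores variables.

```python
def split_encoding(variables):
    """Return (variables with empty encoding, {name: encoding}) from a toy mapping.

    `variables` maps a name to {"data": ..., "encoding": {...}}.
    """
    stripped = {
        name: {"data": var["data"], "encoding": {}} for name, var in variables.items()
    }
    encoding = {name: dict(var["encoding"]) for name, var in variables.items()}
    return stripped, encoding


def toy_write(variables, encoding=None):
    """Hypothetical writer: uses only the encoding that is passed explicitly."""
    encoding = encoding or {}
    return {name: encoding.get(name, {}) for name in variables}


source = {
    "t2m": {"data": [280.5, 281.0], "encoding": {"dtype": "int16", "scale_factor": 0.1}},
    "time": {"data": [0, 1], "encoding": {"units": "hours since 2000-01-01"}},
}

# The encoding travels as its own object, separate from the data.
dataset, encoding = split_encoding(source)
written = toy_write(dataset, encoding=encoding)

print(written["time"])  # {'units': 'hours since 2000-01-01'}
```

In this arrangement the data object never carries encoding, so no operation on it can leave stale settings behind; the cost is that the caller must keep track of the second object.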

klindsay28 · 2023-04-05T13:06:12Z

In a future where encoding has been removed from Xarray's data model entirely, would open_dataset_with_encoding, or whatever name gets settled on, still exist? It's not clear to me if removal from the data model means just removing it from Xarray's data structures, or if it also means removing it from Xarray's APIs.

Metamess · 2023-08-15T10:30:58Z

> My expectation was that this would be a separate object, e.g., dataset, encoding = xarray.open_dataset(..., return_encoding=True), where encoding is a dict providing the encoding on each variable, and which could be passed as the encoding argument into to_netcdf(). That said, I can see how keeping encoding as variable attributes could also be convenient.

For your consideration, I would like to posit the following use case:
In some part of a larger application, new datasets are created through various means. Such a Dataset might move through any number of functions within the application, being passed either as a source of data, or with the intention to be transformed or appended to in some way. Finally, this Dataset reaches a function where it is written to a file. In any of the steps, from the creation to the intermediate processing to the final write stage, some encoding properties might be determined and added to the Dataset.

From this point of view, the encoding settings of the Dataset are logically an attribute of the Dataset and its elements. It would also be a pain (and lead to a degradation in code quality) to have to add a dataset_encoding parameter to all of these functions, and to modify all their return type signatures to a tuple of Dataset, dict, just to make sure the encoding gets propagated alongside the Dataset.

> "disable all encoding propagation by discarding encoding attributes once a Dataset has been modified" would be an intermediate step, on the route to removing encoding from Xarray's data model entirely.

My two cents: As a user, I would not expect arbitrary functions applied to a Dataset to also remove all encoding attributes. In fact, it would probably send me on a debug journey to figure out how, why and when my Dataset suddenly lost all the encoding settings I had added to it. Arguably, the clearing of encoding would be a side-effect, and one that most operations should not have.

If I understand correctly, in the end the properties stored in the encoding attribute are meant for a backend function/library that will write the Dataset to a file (like Zarr, or NetCDF, or even some custom format through a self-defined function). The actual effect of these properties comes from the meaning that these backends assign to them. Therefore I would not, as Xarray, make assumptions about which functions invalidate which properties of the encoding attribute, but leave this to the user. So perhaps a reasonable approach could be to let the encoding attribute exist, but to not have any Xarray functions add, delete or modify its contents. If a user performs a function that impacts the encoding, they should fix those values before attempting to write to a file. (For these purposes, I would consider functions like open_zarr to be 'backend' functions, which can add properties to the encoding attribute matching those that their to_file counterpart would ingest.)

As long as the documentation is clear on this behavior, I believe anyone encountering encoding related issues should be able to figure out that they have to fix the encoding attributes causing the issue.

I hope this is a helpful contribution to the discussion :)

max-sixty · 2023-10-25T23:20:15Z

Even before going through the items in #6323 (comment) — would it make sense to at least remove the old encoding attribute on .chunk? Or at least encoding["chunks"]?

(Possible we could have pushed #8069 towards this, rather than setting a new encoding attribute? Thanks again to @Metamess for starting that PR...)
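As a rough illustration of the narrower suggestion (dropping only encoding["chunks"] when rechunking), here is a toy sketch. `rechunk` and the dict layout are hypothetical and do not reflect how `.chunk` is implemented in xarray.

```python
def rechunk(var, chunks):
    """Toy rechunk: record the new chunks and drop the stale on-disk chunk hint."""
    encoding = {key: val for key, val in var["encoding"].items() if key != "chunks"}
    return {"data": var["data"], "chunks": chunks, "encoding": encoding}


var = {
    "data": list(range(8)),
    "chunks": (8,),
    "encoding": {"chunks": (8,), "dtype": "int32"},
}

rechunked = rechunk(var, chunks=(2,))

# Only the chunk hint is removed; unrelated encoding keys survive.
print(rechunked["encoding"])  # {'dtype': 'int32'}
```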

keewis mentioned this issue Mar 3, 2022

'numpy.datetime64' object has no attribute 'year' writing from grib2 source fsspec/kerchunk#130

Open

jhamman mentioned this issue Dec 14, 2022

'open_mfdataset' zarr zip timestamp issue #7354

Open

4 tasks

keewis mentioned this issue Mar 27, 2023

Add reset_encoding to Dataset and DataArray objects #7686

Closed

jhamman mentioned this issue Apr 3, 2023

deprecate encoding setters #7708

Open

4 tasks

kmuehlbauer mentioned this issue Apr 6, 2023

Implement more Variable Coders #7719

Merged

1 task

kmuehlbauer mentioned this issue Apr 28, 2023

Writing and reopening introduces bad values #5739

Closed

dcherian mentioned this issue Aug 14, 2023

Dataset.chunk() does not overwrite encoding["chunks"] #8062

Closed

4 tasks

kmuehlbauer mentioned this issue Sep 1, 2023

Using xr.to_netcdf after xr.concat introduces max/min constraints based on first file #8135

Closed

4 tasks

kmuehlbauer mentioned this issue Sep 18, 2023

Differences on datetime values appears after writing reindexed variable on netCDF file #1064

Closed

This was referenced Nov 9, 2023

Error when rechunking from Zarr store #4380

Closed

Error when writing string coordinate variables to zarr #3476

Open

kmuehlbauer mentioned this issue Feb 5, 2024

Error while saving an altered dataset to NetCDF when loaded from a file #8694

Open

5 tasks

kmuehlbauer mentioned this issue Mar 25, 2024

ds.to_netcdf() changes values of variable #6272

Closed

TomNicholas mentioned this issue Apr 1, 2024

How to handle encoding zarr-developers/VirtualiZarr#68

Open

kmuehlbauer mentioned this issue May 13, 2024

_FillValue and missing_value attributes get removed when using open_dataset #9024

Closed

kmuehlbauer mentioned this issue Nov 10, 2024

merging and saving loaded datasets can lead to string truncation #9757

Open

5 tasks
