Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore ability to specify _FillValue as Python native integers #9258

Merged
merged 1 commit into from
Jul 22, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions xarray/coding/variables.py
Original file line number Diff line number Diff line change
Expand Up @@ -520,9 +520,19 @@ def encode(self, variable: Variable, name: T_Name = None) -> Variable:
# trying to get it from encoding, resort to an int with the same precision as data.dtype if not available
signed_dtype = np.dtype(encoding.get("dtype", f"i{data.dtype.itemsize}"))
if "_FillValue" in attrs:
new_fill = np.array(attrs["_FillValue"])
# use view here to prevent OverflowError
attrs["_FillValue"] = new_fill.view(signed_dtype).item()
try:
# user provided the on-disk signed fill
new_fill = signed_dtype.type(attrs["_FillValue"])
except OverflowError:
Comment on lines +523 to +526
Copy link
Contributor

@dcherian dcherian Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ill-defined it seems, so let's raise SerializationWarning to encourage users to provide the signed on-disk value.

Do we have the except clause only for potentially-wrong backwards compatibile behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it now, the except clause is probably the common case as far as xarray is concerned. That is, data loaded with xarray will use the decode logic which will use the unsigned integer version of the fill value. However, for anyone constructing their own DataArray/Dataset objects, there is a good chance of providing the signed version of the fill value (the try clause).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I should have said, my original thinking in the related issue was that _FillValue should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type. I too would like there to be only one way of doing it, but I'm not sure how likely that is. I suppose the current except clause should be the preferred way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decode logic will treat -1 and 255 the same for on-disk int8 type (as intended).

my original thinking in the related issue was that _FillValue should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type.

CF says _FillValue must be the on-disk type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood @kmuehlbauer, the in-memory type is needed for masking decoding to work, but also said that maybe the unsigned and masking should be combined. I could also see an argument for it being wrong to have a _FillValue that doesn't work for and match the data it is attached to (in-memory).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct for the encoding pipeline. I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

For the decoding, you say that it already overwrites the _FillValue after casting to unsigned. Which part overwrites _FillValue? The CFMaskCoder uses the unsigned _FillValue and retains it in .encoding["_FillValue"] from my tests with a real NetCDF file. I'm also just realizing that xarray .open_dataset always converts a pure integer variable to a floating point variable and replaces fills with NaN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

Won't this warn for python integers though?

The CFMaskCoder uses the unsigned _FillValue and retains it in .encoding["_FillValue"] from my tests with a real NetCDF file.

Ah yes this is correct, we don't reset the type after masking and that would be why we want to combine UnsignedIntegerCoder and CFMaskCoder.

I think we can combine these in a future PR

I'm also just realizing that xarray .open_dataset always converts a pure integer variable to a floating point variable and replaces fills with NaN.

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

Won't this warn for python integers though?

To be clear this would be in the encode pipeline and would only happen if the fill provided doesn't fit into the on-disk dtype.

In [9]: np.dtype(np.int8).type(255)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 np.dtype(np.int8).type(255)

OverflowError: Python integer 255 out of bounds for int8

In [10]: np.dtype(np.int8).type(-1)
Out[10]: np.int8(-1)

In [11]: np.__version__
Out[11]: '2.0.0'

So 255 -> int8 is error, -1 -> int8 is fine.

BUT this means anyone doing a xr.open_dataset followed by a .to_netcdf will get the serialization warning with the logic as is (main and this PR) because the _FillValue was made unsigned in the decode pipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUT this means anyone doing a xr.open_dataset followed by a .to_netcdf will get the serialization warning with the logic as is (main and this PR) because the _FillValue was made unsigned in the decode pipeline.

OK good argument.

It seems to me like this is good to go then? And we can follow up with a PR that merges UnsignedCoder and MaskCoder, and handles the FillValue properly. Does that sound right to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the new serialization warning, yes, I think this is as I expect. Right now this PR should at least restore behavior. If anything, especially based on this discussion, this code is too flexible...but that's kind of how anyone claiming they make CF-compliant NetCDF files assumes things will work.

# user provided the in-memory unsigned fill, convert to signed type
unsigned_dtype = np.dtype(f"u{signed_dtype.itemsize}")
# use view here to prevent OverflowError
new_fill = (
np.array(attrs["_FillValue"], dtype=unsigned_dtype)
.view(signed_dtype)
.item()
)
attrs["_FillValue"] = new_fill
data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)

return Variable(dims, data, attrs, encoding, fastpath=True)
Expand Down
4 changes: 2 additions & 2 deletions xarray/tests/test_backends.py
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ def create_encoded_masked_and_scaled_data(dtype: np.dtype) -> Dataset:

def create_unsigned_masked_scaled_data(dtype: np.dtype) -> Dataset:
encoding = {
"_FillValue": np.int8(-1),
"_FillValue": 255,
dcherian marked this conversation as resolved.
Show resolved Hide resolved
"_Unsigned": "true",
"dtype": "i1",
"add_offset": dtype.type(10),
Expand Down Expand Up @@ -925,7 +925,7 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn, dtype) -> None:
assert decoded.variables[k].dtype == actual.variables[k].dtype
assert_allclose(decoded, actual, decode_bytes=False)

@pytest.mark.parametrize("fillvalue", [np.int8(-1), np.uint8(255)])
@pytest.mark.parametrize("fillvalue", [np.int8(-1), np.uint8(255), -1, 255])
def test_roundtrip_unsigned(self, fillvalue):
# regression/numpy2 test for
encoding = {
Expand Down
Loading