Improved CF decoding #6812

mankoff · 2022-07-19T19:44:27Z

Closes #2304 (float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray), but only for my specific use case.
Tests added

The comments above this line state, "so we just use a float64" but then it returns np.float32. I assume the comments are correct. Changing this also fixes a bug I ran into.

Note that currently, _choose_float_dtype returns float32 if the data is float16 or float32, even if the scale_factor dtype is float64.

mankoff · 2022-07-19T19:46:23Z

Note - I also have not run the "Running the performance test suite" code in https://xarray.pydata.org/en/stable/contributing.html - I assume changing from float32 to float64 would impact performance. I can run that if suggested.

mankoff · 2022-07-19T20:19:29Z

I'm reading more in https://github.com/pydata/xarray/blob/2a5686c6fe855502523e495e43bd381d14191c7b/xarray/coding/variables.py and I'm confused about some logic:

xarray/xarray/coding/variables.py

Lines 271 to 272 in 2a5686c

    
            add_offset = pop_to(attrs, encoding, "add_offset", name=name)
            dtype = _choose_float_dtype(data.dtype, "add_offset" in attrs)

pop_to does a pop operation - it removes the key/value pair. So line 1 above will remove add_offset from attrs if it exists. The second line then checks for "add_offset" in attrs which should always be False.

I think this is happening based on inspecting with the debugger.
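The effect can be reproduced with plain dictionaries. The following is a minimal sketch: `pop_to_sketch` is a hypothetical, simplified stand-in for pop_to (it omits the `name` argument and is not xarray's implementation), and the attribute values are made up for illustration.

```python
def pop_to_sketch(source, dest, key):
    """Simplified stand-in for pop_to: move `key` from `source` to `dest`.

    Hypothetical helper for illustration only; the real function also
    takes a `name` argument, which is omitted here.
    """
    value = source.pop(key, None)
    if value is not None:
        dest[key] = value
    return value


# Made-up attribute values, for illustration only.
attrs = {"scale_factor": 0.01, "add_offset": 273.15}
encoding = {}

add_offset = pop_to_sketch(attrs, encoding, "add_offset")

# After the pop, the key lives in `encoding`, not in `attrs`.
has_offset_from_attrs = "add_offset" in attrs
has_offset_from_encoding = "add_offset" in encoding
print(has_offset_from_attrs, has_offset_from_encoding)  # False True
```

Under these assumptions, a membership check against attrs made after the pop can never report the offset, while the same check against encoding does.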

Furthermore, the fix I implemented in this Pull Request which returns np.float64 fixes my bug, but only because this bug exists. My dataset has add_offset, so the lines I changed:

        if not has_offset:
            return np.float64

should not run, but do run because of this issue.


dcherian · 2022-07-22T22:10:53Z

xarray/coding/variables.py

        if not has_offset:
-            return np.float32
+            return np.float64


I think the code matches the comments. It would be clearer if written as

    if has_offset:
        return np.float64
    else:
        return np.float32

Without your edits, if there is an offset, the condition does not trigger and we return np.float64 later.
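For reference, the behaviour described in this review can be sketched as a standalone function. This is a reconstruction from the snippets quoted in this thread, not a copy of xarray's code: the name `choose_float_dtype_sketch` is hypothetical, and the exact size thresholds in the two outer conditions are assumptions.

```python
import numpy as np


def choose_float_dtype_sketch(dtype, has_offset):
    """Sketch of the dtype choice discussed above (not xarray's exact code)."""
    dtype = np.dtype(dtype)
    # From the thread: float16 and float32 data are decoded as float32.
    # (The itemsize test is an assumption about how that is checked.)
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # Assumption: only integers of at most 16 bits take the float32 shortcut.
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        # Any offset forces float64; a scale factor alone keeps float32.
        if has_offset:
            return np.float64
        else:
            return np.float32
    # For all other types and circumstances, use float64.
    return np.float64


print(choose_float_dtype_sketch(np.int16, has_offset=False))
print(choose_float_dtype_sketch(np.int16, has_offset=True))
```

Under these assumptions, int16 data with only a scale_factor decodes to float32 regardless of the scale_factor's own dtype, which is the situation reported in #2304.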

Thanks for reviewing this pull request. FYI my original comment (later edited) said:

Also, before this is merged, I'd like to suggest a larger change, and possibly discuss architecture here a bit (if appropriate). Specifically, I'd like to change the _choose_float_dtype function, and the two calls to it, to pass in the dtype of scale_factor and add_offset, in addition to the data dtype. This function should then return the dtype with the highest precision of the three.

Currently, _choose_float_dtype returns float32 if the data is float16 or float32, even if the scale_factor dtype is float64.

Based on your comment, I think my original intuition - that this function needs a large rewrite - is correct. I'll look into this and submit additional commits to this PR.

Thanks for looking into this!

dcherian · 2022-07-22T22:12:36Z

xarray/coding/variables.py

@@ -269,7 +269,7 @@ def decode(self, variable, name=None):
        if "scale_factor" in attrs or "add_offset" in attrs:
            scale_factor = pop_to(attrs, encoding, "scale_factor", name=name)
            add_offset = pop_to(attrs, encoding, "add_offset", name=name)
-            dtype = _choose_float_dtype(data.dtype, "add_offset" in attrs)
+            dtype = _choose_float_dtype(data.dtype, "add_offset" in encoding)


I suspect this fixed one issue, but the original issue still remains because we still aren't looking at the dtype of scale_factor and add_offset as recommended by the conventions.

Note - I think the conventions referred to above are https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html, which says:

If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.
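One possible reading of the quoted rule, written out as a small pure-Python sketch. The function name `cf_unpacked_type` is hypothetical, types are given as netCDF type names, and the sketch assumes both attributes are present; it illustrates the text above and is not xarray's behaviour.

```python
PACKED_TYPES = {"byte", "short", "int"}
FLOAT_TYPES = {"float", "double"}


def cf_unpacked_type(packed_type, scale_factor_type, add_offset_type):
    """Return the unpacked type name under the rule quoted above.

    Hypothetical helper: types are netCDF type names given as strings,
    and both attributes are assumed to be present.
    """
    if scale_factor_type != add_offset_type:
        raise ValueError("scale_factor and add_offset must share one type")
    attr_type = scale_factor_type
    if attr_type == packed_type:
        # Same type as the variable: unpacked data keeps the packed type.
        return packed_type
    # Different type: attributes must be float or double, data must be integer.
    if attr_type not in FLOAT_TYPES:
        raise ValueError("attributes must both be float or both be double")
    if packed_type not in PACKED_TYPES:
        raise ValueError("packed data must be byte, short or int")
    return attr_type


print(cf_unpacked_type("short", "float", "float"))    # float
print(cf_unpacked_type("short", "double", "double"))  # double
```

Note that under this reading the unpacked type follows the attributes, not the packed variable, whenever the two differ.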


mankoff · 2022-07-29T23:03:54Z

xarray/coding/variables.py

-        if not has_offset:
+        if has_offset:
+            return np.float64
+        else:


Reverted to original algorithm as per suggestion from dcherian.

mankoff · 2022-07-29T23:05:12Z

xarray/coding/variables.py

            return np.float32
    # For all other types and circumstances, we just use float64.
    # (safe because eg. complex numbers are not supported in NetCDF)
    return np.float64


+def _choose_float_dtype_decoding(dtype, scale_factor, add_offset):


I chose to focus on decoding per the CF specification, so I split the function. Furthermore, decoding makes heavy use of np.find_common_type to select the correct datatype for the final product.
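A rough sketch of the idea of promoting the data dtype together with the attribute dtypes. The helper name `common_decoding_dtype` is hypothetical and this is not the function added in this pull request. It uses np.result_type rather than np.find_common_type, which is deprecated in newer NumPy releases.

```python
import numpy as np


def common_decoding_dtype(data_dtype, scale_factor, add_offset):
    """Promote the data dtype with the dtypes of the two attributes.

    Hypothetical helper; it is not the function added in this pull request.
    """
    dtypes = [np.dtype(data_dtype)]
    for attr in (scale_factor, add_offset):
        if attr is not None:
            # Use the attribute's dtype only, not its value, for promotion.
            dtypes.append(np.asarray(attr).dtype)
    return np.result_type(*dtypes)


# int16 data with a float32 scale_factor stays float32.
print(common_decoding_dtype(np.int16, np.float32(0.5), None))
# int16 data with float64 attributes is promoted to float64.
print(common_decoding_dtype(np.int16, np.float64(0.5), np.float64(10)))
```

Plain NumPy promotion is not identical to the CF wording: for example, int32 data with a float32 attribute promotes to float64 here.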

mankoff · 2022-07-29T23:06:49Z

xarray/coding/variables.py

@@ -224,7 +224,7 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype):
    return data


-def _choose_float_dtype(dtype, has_offset):
+def _choose_float_dtype_encoding(dtype, has_offset):


Encoding per the CF spec is fairly specific. The packed variable is supposed to be type byte/short/int, not float. Most of the tests encode with a scale_factor or add_offset that require the packed data to be type float. Rather than trying to solve all this, I have just split the encode and decode dtype functions.

mankoff · 2022-07-29T23:07:44Z

xarray/coding/variables.py

-        if "add_offset" in encoding:
-            data -= pop_to(encoding, attrs, "add_offset", name=name)
-        if "scale_factor" in encoding:
-            data /= pop_to(encoding, attrs, "scale_factor", name=name)


The data type may change when adding or scaling, hence changing from data /= ... to data = data / ....
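A small NumPy illustration of why the in-place form matters. The array values are made up; one-element arrays are used for the scale so that ordinary array promotion applies.

```python
import numpy as np

packed = np.array([10.0, 20.0], dtype=np.float32)
scale = np.array([10.0], dtype=np.float64)

# In-place division writes the result back into the float32 buffer.
in_place = packed.copy()
in_place /= scale

# Out-of-place division lets NumPy promote the result to float64.
out_of_place = packed / scale

print(in_place.dtype, out_of_place.dtype)  # float32 float64

# In-place division of an integer array by a float cannot be cast back.
ints = np.array([10, 20], dtype=np.int16)
try:
    ints /= scale
    in_place_int_failed = False
except TypeError:
    in_place_int_failed = True
print(in_place_int_failed)  # True
```

With the in-place operators the result is forced into the dtype of the existing array, so any wider dtype coming from scale_factor or add_offset is lost or rejected.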

mankoff · 2022-07-29T23:08:33Z

xarray/tests/test_coding.py



-@pytest.mark.parametrize("scale_factor", (10, [10]))


I'm not sure that scale_factor or add_offset can be an array type per the CF spec, so I changed this test.

These kinds of things tend to happen though. Since we have tested for it, we should just keep it around.

dcherian · 2022-10-03T21:01:43Z

Sorry for dropping this, @mankoff. How can we move forward here?

mankoff · 2022-10-03T22:29:28Z

Hi @dcherian - I dropped this because I went down a rabbit hole that seemed very very deep.

Xarray has written 10s (100s?) of tests that touch this decoding function that make assumptions that I believe are incorrect after a careful reading of the CF spec. I believe the path forward will take some conversation before coding, so perhaps this should be moved to an issue rather than a pull request? A big decision is if the decode option strictly follows CF guidelines. If so, then a lot of tests need to be changed (for example, to follow the simple rule of [scale_factor and add_offset] must both be of type float or both be of type double).

Enforcing this would probably break xarray backward compatibility for writing files. I assume that that may be OK and there are processes to handle this (start with 'deprecation' warnings, then eventually throw errors?). There are also likely many NetCDF files that are not standard compliant and we need to decide how to read them.

Furthermore, the CF conventions are themselves not very clear, and possibly ambiguous. I started a conversation on this here: cf-convention/cf-conventions#374, but that is also unresolved at the moment. The CF convention mentions int and float, but not how many bytes those are. What happens when a file is written & packed on one architecture and read & unpacked on another?

mankoff · 2022-10-06T02:48:22Z

A bit more detail about the existing tests that don't match the CF spec. Per the spec, scale_factor and add_offset should be of the same type. That causes tests throughout https://github.com/pydata/xarray/blob/main/xarray/tests/test_coding.py and https://github.com/pydata/xarray/blob/main/xarray/tests/test_backends.py to fail, because:

xarray/xarray/tests/test_coding.py

Lines 112 to 113 in 13c52b2

    
    @pytest.mark.parametrize("scale_factor", (10, [10]))
    @pytest.mark.parametrize("add_offset", (0.1, [0.1]))

There is 1 test in test_coding, and 9 tests in test_backends that use mixed types. That's a tractable number I can fix.

In addition, the expected dtype returned by many of the tests does not match (my interpretation of) the expected dtype per the CF spec.

I am concerned that this is a significant change and I'm not sure what the process is for making this change. I would like to have some idea, even if not a guarantee, that it would be welcomed and accepted before doing all the work. I note that another recent large PR to try to fix CF decoding has also stalled, and I'm not sure why (see #2751).

dcherian · 2022-10-17T16:27:22Z

A big decision is if the decode option strictly follows CF guidelines.

I think our general position is to be flexible on what we can read because there are many slightly non-compliant files out there.

Xarray has written 10s (100s?) of tests that touch this decoding function that make assumptions that I believe are incorrect after a careful reading of the CF spec.

Some of these might just be for convenience and some might be checking that we are flexible in what we can read.

The following test should be preserved so we can read those files (#4631):

    @pytest.mark.parametrize("scale_factor", (10, [10]))
    @pytest.mark.parametrize("add_offset", (0.1, [0.1]))

Enforcing this would probably break xarray backward compatibility for writing files.

Do we not enforce that scale_factor and add_offset are of the same dtype on write? If so, we should consider that a bug and fix it.
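As a rough illustration of what such a check could look like (the helper name `attrs_share_dtype` is hypothetical and no such function is claimed to exist in xarray):

```python
import numpy as np


def attrs_share_dtype(scale_factor, add_offset):
    """Return True if both packing attributes have the same dtype.

    Hypothetical helper for illustration only.
    """
    return np.asarray(scale_factor).dtype == np.asarray(add_offset).dtype


# The mixed int/float values used in the parametrized test are flagged.
print(attrs_share_dtype(10, 0.1))                            # False
print(attrs_share_dtype(np.float32(10), np.float32(0.1)))    # True
```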

I am concerned that this is a significant change and I'm not sure what the process is for making this change.

I think the way to move forward would be to figure out the smallest change that would fix (or even improve) #2304 and move on. We have a 30-minute bi-weekly meeting (#4001) that you're welcome to attend and raise specific questions. The next one is Oct 26 at 9:30am Mountain Time.

dcherian · 2023-04-01T15:26:04Z

We should figure out how to express some of this understanding as tests (some xfailed). That way it's easy to check when something gets fixed, and prevent regressions.

Make code match comments - use 64bit float

2a5686c

mankoff added 2 commits July 19, 2022 13:23

Fix logic bug

108586e

Line above this removes 'add_offset' from 'attrs' (if it exists), so '"add_offset" in attrs' should always be false. It was moved into 'encoding' so let's check for it there.

Fixed test suite for new float64 dtype

4eedd29

Modified _choose_float_dtype + Returns float32 if inputs are float16 or float32 + Returns float64 if inputs are int

dcherian reviewed Jul 22, 2022


dcherian added the needs work label Jul 22, 2022

Undo 2a5686c per correction from @dcherian

312acda

pydata#6812 (review)

mankoff mentioned this pull request Jul 29, 2022

Fix logic bug - add_offset is in encoding, not attrs. #6851

Merged

Rewrite per CF standards

4615720

https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html Split encoding and decoding for now.

mankoff commented Jul 29, 2022


mankoff changed the title from "Make code match comments - use 64bit float" to "Improved CF decoding" Jul 29, 2022

dcherian mentioned this pull request Mar 28, 2023

nan values appearing when saving and loading from netCDF due to encoding #7691

Closed

4 tasks

mankoff mentioned this pull request Apr 1, 2023

cf-coding #7654

Closed

4 tasks

Mikejmnez mentioned this pull request Mar 28, 2024

Diagnose xarray-with-pydap errors described below pydap/pydap#297

Closed
