Harmonize `FillValue` and `missing_value` during encoding and decoding steps #3502

andersy005 · 2019-11-09T01:07:25Z

As pointed out in jbusecke/xMIP#5, xarray appears to be very strict during the encoding and decoding steps even when there are (harmless) discrepancies between missing_value and _FillValue. For instance, when dtypes of missing_value and _FillValue are different, xarray gives up:

In [74]: from xarray.coding import variables                                                    

In [75]: import numpy as np                                                                     

In [76]: import xarray as xr                                                                    

In [77]: original = xr.Variable( 
    ...:         ("x",), 
    ...:         [0.0, -1.0, 1.0], 
    ...:         encoding={"_FillValue": np.float32(1e20), "missing_value": np.float64(1e20)}, 
    ...:     )                                                                                  

In [78]: coder = variables.CFMaskCoder()                                                        

In [79]: encoded = coder.encode(original)                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-79-9fbc3632e28b> in <module>
----> 1 encoded = coder.encode(original)

/glade/work/abanihi/devel/pangeo/xarray/xarray/coding/variables.py in encode(self, variable, name)
    156             raise ValueError(
    157                 "Variable {!r} has multiple fill values {}. "
--> 158                 "Cannot encode data. ".format(name, [fv, mv])
    159             )
    160 

ValueError: Variable None has multiple fill values [1e+20, 1e+20]. Cannot encode data.
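
The error message above prints two fill values that look identical. A minimal sketch of why they still compare as different, using only NumPy (the variable names `fv` and `mv` are illustrative; this is not xarray's internal code):

```python
import numpy as np

fv = np.float32(1e20)  # _FillValue from the example above
mv = np.float64(1e20)  # missing_value from the example above

# float32 cannot represent 1e20 exactly, so once it is widened to float64
# for the comparison it is a slightly different number.
strict_equal = bool(fv == mv)

# A tolerance-based comparison treats the two as the same fill value.
close_enough = bool(np.allclose(fv, mv))

print(strict_equal, close_enough)
```

A strict equality check therefore rejects the pair, while a tolerance-based check accepts it.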

Closes #xxxx
Tests added
Passes black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

andersy005 · 2019-11-09T01:13:22Z

@dcherian, what is the right way to do the type casting in encode()?

I thought of trying something along these lines:

encoding["missing_value"] = encoding["missing_value"].astype(data.dtype)

However, I quickly realized that this breaks when encoding["missing_value"] is not a numpy object.

EDIT:

I will try using np.asarray():

encoding["missing_value"] = np.asarray(encoding["missing_value"]).astype(data.dtype)
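
A small sketch of the difference between the two attempts above, assuming float32 data. `cast_like` is a hypothetical helper introduced only for illustration:

```python
import numpy as np

data = np.array([0.0, -1.0, 1.0], dtype=np.float32)


def cast_like(value, dtype):
    """Hypothetical helper: cast a fill value to `dtype` via np.asarray."""
    return np.asarray(value).astype(dtype)


# A plain Python float has no .astype method, so calling it directly fails.
has_astype = hasattr(1e20, "astype")

# np.asarray accepts both Python and NumPy scalars.
from_python = cast_like(1e20, data.dtype)
from_numpy = cast_like(np.float64(1e20), data.dtype)

print(has_astype, from_python.dtype, from_numpy.dtype)
# The results are zero-dimensional arrays rather than scalars.
print(type(from_python).__name__, from_python.ndim)
```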

dcherian · 2019-11-11T15:34:55Z

Thanks for taking this on, Anderson.

You'll have to convert back from the numpy array before exiting that function. It may be simpler if you avoid the cast and just keep the change from equivalent to allclose_or_equiv.
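
One possible way to "convert back" is to end up with a NumPy scalar instead of a zero-dimensional array. The sketch below shows two options under the assumption of float32 data; the `encoding` dictionary is a stand-in for illustration, not the object xarray passes around:

```python
import numpy as np

data = np.array([0.0, -1.0, 1.0], dtype=np.float32)
encoding = {"_FillValue": np.float32(1e20), "missing_value": 1e20}

# Calling the dtype's scalar type casts the value and returns a NumPy
# scalar directly, so there is no 0-d array to unwrap afterwards.
harmonized = {key: data.dtype.type(value) for key, value in encoding.items()}

# A 0-d array produced by np.asarray(...).astype(...) can be unwrapped
# into a NumPy scalar by indexing it with an empty tuple.
unwrapped = np.asarray(encoding["missing_value"]).astype(data.dtype)[()]

print(type(harmonized["missing_value"]).__name__, type(unwrapped).__name__)
```

After the cast both entries share the data's dtype, so they also compare equal.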

For reference, here's how xarray determines dtype for netCDF4

xarray/xarray/backends/netCDF4_.py

Lines 131 to 141 in 4e9240a

    def _nc4_dtype(var):
        if "dtype" in var.encoding:
            dtype = var.encoding.pop("dtype")
            _check_encoding_dtype_is_vlen_string(dtype)
        elif coding.strings.is_unicode_dtype(var.dtype):
            dtype = str
        elif var.dtype.kind in ["i", "u", "f", "c", "S"]:
            dtype = var.dtype
        else:
            raise ValueError(f"unsupported dtype for netCDF4 variable: {var.dtype}")
        return dtype
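
For readers unfamiliar with `dtype.kind`, here is a short sketch of the single-character codes that the `elif` branch above tests. The `SUPPORTED_KINDS` name and the sample dtypes are chosen for illustration only:

```python
import numpy as np

SUPPORTED_KINDS = ["i", "u", "f", "c", "S"]  # list copied from the snippet above

examples = {
    "int16": np.dtype("int16"),
    "uint8": np.dtype("uint8"),
    "float32": np.dtype("float32"),
    "complex64": np.dtype("complex64"),
    "bytes": np.dtype("S4"),
    "unicode": np.dtype("U4"),
    "datetime": np.dtype("datetime64[ns]"),
    "bool": np.dtype("bool"),
}

# Map each label to whether its kind code is in the supported list.
supported = {}
for label, dtype in examples.items():
    supported[label] = dtype.kind in SUPPORTED_KINDS
    print(label, dtype.kind, supported[label])
```

Unicode dtypes are caught by the earlier `is_unicode_dtype` branch. Other kinds, such as datetime64 or bool, would fall through to the `ValueError` if they reached this function unconverted.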

(I found this by looking for the createVariable statement here)

xarray/xarray/backends/netCDF4_.py

Lines 439 to 485 in 4e9240a

    def prepare_variable(
        self, name, variable, check_encoding=False, unlimited_dims=None
    ):
        datatype = _get_datatype(
            variable, self.format, raise_on_invalid_encoding=check_encoding
        )
        attrs = variable.attrs.copy()
        fill_value = attrs.pop("_FillValue", None)
        if datatype is str and fill_value is not None:
            raise NotImplementedError(
                "netCDF4 does not yet support setting a fill value for "
                "variable-length strings "
                "(https://github.com/Unidata/netcdf4-python/issues/730). "
                "Either remove '_FillValue' from encoding on variable %r "
                "or set {'dtype': 'S1'} in encoding to use the fixed width "
                "NC_CHAR type." % name
            )
        encoding = _extract_nc4_variable_encoding(
            variable, raise_on_invalid=check_encoding, unlimited_dims=unlimited_dims
        )
        if name in self.ds.variables:
            nc4_var = self.ds.variables[name]
        else:
            nc4_var = self.ds.createVariable(
                varname=name,
                datatype=datatype,
                dimensions=variable.dims,
                zlib=encoding.get("zlib", False),
                complevel=encoding.get("complevel", 4),
                shuffle=encoding.get("shuffle", True),
                fletcher32=encoding.get("fletcher32", False),
                contiguous=encoding.get("contiguous", False),
                chunksizes=encoding.get("chunksizes"),
                endian="native",
                least_significant_digit=encoding.get("least_significant_digit"),
                fill_value=fill_value,
            )
        nc4_var.setncatts(attrs)
        target = NetCDF4ArrayWrapper(name, self)
        return target, variable.data


andersy005 · 2019-11-12T05:47:26Z

@dcherian,

> You'll have to convert back from the numpy array before exiting that function.

I was able to address this using @shoyer's suggestion above.

dcherian

Some minor comments. Looks great!
Thanks @andersy005


pep8speaks · 2019-11-12T17:35:26Z

Hello @andersy005! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-12 19:10:24 UTC

andersy005 · 2019-11-12T17:47:11Z

@dcherian & @shoyer, thank you both for your help!

In which section of whats-new.rst (bug fixes? enhancements?) should I document these changes?

dcherian · 2019-11-12T17:48:30Z

Let's do bug fixes

max-sixty · 2019-11-14T01:22:57Z

Thanks @andersy005 !

* upstream/master:
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)

* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...

* master: (24 commits)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  tweak whats-new. (pydata#3540)
  small simplification of rename from pydata#3532 (pydata#3539)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  ...

Replace equivalent() with allclose_or_equiv()

b61deeb

Ensure _FillValue & missing_value are cast to same dtype as data's

0740853

shoyer reviewed Nov 12, 2019


andersy005 added 2 commits November 11, 2019 21:50

Merge branch 'master' of github.com:pydata/xarray into fix-fill-missi…

9b5108d

…ng-values

Use Numpy scalar during type casting

404bd75

dcherian approved these changes Nov 12, 2019


Update ValueError message

d3e9562

Formatting only

3aa18f0

andersy005 marked this pull request as ready for review November 12, 2019 17:37

Update whats-new.rst

43b84d4

andersy005 changed the title ~~Relax requirements during encoding and decoding steps~~ Harmonize FillValue and missing_value during encoding and decoding steps Nov 12, 2019

max-sixty merged commit eece079 into pydata:master Nov 14, 2019

andersy005 deleted the fix-fill-missing-values branch December 11, 2019 20:09

spencerkclark mentioned this pull request Dec 15, 2019

Issue serializing arrays of times with certain dtype and _FillValue encodings #3624

Closed

spencerkclark mentioned this pull request Jan 15, 2020

Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode #3652

Merged

