Harmonize FillValue and missing_value during encoding and decoding steps #3502

Merged
merged 7 commits into from
Nov 14, 2019

Conversation

andersy005
Member

@andersy005 andersy005 commented Nov 9, 2019

As pointed out in jbusecke/xMIP#5, xarray appears to be very strict during the encoding and decoding steps even when there are (harmless) discrepancies between missing_value and _FillValue. For instance, when dtypes of missing_value and _FillValue are different, xarray gives up:

In [74]: from xarray.coding import variables                                                    

In [75]: import numpy as np                                                                     

In [76]: import xarray as xr                                                                    

In [77]: original = xr.Variable( 
    ...:         ("x",), 
    ...:         [0.0, -1.0, 1.0], 
    ...:         encoding={"_FillValue": np.float32(1e20), "missing_value": np.float64(1e20)}, 
    ...:     )                                                                                  

In [78]: coder = variables.CFMaskCoder()                                                        

In [79]: encoded = coder.encode(original)                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-79-9fbc3632e28b> in <module>
----> 1 encoded = coder.encode(original)

/glade/work/abanihi/devel/pangeo/xarray/xarray/coding/variables.py in encode(self, variable, name)
    156             raise ValueError(
    157                 "Variable {!r} has multiple fill values {}. "
--> 158                 "Cannot encode data. ".format(name, [fv, mv])
    159             )
    160 

ValueError: Variable None has multiple fill values [1e+20, 1e+20]. Cannot encode data. 
  • Closes #xxxx
  • Tests added
  • Passes black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
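The discrepancy described above can be reproduced without xarray. The following is a minimal sketch using only the standard library; `as_float32` is a hypothetical helper (not part of xarray or NumPy) that stands in for `np.float32(...)` by rounding a Python float to 32-bit precision. It suggests why the two fill values in the error message look identical but still fail an exact equality check.

```python
import math
import struct


def as_float32(value: float) -> float:
    """Hypothetical helper: round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


fill_value = as_float32(1e20)  # stands in for np.float32(1e20)
missing_value = 1e20           # stands in for np.float64(1e20)

# 1e20 is not exactly representable in 32 bits, so the values differ slightly.
exactly_equal = fill_value == missing_value

# A relative tolerance (the value 1e-5 is an assumption) treats them as the same.
close_enough = math.isclose(fill_value, missing_value, rel_tol=1e-5)

print(repr(fill_value), repr(missing_value))
print("exactly equal:", exactly_equal)
print("close enough:", close_enough)
```

Under these assumptions the exact comparison fails while the tolerant one succeeds, which is the kind of harmless mismatch this PR aims to accept.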

@andersy005
Member Author

andersy005 commented Nov 9, 2019

@dcherian, what is the right way to do the type casting in encode()?

I thought of trying something along these lines:

encoding["missing_value"] = encoding["missing_value"].astype(data.dtype)

However, I quickly realized that this breaks when encoding["missing_value"] is not a numpy object.

EDIT:

I will try using np.asarray():

encoding["missing_value"] = np.asarray(encoding["missing_value"]).astype(data.dtype)
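A small sketch of the difference between the two attempts, assuming the variable's data is `float32` (the name `data_dtype` is illustrative only). It shows that a plain Python float has no `.astype` method, while wrapping it with `np.asarray()` first makes the cast work. Note that the result is a 0-d NumPy array rather than a scalar.

```python
import numpy as np

data_dtype = np.dtype("float32")  # assumed dtype of the variable's data

# A plain Python float is not a NumPy object and has no .astype method.
plain = 1e20
has_astype = hasattr(plain, "astype")

# np.asarray wraps any scalar (Python or NumPy) in a 0-d array,
# which can then be cast to the target dtype.
cast = np.asarray(plain).astype(data_dtype)

print("plain float has .astype:", has_astype)
print("cast dtype:", cast.dtype, "shape:", cast.shape)
```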

@dcherian
Contributor

Thanks for taking this on, Anderson.

You'll have to convert back from the numpy array before exiting that function. It may be simpler if you avoid the cast and just keep the change from equivalent to allclose_or_equiv.
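The idea of comparing with a tolerance instead of casting can be sketched as follows. This is not xarray's implementation: `fill_values_agree` and `pick_fill_value` are hypothetical names, the tolerance is an assumption, and only the standard library is used. The sketch treats two NaNs as equivalent and still raises when the fill values genuinely conflict.

```python
import math


def fill_values_agree(fv, mv, rel_tol=1e-5):
    """Hypothetical helper: True if two fill values match up to rounding error."""
    if isinstance(fv, float) and isinstance(mv, float):
        # NaN never compares equal to itself, so handle it explicitly.
        if math.isnan(fv) and math.isnan(mv):
            return True
        return math.isclose(fv, mv, rel_tol=rel_tol)
    # Non-float values (e.g. integers) are compared exactly.
    return fv == mv


def pick_fill_value(name, fv, mv):
    """Hypothetical helper: return one fill value, or raise on a real conflict."""
    if fv is None:
        return mv
    if mv is None:
        return fv
    if not fill_values_agree(fv, mv):
        raise ValueError(
            f"Variable {name!r} has multiple fill values {[fv, mv]}. "
            "Cannot encode data."
        )
    return fv


# 1.0000000200408773e20 is 1e20 rounded to 32-bit precision.
print(pick_fill_value("x", 1.0000000200408773e20, 1e20))
```

With a check of this shape, no cast is needed, so nothing has to be converted back from a NumPy array afterwards.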

For reference, here's how xarray determines dtype for netCDF4

def _nc4_dtype(var):
    if "dtype" in var.encoding:
        dtype = var.encoding.pop("dtype")
        _check_encoding_dtype_is_vlen_string(dtype)
    elif coding.strings.is_unicode_dtype(var.dtype):
        dtype = str
    elif var.dtype.kind in ["i", "u", "f", "c", "S"]:
        dtype = var.dtype
    else:
        raise ValueError(f"unsupported dtype for netCDF4 variable: {var.dtype}")
    return dtype

(I found this by looking for the createVariable statement here)

def prepare_variable(
    self, name, variable, check_encoding=False, unlimited_dims=None
):
    datatype = _get_datatype(
        variable, self.format, raise_on_invalid_encoding=check_encoding
    )
    attrs = variable.attrs.copy()
    fill_value = attrs.pop("_FillValue", None)
    if datatype is str and fill_value is not None:
        raise NotImplementedError(
            "netCDF4 does not yet support setting a fill value for "
            "variable-length strings "
            "(https://github.com/Unidata/netcdf4-python/issues/730). "
            "Either remove '_FillValue' from encoding on variable %r "
            "or set {'dtype': 'S1'} in encoding to use the fixed width "
            "NC_CHAR type." % name
        )
    encoding = _extract_nc4_variable_encoding(
        variable, raise_on_invalid=check_encoding, unlimited_dims=unlimited_dims
    )
    if name in self.ds.variables:
        nc4_var = self.ds.variables[name]
    else:
        nc4_var = self.ds.createVariable(
            varname=name,
            datatype=datatype,
            dimensions=variable.dims,
            zlib=encoding.get("zlib", False),
            complevel=encoding.get("complevel", 4),
            shuffle=encoding.get("shuffle", True),
            fletcher32=encoding.get("fletcher32", False),
            contiguous=encoding.get("contiguous", False),
            chunksizes=encoding.get("chunksizes"),
            endian="native",
            least_significant_digit=encoding.get("least_significant_digit"),
            fill_value=fill_value,
        )
    nc4_var.setncatts(attrs)
    target = NetCDF4ArrayWrapper(name, self)
    return target, variable.data

@andersy005
Member Author

@dcherian,

You'll have to convert back from the numpy array before exiting that function.

I was able to address this using @shoyer's suggestion above.

Contributor

@dcherian dcherian left a comment

Some minor comments. Looks great!
Thanks @andersy005

@pep8speaks

pep8speaks commented Nov 12, 2019

Hello @andersy005! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-12 19:10:24 UTC

@andersy005 andersy005 marked this pull request as ready for review November 12, 2019 17:37
@andersy005
Member Author

@dcherian & @shoyer, thank you both for your help!

In which section (bug fixes? enhancements?) of whats-new.rst should I document these changes?

@dcherian
Contributor

Let's do bug fixes

@andersy005 andersy005 changed the title Relax requirements during encoding and decoding steps Harmonize FillValue and missing_value during encoding and decoding steps Nov 12, 2019
@max-sixty max-sixty merged commit eece079 into pydata:master Nov 14, 2019
@max-sixty
Collaborator

Thanks @andersy005 !

dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master:
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 22, 2019
* master: (24 commits)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  tweak whats-new. (pydata#3540)
  small simplification of rename from pydata#3532 (pydata#3539)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  ...
@andersy005 andersy005 deleted the fix-fill-missing-values branch December 11, 2019 20:09