-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't set encoding attributes on bounds variables. #2965
Conversation
Fixes pydata#2921 1. Removes certain attributes from bounds variables on encode. 2. open_mfdataset: Sets encoding on variables based on encoding in first file.
Thanks for putting this together. I think this can work as long as |
Yeah, this still doesn't work as @klindsay28 just pointed out to me. I still don't understanding how the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on @dcherian. I think this is on the right track. Do you have an explicit example of the edge case(s) you still have in mind?
Thanks @spencerkclark, I'm going by the example in #2921. I've added that now as a test. I also test for the slight non-compliance case where time_bounds has different units already set. @mathause would this test catch the issue you were encountering? |
Great - this does indeed solve my problem (read multiple files with import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')
ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.attrs['bounds'] = 'time_bounds'
xr.conventions.cf_encoder(ds.variables, ds.attrs) (compare the units). This might be difficult to solve, as xarray currently assumes all variables can be encoded independently. |
I was hoping someone wouldn't bring that case up :). But I agree I think it's something we should discuss. I'm sort of torn on whether it is worth the complexity. Part of me feels like if a user is concerned/clever enough to explicitly create such linking attributes (e.g. import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')
ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.encoding['bounds'] = 'time_bounds'
ds.time.encoding['units'] = 'days since 2000-01-01'
ds.time.encoding['calendar'] = 'proleptic_gregorian'
ds.time_bounds.encoding['units'] = ds.time.encoding['units']
ds.time_bounds.encoding['calendar'] = ds.time.encoding['calendar'] That said, that case is something that in principle we could address, and maybe it is worth thinking about if we ever consider increasing the functionality related to cell bounds coordinates (e.g. #1475). |
if attr in new_vars[bounds].attrs and attr in var.attrs: | ||
if new_vars[bounds].attrs[attr] == var.attrs[attr]: | ||
new_vars[bounds].attrs.pop(attr) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we issue a warning here?
else:
warning.warn("The attribute 'units' is not the same in the variable "
"'time' and it's associated bounds 'time_bnds',"
" which is not cf-compliant.")
or some such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to? xarray allows writing CF-non-compliant files anyway...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the warning you added is the perfect approach. It will still be issued in the case of @mathause's example, but will still allow a user to write a non-CF-compliant file without a warning if the encoding attributes do not need to be computed on the fly.
Yes, I agree - I don't intend to actually create such a dataset by hand anytime soon ;) but it would be super hard to track this issue down - can we issue a warning instead? See my inline comment. |
This case might actually be really easy to fix since we already have I think we would just need to call this prior to calling Line 622 in 612d390
def _update_bounds_attributes(variables):
"""Adds time attributes to time bounds variables.
Variables handling time bounds ("Cell boundaries" in the CF
conventions) do not necessarily carry the necessary attributes to be
decoded. This copies the attributes from the time variable to the
associated boundaries.
See Also:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/
cf-conventions.html#cell-boundaries
https://github.com/pydata/xarray/issues/2565
"""
# For all time variables with bounds
for v in variables.values():
attrs = v.attrs
has_date_units = 'units' in attrs and 'since' in attrs['units']
if has_date_units and 'bounds' in attrs:
if attrs['bounds'] in variables:
bounds_attrs = variables[attrs['bounds']].attrs
bounds_attrs.setdefault('units', attrs['units'])
if 'calendar' in attrs:
bounds_attrs.setdefault('calendar', attrs['calendar']) |
I'm afraid it's not quite that simple :). In cases where someone creates a datetime-like variable in memory (e.g. with |
Thanks @spencerkclark & @mathause. I now understand the issue better. #2921 was confusing in that @klindsay28 was trying to write an encoded dataset. I've updated the tests to use @mathuse's example. The code now updates the Do you think this is a good approach? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
# For all time variables with bounds | ||
for v in variables.values(): | ||
attrs = v.attrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question -- would we consider 'bounds'
to be an encoding parameter (like 'units'
or 'calendar'
)? In other words should we expect it to be in the encoding
dictionary or attrs
dictionary at this stage? I feel like it may be more intuitive as part of encoding
, but currently I know that we don't treat it that way when decoding files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mental model, encoding attributes are those that control on-disk representation of the data. bounds
counts as an attr to my mind since it's an attribute that links the variable to another variable.
A definition or list of what goes in encoding
and what goes in attrs
would make a good addition to the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mental model, encoding attributes are those that control on-disk representation of the data.
I think this is fair; I guess I was going off of the mental model of encoding parameters defined as "attributes that are potentially required for decoding all the variables in a file," in which case 'bounds'
could qualify. I think your definition is probably cleaner, because it requires that encoding parameters control how the variable they are attached to is represented on disk (as opposed to another variable).
* master: (31 commits) Add quantile method to GroupBy (pydata#2828) rolling_exp (nee ewm) (pydata#2650) Ensure explicitly indexed arrays are preserved (pydata#3027) add back dask-dev tests (pydata#3025) ENH: keepdims=True for xarray reductions (pydata#3033) Revert cmap fix (pydata#3038) Add "errors" keyword argument to drop() and drop_dims() (pydata#2994) (pydata#3028) More consistency checks (pydata#2859) Check types in travis (pydata#3024) Update issue templates (pydata#3019) Add pytest markers to avoid warnings (pydata#3023) Feature/merge errormsg (pydata#2971) More support for missing_value. (pydata#2973) Use flake8 rather than pycodestyle (pydata#3010) Pandas labels deprecation (pydata#3016) Pytest capture uses match, not message (pydata#3011) dask-dev tests to allowed failures in travis (pydata#3014) Fix 'to_masked_array' computing dask arrays twice (pydata#3006) str accessor (pydata#2991) fix safe_cast_to_index (pydata#3001) ...
… fix/bounds_encode_2 * 'fix/bounds_encode_2' of github.com:dcherian/xarray:
Issue pydata#2921 is about mismatching time units between a time variable and its "bounds" companion. However, pydata#2965 does more than fixing pydata#2921, it removes all double attributes from "bounds" variables which has the undesired side effect that there is currently no way to save them to netcdf with xarray. Since the mentioned link is a recommendation and not a hard requirement for CF compliance, these attributes should be left to the caller to prepare the dataset variables appropriately if required. Reduces the amount of surprise that attributes are not written to disk and fixes pydata#8368.
Here's a proposed fix for #2436 and #2921. Ping @spencerkclark @mathause @klindsay28
whats-new.rst
for all changes andapi.rst
for new API