Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't set encoding attributes on bounds variables. #2965

Merged
merged 18 commits into from
Jun 25, 2019

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented May 15, 2019

Here's a proposed fix for #2436 and #2921. Ping @spencerkclark @mathause @klindsay28

  1. Removes certain attributes from bounds variables on encode.
  2. open_mfdataset: Sets encoding on variables based on encoding in first file.

Fixes pydata#2921

1. Removes certain attributes from bounds variables on encode.
2. open_mfdataset: Sets encoding on variables based on encoding in first file.
@pep8speaks
Copy link

pep8speaks commented May 15, 2019

Hello @dcherian! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-24 21:51:22 UTC

@mathause
Copy link
Collaborator

Thanks for putting this together.

I think this can work as long as time has an encoding - however, if it doesn't we might end up with different units for time and time_bounds again. I am not sure, actually, just something to check.

@dcherian
Copy link
Contributor Author

Yeah, this still doesn't work as @klindsay28 just pointed out to me.

I still don't understanding how the units attribute/encoding is being treated. I'll get back to this soon.

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on @dcherian. I think this is on the right track. Do you have an explicit example of the edge case(s) you still have in mind?

xarray/conventions.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor Author

Thanks @spencerkclark, I'm going by the example in #2921. I've added that now as a test. I also test for the slight non-compliance case where time_bounds has different units already set.

@mathause would this test catch the issue you were encountering?

@dcherian dcherian changed the title [WIP] Don't set attributes on bounds variables. Don't set attributes on bounds variables. May 17, 2019
@mathause
Copy link
Collaborator

Great - this does indeed solve my problem (read multiple files with open_mfdataset and write as one). However, it still 'fails' - is not cf-compiant to be more precise - for a new dataset created in xarray, see:

import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')

ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.attrs['bounds'] = 'time_bounds'

xr.conventions.cf_encoder(ds.variables, ds.attrs)

(compare the units). This might be difficult to solve, as xarray currently assumes all variables can be encoded independently.

@spencerkclark
Copy link
Member

This might be difficult to solve, as xarray currently assumes all variables can be encoded independently.

I was hoping someone wouldn't bring that case up :). But I agree I think it's something we should discuss.

I'm sort of torn on whether it is worth the complexity. Part of me feels like if a user is concerned/clever enough to explicitly create such linking attributes (e.g. 'bounds' pointing to 'time_bounds') in their Datasets that we could leave it as their responsibility to make sure the 'units' and 'calendar' encodings match for each of the variables as well. For example, while a bit verbose, this would work as desired:

import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')

ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.encoding['bounds'] = 'time_bounds'
ds.time.encoding['units'] = 'days since 2000-01-01'
ds.time.encoding['calendar'] = 'proleptic_gregorian'

ds.time_bounds.encoding['units'] = ds.time.encoding['units']
ds.time_bounds.encoding['calendar'] = ds.time.encoding['calendar']

That said, that case is something that in principle we could address, and maybe it is worth thinking about if we ever consider increasing the functionality related to cell bounds coordinates (e.g. #1475).

if attr in new_vars[bounds].attrs and attr in var.attrs:
if new_vars[bounds].attrs[attr] == var.attrs[attr]:
new_vars[bounds].attrs.pop(attr)

Copy link
Collaborator

@mathause mathause May 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we issue a warning here?

else:
    warning.warn("The attribute 'units' is not the same in the variable "
"'time' and it's associated bounds 'time_bnds',"
" which is not cf-compliant.")

or some such

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to? xarray allows writing CF-non-compliant files anyway...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the warning you added is the perfect approach. It will still be issued in the case of @mathause's example, but will still allow a user to write a non-CF-compliant file without a warning if the encoding attributes do not need to be computed on the fly.

@mathause
Copy link
Collaborator

Yes, I agree - I don't intend to actually create such a dataset by hand anytime soon ;) but it would be super hard to track this issue down - can we issue a warning instead? See my inline comment.

@dcherian
Copy link
Contributor Author

This case might actually be really easy to fix since we already have xr.conventions._update_bounds_attributes() from Fabien's previous PR (see below).

I think we would just need to call this prior to calling encode_cf_variable() in

new_vars = OrderedDict((k, encode_cf_variable(v, name=k))

def _update_bounds_attributes(variables):
    """Adds time attributes to time bounds variables.

    Variables handling time bounds ("Cell boundaries" in the CF
    conventions) do not necessarily carry the necessary attributes to be
    decoded. This copies the attributes from the time variable to the
    associated boundaries.

    See Also:

    http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/
         cf-conventions.html#cell-boundaries

    https://github.com/pydata/xarray/issues/2565
    """


    # For all time variables with bounds
    for v in variables.values():
        attrs = v.attrs
        has_date_units = 'units' in attrs and 'since' in attrs['units']
        if has_date_units and 'bounds' in attrs:
            if attrs['bounds'] in variables:
                bounds_attrs = variables[attrs['bounds']].attrs
                bounds_attrs.setdefault('units', attrs['units'])
                if 'calendar' in attrs:
                    bounds_attrs.setdefault('calendar', attrs['calendar'])

@spencerkclark
Copy link
Member

I think we would just need to call this prior to calling encode_cf_variable()

I'm afraid it's not quite that simple :).

In cases where someone creates a datetime-like variable in memory (e.g. with pd.date_range in @mathause's example), unless they explicitly add encoding attributes, 'units' and 'calendar' will need to be computed on the fly. This happens inside of encode_cf_variable, currently on a per-variable basis, so trying to make sure the attributes are equal before that step will not help, because they might not exist.

@dcherian dcherian changed the title Don't set attributes on bounds variables. [WIP] Don't set encoding attributes on bounds variables. May 19, 2019
@dcherian
Copy link
Contributor Author

dcherian commented May 19, 2019

Thanks @spencerkclark & @mathause. I now understand the issue better. #2921 was confusing in that @klindsay28 was trying to write an encoded dataset.

I've updated the tests to use @mathuse's example. The code now updates the time_bounds variable with the encoding 'units' and 'calendar' of the time variable. It also throws a warning when encoding.units is not specified for variables with a bounds attribute.

Do you think this is a good approach?

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dcherian -- @mathause's suggestion of adding a warning was helpful, and I think your addition to potentially propagate the 'units' and 'calendar' encoding parameters from the root to the bounds coordinate is also good. I just have a few more comments and a question.

xarray/conventions.py Outdated Show resolved Hide resolved
xarray/conventions.py Outdated Show resolved Hide resolved

# For all time variables with bounds
for v in variables.values():
attrs = v.attrs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question -- would we consider 'bounds' to be an encoding parameter (like 'units' or 'calendar')? In other words should we expect it to be in the encoding dictionary or attrs dictionary at this stage? I feel like it may be more intuitive as part of encoding, but currently I know that we don't treat it that way when decoding files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mental model, encoding attributes are those that control on-disk representation of the data. bounds counts as an attr to my mind since it's an attribute that links the variable to another variable.

A definition or list of what goes in encoding and what goes in attrs would make a good addition to the docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mental model, encoding attributes are those that control on-disk representation of the data.

I think this is fair; I guess I was going off of the mental model of encoding parameters defined as "attributes that are potentially required for decoding all the variables in a file," in which case 'bounds' could qualify. I think your definition is probably cleaner, because it requires that encoding parameters control how the variable they are attached to is represented on disk (as opposed to another variable).

xarray/tests/test_coding_times.py Show resolved Hide resolved
@dcherian dcherian changed the title [WIP] Don't set encoding attributes on bounds variables. Don't set encoding attributes on bounds variables. May 20, 2019
@dcherian dcherian mentioned this pull request May 21, 2019
15 tasks
xarray/backends/api.py Outdated Show resolved Hide resolved
dcherian and others added 4 commits June 24, 2019 17:47
* master: (31 commits)
  Add quantile method to GroupBy (pydata#2828)
  rolling_exp (nee ewm) (pydata#2650)
  Ensure explicitly indexed arrays are preserved (pydata#3027)
  add back dask-dev tests (pydata#3025)
  ENH: keepdims=True for xarray reductions (pydata#3033)
  Revert cmap fix (pydata#3038)
  Add "errors" keyword argument to drop() and drop_dims() (pydata#2994) (pydata#3028)
  More consistency checks (pydata#2859)
  Check types in travis (pydata#3024)
  Update issue templates (pydata#3019)
  Add pytest markers to avoid warnings (pydata#3023)
  Feature/merge errormsg (pydata#2971)
  More support for missing_value. (pydata#2973)
  Use flake8 rather than pycodestyle (pydata#3010)
  Pandas labels deprecation (pydata#3016)
  Pytest capture uses match, not message (pydata#3011)
  dask-dev tests to allowed failures in travis (pydata#3014)
  Fix 'to_masked_array' computing dask arrays twice (pydata#3006)
  str accessor (pydata#2991)
  fix safe_cast_to_index (pydata#3001)
  ...
… fix/bounds_encode_2

* 'fix/bounds_encode_2' of github.com:dcherian/xarray:
@dcherian dcherian merged commit 76adf13 into pydata:master Jun 25, 2019
@dcherian dcherian deleted the fix/bounds_encode_2 branch June 25, 2019 00:24
st-bender added a commit to st-bender/xarray that referenced this pull request Apr 10, 2024
Issue pydata#2921 is about mismatching time units between a time variable and
its "bounds" companion.
However, pydata#2965 does more than fixing pydata#2921, it removes all double
attributes from "bounds" variables which has the undesired side effect
that there is currently no way to save them to netcdf with xarray.
Since the mentioned link is a recommendation and not a hard requirement
for CF compliance, these attributes should be left to the caller to
prepare the dataset variables appropriately if required.
Reduces the amount of surprise that attributes are not written to disk
and fixes pydata#8368.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units
5 participants