Rules for propagating attrs and encoding #1614

Open
jhamman opened this issue Oct 9, 2017 · 15 comments
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@jhamman
Member

jhamman commented Oct 9, 2017

We need to come up with some clear rules for when and how xarray should propagate metadata (attrs/encoding). This has come up routinely (e.g. #25, #138, #442, #688, #828, #988, #1009, #1271, #1297, #1586) and we don't have a clear direction as to when to keep/drop metadata.

I'll take a first cut:

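| operation  | attrs      | encoding   | status      |
|------------|------------|------------|-------------|
| reduce     | drop       | drop       |             |
| arithmetic | drop       | drop       | implemented |
| copy       | keep       | keep       |             |
| concat     | keep first | keep first | implemented |
| slice      | keep       | drop       |             |
| where      | keep       | keep       |             |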

cc @shoyer (following up on #1586 (comment))
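
The proposed table can be read as a lookup from operation to a pair of rules. The following is only a sketch in plain Python, not xarray's implementation: the names `PROPOSED_RULES`, `apply_rule` and `propagate` are hypothetical, and treating "keep" on several inputs the same as "keep first" is an assumption.

```python
# Hypothetical encoding of the proposed table:
# operation -> (attrs rule, encoding rule).
PROPOSED_RULES = {
    "reduce": ("drop", "drop"),
    "arithmetic": ("drop", "drop"),
    "copy": ("keep", "keep"),
    "concat": ("keep first", "keep first"),
    "slice": ("keep", "drop"),
    "where": ("keep", "keep"),
}


def apply_rule(rule, inputs):
    """Return the metadata dict a result would carry under one rule."""
    if rule == "drop" or not inputs:
        return {}
    # Assumption: "keep" and "keep first" both copy the first input's metadata.
    return dict(inputs[0])


def propagate(operation, attrs_list, encoding_list):
    """Return (attrs, encoding) for the result of `operation`."""
    attrs_rule, encoding_rule = PROPOSED_RULES[operation]
    return apply_rule(attrs_rule, attrs_list), apply_rule(encoding_rule, encoding_list)


attrs_list = [{"units": "K"}, {"units": "degC"}]
encoding_list = [{"dtype": "int16"}, {"dtype": "float32"}]

concat_result = propagate("concat", attrs_list, encoding_list)
slice_result = propagate("slice", attrs_list[:1], encoding_list[:1])
reduce_result = propagate("reduce", attrs_list[:1], encoding_list[:1])
```

Under these rules, `concat` carries over the first input's attrs and encoding, `slice` keeps attrs but returns an empty encoding, and `reduce` returns both empty.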

@ethan-campbell

I'd also suggest that a global option of always_keep_attrs=True would be useful. While I understand the logic of dropping units during certain operations, it makes attributes unusable for storing other miscellaneous metadata, e.g. on data provenance. As a recent xarray convert, this behavior has been frustrating.

@mraspaud
Contributor

mraspaud commented Feb 2, 2018

This issue is very relevant for me too. I would like to also propose that a user could provide a function that would know how to combine the attrs of different DataArrays.
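
As an illustration of what such a user-provided function might look like, here is a minimal sketch in plain Python. The function name `keep_agreeing_attrs` and the policy (keep only the attributes on which every input agrees) are assumptions made for the example, not an existing xarray hook.

```python
def keep_agreeing_attrs(attrs_list):
    """Combine attrs dicts, keeping only keys whose values agree in every input."""
    if not attrs_list:
        return {}
    first, *rest = attrs_list
    return {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }


a = {"units": "K", "source": "model A"}
b = {"units": "K", "source": "model B"}
combined = keep_agreeing_attrs([a, b])
```

Here `combined` keeps `units`, which both inputs share, and drops `source`, on which they conflict. Other policies (first wins, raise on conflict) would be equally plausible.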

@brey

brey commented Feb 2, 2018

I am also interested. In principle I am OK with the table from @jhamman. However, there could be an option to refer to the original attrs in order to provide provenance even on operations like reduce and arithmetic. The idea here is reproducibility and traceability. Maybe an 'origin' attribute?

@shoyer
Member

shoyer commented Feb 3, 2018

The challenge with a user-specified function is that there can potentially be weird conflicts if multiple libraries try to override it. Possibly it's worth it for the convenience, but subclasses allowing for explicit hooks (like numpy) is probably the cleanest solution.

@SeanDS

SeanDS commented Jun 18, 2018

Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

@SeanDS

SeanDS commented Jun 18, 2018

Also - might I suggest you consider some kind of history tracker as part of the metadata propagation? Perhaps metadata could be saved from each step of a set of operations, so that there is a full paper trail for the set of operations that have been applied to the data. It could, however, get complicated to merge together objects with their own separate histories, especially if they ultimately descend from the same original data set.

This would be very relevant for scientific analyses.
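
To make the idea concrete, here is a rough sketch of a history-carrying wrapper in plain Python. The `Tracked` class and its `apply` method are hypothetical names for illustration only, and the sketch sidesteps the hard part mentioned above, namely merging objects that each have their own history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value together with the names of the operations applied to it."""

    value: float
    history: tuple = ()

    def apply(self, name, func):
        # Return a new object; the original and its history stay untouched.
        return Tracked(func(self.value), self.history + (name,))


start = Tracked(2.0)
result = start.apply("double", lambda v: v * 2).apply("negate", lambda v: -v)
```

After the two steps, `result.history` lists `"double"` then `"negate"`, while `start.history` is still empty.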

@shoyer
Member

shoyer commented Jun 18, 2018

> Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

Are you referring to a different issue? The first post only summarizes some simple proposed rules.

@shoyer
Member

shoyer commented Jun 18, 2018

> Also - might I suggest you consider some kind of history tracker as part of the metadata propagation?

Certainly this would be out of scope for xarray itself, but this could perhaps be done with a library that wraps xarray's API. If I recall correctly, @pwolfram was also interested in this.

We did discuss customizable hooks for attribute handling in #988, but I'm no longer sure that is a good idea. These sorts of overloads are really hard to get right, as we've seen with NumPy's long history of different override protocols (the most recent example being `__array_ufunc__`).

@max-sixty
Collaborator

max-sixty commented Jun 18, 2018

> consider some kind of history tracker as part of the metadata propagation?

Data lineage is a big, hard, unsolved problem (for us internally, above both naming things and cache invalidation :) )

To second @shoyer, I think it's big and difficult enough to be a separate library

@SeanDS

SeanDS commented Jun 18, 2018

> are you referring to a different issue? the first post only summarizes some simple proposed rules.

No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

@shoyer
Member

shoyer commented Jun 18, 2018

> No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

We already have most of this behavior (matching what @jhamman lists in the first comment), though it isn't clearly documented. It should just work if you use xarray methods/functions.

@ethan-campbell

@shoyer, I assume you are referring to the keep_attrs option. Is there a way to persist attrs during arithmetic operations? I find myself writing a bunch of boilerplate to transfer the wealth of metadata included with most netCDF files.

I realize that adding a module-level or DataArray instance-specific maintain_attrs configuration flag (as discussed in #131, #988, #1271) could be problematic, but this strikes me as complexity worth adding. The current approach of dropping all metadata (not just units) seems heavy-handed and unintuitive for new/casual users. As you mentioned in #1271, better to have stale metadata than no metadata at all.

@shoyer
Member

shoyer commented Jun 18, 2018

I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

@gerritholl
Contributor

Another one to decide is xarray.zeros_like(...) and friends.

@shoyer
Member

shoyer commented Nov 3, 2018

> I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

Note that this was implemented by @TomNicholas in #2482
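
For anyone landing here, a short sketch of how the global option can be used, assuming an xarray version that includes it. The array contents and attribute values are made up for the example, and the option is set explicitly in both cases rather than relying on whatever the default is in a given release.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(4.0),
    dims="x",
    attrs={"units": "K", "source": "example"},
)

# With propagation explicitly disabled, a reduction drops attrs.
with xr.set_options(keep_attrs=False):
    dropped = da.mean()

# With the global option enabled, reductions and arithmetic keep attrs.
with xr.set_options(keep_attrs=True):
    kept_mean = da.mean()
    kept_sum = da + 1

# Reductions also accept a per-call keyword.
kept_per_call = da.mean(keep_attrs=True)
```

Note that kept attrs can go stale: `units` is still reported as `"K"` after any arithmetic, whether or not that remains true.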

9 participants