Proposed API and design for .ds data access #80
Comments
To summarize, the syntax proposed above would be:

node manipulation
# create a new subgroup
dt['folder'] = DataTree() # works

computation
# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean() # works
# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean() # works
# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean() # works
# apply a method to update data in one node in-place
dt['folder'] = dt['folder'].ds.mean() # works

variable manipulation
# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0}) # works
dt['folder/var'] = DataArray(0) # works
dt['folder'].ds['var'] = DataArray(0) # would be forbidden
dt['folder']['var'] = DataArray(0) # works
# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]

The overarching question for me is should I: |
Hi @TomNicholas, |
I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above). This is a minor nit, but I would prefer a full name like |
I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.
Same, but for method-chaining it's nice to have a short name... I would use |
I understood that the |
@agrouaze yes currently (a) is what's implemented.
The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.
I would love to know what exactly it is that you would like to do that you cannot do now with some combination of |
Having come back to this project for the first time in a while, I want to propose a design for the .ds property that I think will solve a few problems at once.
Accessing data specific to a node via .ds is intuitive for our tree design, and if the .ds object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.
But returning an actual xr.Dataset as we do now causes a couple of problems: mutation through .ds.__setitem__ causes consistency headaches, and .ds cannot refer to variables stored in other nodes (e.g. .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.
After we refactor DataTree to store Variable objects under ._variables directly instead of storing a Dataset object under ._ds, .ds will have to reconstruct a dataset from its private attributes.
I propose that rather than constructing a Dataset object we instead construct and return a NodeDataView object, which has mostly the same API as xr.Dataset, but with a couple of key differences:
No mutation allowed (for now)
Whilst it could be nice if e.g. dt.ds[var] = da actually updated dt, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da instead.
Allow path-like access to DataArray objects stored in other nodes via __getitem__
One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure".
I have a function, normalise_pressure, which accepts and consumes datasets, and I then map it over the tree with map_over_subtree.
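A rough sketch of that workflow, using a toy dict-based tree rather than the real DataTree or xarray API. ToyNode, ToyNodeView and their attributes are hypothetical stand-ins invented for illustration, and plain floats stand in for DataArray objects. Only the path syntax ('/reference_pressure', '../../reference_pressure') and the names normalise_pressure and map_over_subtree come from the proposal. The function here returns new data rather than assigning into ds, on the assumption that the view is read-only.

```python
class ToyNode:
    """Toy stand-in for a tree node: a dict of variables plus named child nodes."""

    def __init__(self, variables=None, children=None):
        self.variables = dict(variables or {})
        self.children = dict(children or {})
        self.parent = None
        for child in self.children.values():
            child.parent = self

    @property
    def root(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    @property
    def ds(self):
        # Hypothetical read-only view of this node's data.
        return ToyNodeView(self)

    def map_over_subtree(self, func):
        # Apply func to the view of every node and build a new tree from the results.
        mapped = {name: child.map_over_subtree(func) for name, child in self.children.items()}
        return ToyNode(func(self.ds), mapped)


class ToyNodeView:
    """Read-only view of one node that also resolves path-like keys."""

    def __init__(self, node):
        self._node = node

    def keys(self):
        return self._node.variables.keys()

    def __getitem__(self, key):
        # A leading "/" starts at the root; otherwise start at this node.
        node = self._node.root if key.startswith("/") else self._node
        *groups, name = [part for part in key.split("/") if part]
        for part in groups:
            if part == "..":
                if node.parent is None:
                    raise KeyError(key)
                node = node.parent
            else:
                node = node.children[part]
        return node.variables[name]


def normalise_pressure(ds):
    # Returns new data instead of assigning into ds, since the view is read-only.
    out = {name: ds[name] for name in ds.keys()}
    if "p" in out:  # nodes without a pressure variable pass through unchanged
        out["p"] = out["p"] / ds["/reference_pressure"]
    return out


dt = ToyNode(
    {"reference_pressure": 100.0},
    {"obs": ToyNode({}, {"site_a": ToyNode({"p": 250.0}), "site_b": ToyNode({"p": 50.0})})},
)
result = dt.map_over_subtree(normalise_pressure)
print(result.children["obs"].children["site_a"].variables["p"])  # 2.5
```

Because ToyNodeView defines no __setitem__, an assignment such as dt.ds['p'] = 1.0 fails in this toy, which mirrors the "no mutation" rule above.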
If we allowed path-like access to data in other nodes from .ds then this would work because map_over_subtree applies normalise_pressure to the .ds attribute of every node, and '/reference_pressure' means "look for the variable 'reference_pressure' in the root node of the tree".
(In this case referring to the reference pressure with ds['../../reference_pressure'] would also have worked.)
(PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure'] would have worked too, because then __getitem__ would search upwards through the nodes for the first with a variable matching the desired name.)
A simple implementation could then just subclass xr.Dataset.
If we don't like subclassing Dataset then we could cook up something similar using getattr instead.
(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)