Proposed API and design for .ds data access #80

TomNicholas · 2022-04-28T15:30:09Z

Having come back to this project for the first time in a while, I want to propose a design for the .ds property that I think will solve a few problems at once.

Accessing data specific to a node via .ds is intuitive for our tree design, and if the .ds object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.

But returning an actual xr.Dataset as we do now causes a couple of problems:

Modifying state with .ds.__setitem__ causes consistency (1) headaches (2),
You can't refer to other variables in different nodes (e.g. .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.

After we refactor DataTree to store Variable objects under ._variables directly instead of storing a Dataset object under ._ds, then .ds will have to reconstruct a dataset from its private attributes.

I propose that rather than constructing a Dataset object we instead construct and return a NodeDataView object, which has mostly the same API as xr.Dataset, but with a couple of key differences:

No mutation allowed (for now)

Whilst it could be nice if e.g. dt.ds[var] = da actually updated dt, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da instead.

Allow path-like access to DataArray objects stored in other nodes via __getitem__

One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":

print(dt)

DataTree(parent=None)
│   Dimensions:  ()
│   Coordinates:
│     * reference_pressure  () 100.0
├── DataTree('observations')
│   └── DataTree('satellite')
│       Dimensions:  (x: 6, y: 3)
│       Dimensions without coordinates: x,y
│       Data variables:
│           p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
└── DataTree('simulations')
    ├── DataTree('coarse')
    │   Dimensions:  (x: 2, y: 3)
    │   Coordinates:
    │     * x        (x) int64 10 20
    │   Dimensions without coordinates: y
    │   Data variables:
    │       p        (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
    └── DataTree('fine')
        Dimensions:  (x: 6, y: 3)
        Coordinates:
          * x        (x) int64 10 12 14 16 18 20
        Dimensions without coordinates: y
        Data variables:
            p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298

I have a function which accepts and consumes datasets

def normalize_pressure(ds):
    ds['p'] = ds['p'] / ds['/reference_pressure']
    return ds

then map it over the tree

result_tree = dt.map_over_subtree(normalise_pressure)

If we allowed path-like access to data in other nodes from .ds then this would work because map_over_subtree applies normalise_pressure to the .ds attribute of every node, and '/reference_pressure' means "look for the variable 'reference_pressure' in the root node of the tree".

(In this case referring to the reference pressure with ds['../../reference_pressure'] would also have worked.)

(PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure'] would have worked too, because then __getitem__ would search upwards through the nodes for the first with a variable matching the desired name.)

A simple implementation could then just subclass xr.Dataset:

class NodeDataView(Dataset):
    _wrapping_node = DataTree

    @classmethod
    def _construct_direct(wrapper_node, variables, coord_names, ...) -> NodeDataView:
        ...

    def __setitem__(self, key, val):
        raise("NOT ALLOWED, please use __setitem__ on the wrapping DataTree node, "
              "or use `DataTree.to_dataset()` if you want a mutable dataset")

    def __getitem__(self, key) -> DataArray:
        # calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
        obj = self._wrapping_node._get_item(key)
        if isinstance(obj, DataArray):
            return obj
        else:
            raise ValueError("NodeDataView is only allowed to return variables, not entire tree nodes")

    # all API that doesn't modify state in-place can just be inherited from Dataset
    ...


class DataTree:
    _variables = Mapping[Hashable, Variable]
    _coord_names = set[Hashable]
    ...

    @property
    def ds(self) -> NodeDataView:
        return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)

    def to_dataset(self) -> Dataset:
        return Dataset._construct_direct(self._variables, self._coord_names, ...)

    ...

If we don't like subclassing Dataset then we could cook up something similar using getattr instead.

(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)

The text was updated successfully, but these errors were encountered:

TomNicholas · 2022-04-29T20:49:49Z

To summarize, the syntax proposed above would be:

node manipulation

# create a new subgroup
dt['folder'] = DataTree()  # works

computation

# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean()  # works

# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean()  # works

# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean()  # works

# apply a method to update data in one node in-place 
dt['folder'] = dt['folder'].ds.mean()  # works

variable manipulation

# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0})  # works
dt['folder/var'] = DataArray(0)  # works
dt['folder'].ds['var'] = DataArray(0)  # would be forbidden
dt['folder']['var'] = DataArray(0)  # works

# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]

The overarching question for me is should I:
a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above),
b) restrict all data manipulation to dt.ds, only allowing node manipulation on dt (i.e. complete separation),
c) allow data manipulation on both dt.ds and dt.

malmans2 · 2022-05-25T08:59:55Z

Hi @TomNicholas,
@aurghs and I had a look together at this, and the API you are proposing (i.e., a) seems the best way to deal with the issues you mentioned above. So 👍 from us.

shoyer · 2022-05-25T16:17:50Z

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

TomNicholas · 2022-05-25T19:16:49Z

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

Same, but for method-chaining it's nice to have a short name... I would use .data but that already has a meaning on DataArray, which might be confusing. I could have both, make .ds an alias for .dataset and use .dataset in the examples? .node is a possiblity though...

agrouaze · 2022-10-11T11:14:53Z

I understood that the a) option won.
Just to mention that the option c) is definitely the one that would be the best from my point of view.
the "binding" dt<->ds was somehow possible in version 0.0.6 and it allowed to do smooth transition from xarrayDataset manipulation to datatree.Datatree manipulations.

TomNicholas · 2022-12-31T00:41:32Z

@agrouaze yes currently (a) is what's implemented. dt.ds returns an immutable DatasetView to the contents of dt.

"binding" dt<->ds was somehow possible in version 0.0.6

The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.

it allowed to do smooth transition from xarray.Dataset manipulation to datatree.Datatree manipulations.

I would love to know what exactly it is that you would like to do that you cannot do now with some combination of dt.ds, dt.to_dataset(), and the dataset methods on dt.ds (i.e. dt.ds.mean() currently returns an xarray.Dataset).

TomNicholas · 2024-08-13T19:33:20Z

I think really this should have been considered closed by #99. @agrouaze if you feel this solution doesn't support your use case then please open a new issue on the main xarray repository!

TomNicholas added enhancement New feature or request design question labels Apr 28, 2022

TomNicholas added this to the 0.1 milestone Apr 28, 2022

TomNicholas mentioned this issue May 24, 2022

Replace .ds with immutable DatasetView #99

Merged

5 tasks

TomNicholas mentioned this issue May 28, 2022

Out of node access to variables in .ds #102

Closed

5 tasks

shaprann mentioned this issue Jan 31, 2023

Why are tree nodes and DataArrays treated equally by __getitem__() and __setitem__()? #211

Closed

TomNicholas mentioned this issue Apr 27, 2023

Assigning coords not consistent with xarray #242

Closed

TomNicholas mentioned this issue Nov 27, 2023

Use path-like tools with DataTree #281

Closed

TomNicholas closed this as completed Aug 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed API and design for .ds data access #80

Proposed API and design for .ds data access #80

TomNicholas commented Apr 28, 2022 •

edited

Loading

TomNicholas commented Apr 29, 2022

malmans2 commented May 25, 2022 •

edited

Loading

shoyer commented May 25, 2022

TomNicholas commented May 25, 2022

agrouaze commented Oct 11, 2022

TomNicholas commented Dec 31, 2022 •

edited

Loading

TomNicholas commented Aug 13, 2024

Proposed API and design for .ds data access #80

Proposed API and design for .ds data access #80

Comments

TomNicholas commented Apr 28, 2022 • edited Loading

TomNicholas commented Apr 29, 2022

node manipulation

computation

variable manipulation

malmans2 commented May 25, 2022 • edited Loading

shoyer commented May 25, 2022

TomNicholas commented May 25, 2022

agrouaze commented Oct 11, 2022

TomNicholas commented Dec 31, 2022 • edited Loading

TomNicholas commented Aug 13, 2024

TomNicholas commented Apr 28, 2022 •

edited

Loading

malmans2 commented May 25, 2022 •

edited

Loading

TomNicholas commented Dec 31, 2022 •

edited

Loading