Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Proposed API and design for .ds data access #80

Closed
TomNicholas opened this issue Apr 28, 2022 · 7 comments
Closed

Proposed API and design for .ds data access #80

TomNicholas opened this issue Apr 28, 2022 · 7 comments
Labels
design question enhancement New feature or request
Milestone

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Apr 28, 2022

Having come back to this project for the first time in a while, I want to propose a design for the .ds property that I think will solve a few problems at once.

Accessing data specific to a node via .ds is intuitive for our tree design, and if the .ds object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.

But returning an actual xr.Dataset as we do now causes a couple of problems:

  1. Modifying state with .ds.__setitem__ causes consistency (1) headaches (2),
  2. You can't refer to other variables in different nodes (e.g. .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.

After we refactor DataTree to store Variable objects under ._variables directly instead of storing a Dataset object under ._ds, then .ds will have to reconstruct a dataset from its private attributes.


I propose that rather than constructing a Dataset object we instead construct and return a NodeDataView object, which has mostly the same API as xr.Dataset, but with a couple of key differences:

  1. No mutation allowed (for now)

    Whilst it could be nice if e.g. dt.ds[var] = da actually updated dt, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da instead.

  2. Allow path-like access to DataArray objects stored in other nodes via __getitem__

    One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":

    print(dt)
    
    DataTree(parent=None)
    │   Dimensions:  ()
    │   Coordinates:
    │     * reference_pressure  () 100.0
    ├── DataTree('observations')
    │   └── DataTree('satellite')
    │       Dimensions:  (x: 6, y: 3)
    │       Dimensions without coordinates: x,y
    │       Data variables:
    │           p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
    └── DataTree('simulations')
        ├── DataTree('coarse')
        │   Dimensions:  (x: 2, y: 3)
        │   Coordinates:
        │     * x        (x) int64 10 20
        │   Dimensions without coordinates: y
        │   Data variables:
        │       p        (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
        └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
    

    I have a function which accepts and consumes datasets

    def normalize_pressure(ds):
        ds['p'] = ds['p'] / ds['/reference_pressure']
        return ds

    then map it over the tree

    result_tree = dt.map_over_subtree(normalise_pressure)

    If we allowed path-like access to data in other nodes from .ds then this would work because map_over_subtree applies normalise_pressure to the .ds attribute of every node, and '/reference_pressure' means "look for the variable 'reference_pressure' in the root node of the tree".

    (In this case referring to the reference pressure with ds['../../reference_pressure'] would also have worked.)

    (PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure'] would have worked too, because then __getitem__ would search upwards through the nodes for the first with a variable matching the desired name.)


A simple implementation could then just subclass xr.Dataset:

class NodeDataView(Dataset):
    _wrapping_node = DataTree

    @classmethod
    def _construct_direct(wrapper_node, variables, coord_names, ...) -> NodeDataView:
        ...

    def __setitem__(self, key, val):
        raise("NOT ALLOWED, please use __setitem__ on the wrapping DataTree node, "
              "or use `DataTree.to_dataset()` if you want a mutable dataset")

    def __getitem__(self, key) -> DataArray:
        # calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
        obj = self._wrapping_node._get_item(key)
        if isinstance(obj, DataArray):
            return obj
        else:
            raise ValueError("NodeDataView is only allowed to return variables, not entire tree nodes")

    # all API that doesn't modify state in-place can just be inherited from Dataset
    ...


class DataTree:
    _variables = Mapping[Hashable, Variable]
    _coord_names = set[Hashable]
    ...

    @property
    def ds(self) -> NodeDataView:
        return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)

    def to_dataset(self) -> Dataset:
        return Dataset._construct_direct(self._variables, self._coord_names, ...)

    ...

If we don't like subclassing Dataset then we could cook up something similar using getattr instead.

(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)

@TomNicholas TomNicholas added enhancement New feature or request design question labels Apr 28, 2022
@TomNicholas TomNicholas added this to the 0.1 milestone Apr 28, 2022
@TomNicholas
Copy link
Member Author

To summarize, the syntax proposed above would be:

node manipulation

# create a new subgroup
dt['folder'] = DataTree()  # works

computation

# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean()  # works

# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean()  # works

# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean()  # works

# apply a method to update data in one node in-place 
dt['folder'] = dt['folder'].ds.mean()  # works

variable manipulation

# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0})  # works
dt['folder/var'] = DataArray(0)  # works
dt['folder'].ds['var'] = DataArray(0)  # would be forbidden
dt['folder']['var'] = DataArray(0)  # works

# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]

The overarching question for me is should I:
a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above),
b) restrict all data manipulation to dt.ds, only allowing node manipulation on dt (i.e. complete separation),
c) allow data manipulation on both dt.ds and dt.

@malmans2
Copy link
Member

malmans2 commented May 25, 2022

Hi @TomNicholas,
@aurghs and I had a look together at this, and the API you are proposing (i.e., a) seems the best way to deal with the issues you mentioned above. So 👍 from us.

@shoyer
Copy link

shoyer commented May 25, 2022

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

@TomNicholas
Copy link
Member Author

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

Same, but for method-chaining it's nice to have a short name... I would use .data but that already has a meaning on DataArray, which might be confusing. I could have both, make .ds an alias for .dataset and use .dataset in the examples? .node is a possiblity though...

@agrouaze
Copy link

I understood that the a) option won.
Just to mention that the option c) is definitely the one that would be the best from my point of view.
the "binding" dt<->ds was somehow possible in version 0.0.6 and it allowed to do smooth transition from xarrayDataset manipulation to datatree.Datatree manipulations.

@TomNicholas
Copy link
Member Author

TomNicholas commented Dec 31, 2022

@agrouaze yes currently (a) is what's implemented. dt.ds returns an immutable DatasetView to the contents of dt.

"binding" dt<->ds was somehow possible in version 0.0.6

The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.

it allowed to do smooth transition from xarray.Dataset manipulation to datatree.Datatree manipulations.

I would love to know what exactly it is that you would like to do that you cannot do now with some combination of dt.ds, dt.to_dataset(), and the dataset methods on dt.ds (i.e. dt.ds.mean() currently returns an xarray.Dataset).

@TomNicholas
Copy link
Member Author

I think really this should have been considered closed by #99. @agrouaze if you feel this solution doesn't support your use case then please open a new issue on the main xarray repository!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
design question enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants