Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experimental] Refactor Dataset to store variables in a manifest #5961

Closed

Conversation

TomNicholas
Copy link
Member

This PR is part of an experiment to see how to integrate a DataTree into xarray.

What is does is refactor Dataset to store variables in a DataManifest class, which is also capable of maintaining a ledger of child tree nodes. The point of this is to prevent name collisions between stored variables and child datatree nodes, as first mentioned in https://github.com/TomNicholas/datatree/issues/38 and explained further in https://github.com/TomNicholas/datatree/issues/2.

("Manifest" in the old sense, of a noun meaning "a document giving comprehensive details of a ship and its cargo and other contents")

Comment on lines +1147 to +1149
self._manifest.variables = variables
# TODO if ds._variables properly pointed to ds._manifest.variables we wouldn't need this line
self._variables = self._manifest.variables
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I wanted was for self._variables to always point to self._manifest.variables, such that updating the former by definition updates the latter. But I'm not really sure if that kind of pointer-like behaviour is possible.

Copy link
Member Author

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer What do you think about the way I've extended Dataset by adding a ._manifest attribute?

I could have instead only altered ._variables to point to the manifest. but then I end up with confusing code such as

class Dataset:
    @property
    def variables(self):
        return self._variables.variables

(I was pleasantly surprised by the very small number of places I needed to change the code to accommodate the manifest - the way that practically everything goes through ._replace() makes it pretty easy to keep all the tests passing.)

Comment on lines +1101 to +1112
@classmethod
def _construct_from_manifest(
cls,
manifest,
coord_names,
dims=None,
attrs=None,
indexes=None,
encoding=None,
close=None,
):
"""Creates a Dataset that is forced to be consistent with a DataTree node that shares its manifest."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to use this when someone calls dt.ds - so that the DataTree returns an actual Dataset, but one whose contents are linked to the wrapping tree node.

@shoyer
Copy link
Member

shoyer commented Nov 9, 2021

How about making custom Mapping for use as Dataset._variables directly, which directly is a mapping of dataset variables? You could still be storing the underlying variables in a different way.

@TomNicholas
Copy link
Member Author

TomNicholas commented Nov 9, 2021 via email

@shoyer
Copy link
Member

shoyer commented Nov 9, 2021

From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), _variables would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.

My tentative suggestion would be to use a mixed dictionary with either xarray.Variable or nested dictionaries as entries for the data in DataTree. Then you could make a proxy mapping object that only exposes Variable objects for use in Dataset.

@TomNicholas
Copy link
Member Author

TomNicholas commented Nov 10, 2021

From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), _variables would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.

That sounds nice, and might not require any changes to Dataset at all!

My tentative suggestion would be to use a mixed dictionary with either xarray.Variable or nested dictionaries as entries for the data in DataTree.

I think it's a lot easier to have a dict of DataTree objects rather than a nested dict of data, as then each node just points to its child nodes instead of having a node which knows about all the data in the whole tree (if that's what you meant).

How about making custom Mapping for use as Dataset._variables directly, which directly is a mapping of dataset variables?

So this is my understanding of what you're suggesting - I'm just not sure if it solves all the requirements:

class DataManifest(MutableMapping):
    """
    Acts like a dict of keys to variables, but 
    prevents setting variables to same key as any 
    children
    """
    def __init__(self, variables={}, children={}):
        # check for collisions here
        self._variables = {}
        self._children = {}

    def __getitem__(self, key):
        # only expose the variables so this acts like a normal dict of variables
        return self._variables[key]

    def __setitem__(self, key, var):
        if key in self._children:
            raise KeyError(
              "key already in use to denote a child" 
              "node in wrapping DataTree node"
            )
        self.__dict__[key] = var


class Dataset:
    self._variables = Mapping[Any, Variable]
    # in standard case just use dict of vars as before

    # Use ._construct_direct as the constructor
    # as it allows for setting ._variables directly

    # therefore no changes to Dataset required!


class DataTree:
    def __init__(self, name, data, parent, children):
        self._children
        self._variables
        self._coord_names
        self._dims
        ...
    
    @property
    def ds(self):
        manifest = DataManifest(variables, children)
        return Dataset._from_treenode(
          variables=manifest,
          coord_names=self._coord_names,
          dims=self._dims,
          ...
        )

    @ds.setter
    def ds(self, ds):
        # check for collisions between ds.data_vars and self.children
        ...


----------------

ds = Dataset({'a': 0})
subtree1 = Datatree('group1')
dt = Datatree('root', data=ds, children=[subtree])

wrapped_ds = dt.ds
wrapped_ds['group1'] = 1  # raises KeyError - good!

subtree2 = Datatree('b')
dt.ds['b'] = 2  # this will happily add a variable to the dataset
dt.add_child(subtree2)  # want to ensure this raises a KeyError as it conflicts with the new variable, but with this design I'm not sure if it will...

EDIT: Actually maybe this would work? So long as in DataTree we have

class DataTree:
    self._variables = manifest
    self._children = manifest.children

Then adding a new child node would also update the manifest, meaning that the linked dataset should know about it too...

@TomNicholas
Copy link
Member Author

TomNicholas commented Nov 10, 2021

Update: I tried making a custom mapping class (code in drop-down below), then swapping out ._variables = dict(variables) for ._variables = DataManifest(variables=variables) in the Dataset constructors to see if that would break anything, as a first step towards that kind of integration. (At some point it would be good to be able to automatically run the test_dataset.py tests again for the manifest case.)

It kind of works?

From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects.

It's not quite as simple as this - you need a .copy method (fine), a repr (okay), and there are several places inside Dataset and DataArray that explicitly check that the type of ._variables is a dict.

To get tests to pass I can either relax those type constraints (which leads to >2/3 of test_dataset.py passing immediately) or maybe try making DataManifest inherit from dict so that it passes isinstance(ds._variables, dict)? This probably deserves a new PR...

(EDIT: Though maybe inheriting from dict is more trouble than it's worth)

Code for custom mapping class
from collections.abc import MutableMapping
from typing import Dict, Hashable, Mapping, Iterator, Sequence

from xarray.core.variable import Variable
#from xarray.tree.datatree import DataTree


class DataTree:
    """Purely for type hinting purposes for now (and to avoid a circular import)"""
    ...


class DataManifest(MutableMapping):
    """
    Stores variables like a dict, but also stores children alongside in a hidden manner, to check against.

    Acts like a dict of keys to variables, but prevents setting variables to same key as any children. It prevents name
    collisions by acting as a common record of stored items for both the DataTree instance and its wrapped Dataset instance.
    """

    def __init__(
        self,
        variables: Dict[Hashable, Variable] = {},
        children: Dict[Hashable, DataTree] = {},
    ):
        if variables and children:
            keys_in_both = set(variables.keys()) & set(children.keys())
            if keys_in_both:
                raise KeyError(
                    f"The keys {keys_in_both} exist in both the variables and child nodes"
                )

        self._variables = variables
        self._children = children

    @property
    def children(self) -> Dict[Hashable, DataTree]:
        """Stores list of the node's children"""
        return self._children

    @children.setter
    def children(self, children: Dict[Hashable, DataTree]):
        for key, child in children.items():
            if key in self.keys():
                raise KeyError("Cannot add child under key {key} because a variable is already stored under that key")

            if not isinstance(child, DataTree):
                raise TypeError

        self._children = children

    def __getitem__(self, key: Hashable) -> Variable:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return self._variables[key]

    def __setitem__(self, key: Hashable, value: Variable):
        """Allow adding new variables, but first check if they conflict with children"""

        if key in self._children:
            raise KeyError(
                f"key {key} already in use to denote a child"
                "node in wrapping DataTree node"
            )

        if isinstance(value, Variable):
            self._variables[key] = value
        else:
            raise TypeError(f"Cannot store object of type {type(value)}")

    def __delitem__(self, key: Hashable):
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        if key in self._variables:
            del self._variables[key]
        elif key in self.children:
            # TODO might be better not to del children here?
            del self._children[key]
        else:
            raise KeyError(f"Cannot remove item because nothing is stored under {key}")

    def __contains__(self, item: object) -> bool:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return item in self._variables

    def __iter__(self) -> Iterator:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return iter(self._variables)

    def __len__(self) -> int:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return len(self._variables)

    def copy(self) -> "DataManifest":
        """Required for consistency with dict"""
        return DataManifest(variables=self._variables.copy(), children=self._children.copy())

    # TODO __repr__

@TomNicholas
Copy link
Member Author

Question: Does this change to ._variables need to propagate down to DatasetCoordinates and DataArrayCoordinates? I'm not sure what the intended behaviour is if the user alters .coords directly.

  1. It seems it is possible to alter a ds via ds.coords[new_coord_name] = new_coord:
In [34]: ds = xr.Dataset({'a': 0})

In [35]: ds.coords['c'] = 2

In [36]: ds
Out[36]: 
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    c        int64 2
Data variables:
    a        int64 0

(That's a bit weird given that the docstring of DatasetCoordinates describes it as an "immutable dictionary" 😕)

  1. It also seems it is possible to similarly alter a da via da.coords[new_coord_name] = new_coord:
In [30]: da = xr.DataArray(0)

In [31]: da.coords['c'] = 1

In [32]: da
Out[32]: 
<xarray.DataArray ()>
array(0)
Coordinates:
    c        int64 1
  1. However is it meant to be possible to alter a ds via ds[var].coords[new_coord_name] = new_coord? Because that currently silently fails to update:
In [37]: ds = xr.Dataset({'a': 0})

In [38]: ds['a'].coords['c'] = 2

In [39]: ds
Out[39]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 0

In [40]: ds['a']
Out[40]: 
<xarray.DataArray 'a' ()>
array(0)

In [41]: ds['a'].coords
Out[41]: 
Coordinates:
    *empty*

Bizarrely this does change though:

In [42]: coords = ds['a'].coords

In [43]: coords['c'] = 2

In [44]: coords
Out[44]: 
Coordinates:
    c        int64 2

If altering .coords is intended behaviour though that means that my DataManifest also has to be accessible from DataArrayCoordinates too, so that the wrapping DataTree node can know about any changes to dt.ds[var].coords.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Dec 18, 2021
@TomNicholas
Copy link
Member Author

Closing for same reasons as in #6086 (comment)

@TomNicholas TomNicholas closed this Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class topic-internals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants