Name collisions between Dataset variables and child tree nodes #38

TomNicholas · 2021-09-03T18:58:24Z

I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__ becomes ambiguous.

See this example:

In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))

In [4]: child = DataNode('a', data=None, parent=dt)

In [5]: print(dt)
DataNode('root')
│   Dimensions:  (a: 1)
│   Coordinates:
│     * a        (a) int64 0
│   Data variables:
│       b        int64 1
└── DataNode('a')

In [6]: dt['a']
Out[6]: 
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
  * a        (a) int64 0

In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[],data=None)

Here print(dt) shows that dt is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netcdf).

Furthermore, when choosing an item via DataTree.__getitem__ it merrily picks out the DataArray even though this is an ambiguous situation and I might have intended to pick out the child node 'a' instead.

The node is still accessible via .get_node, but only because .get_node is inherited from TreeNode, which has no concept of data variables.

Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:

In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)

~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
    162         # This approach was inspired by xarray.Dataset._construct_direct()
    163         obj = object.__new__(cls)
--> 164         obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
    165         obj.ds = data
    166         return obj

~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
     13     obj.name = name
     14 
---> 15     obj.parent = parent
     16     if children:
     17         obj.children = children

~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
    133             self.__check_loop(value)
    134             self.__detach(parent)
--> 135             self.__attach(value)
    136 
    137     def __check_loop(self, node):

~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
    157     def __attach(self, parent):
    158         if parent is not None:
--> 159             self._pre_attach(parent)
    160             parentchildren = parent.__children_or_empty
    161             assert not any(child is self for child in parentchildren), "Tree is corrupt."  # pragma: no cover

~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
     84         """
     85         if self.name in list(c.name for c in parent.children):
---> 86             raise KeyError(
     87                 f"parent {parent.name} already has a child named {self.name}"
     88             )

KeyError: 'parent root already has a child named a'

To prevent this we need better checks on assignment between variables and children. For example TreeNode.set_node(key, new_child) currently checks for any existing children with name key, but it also needs to check for any variables in the dataset with name key. (That's not too hard to implement, it could be done by overloading set_node on DataTree to check against variables as well as children, for example.)

What is more difficult is if a child with name key exists, but the user tries to assign a variable with name key to the wrapped dataset. If the user does this via node.ds.assign(key=new_da) then that's manageable - in that case assign() has a return value, which they need to assign to the node via node.ds = node.ds.assign(key=new_da). We could check for name conflicts with children in the .ds property setter method.

However if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without it's wrapping DataTree class ever having a chance to intervene. A similar issue with node[key] = new_da is preventable by improving checking in DataTree.__setitem__, but I don't know how we can prevent this happening when all that is being called is Dataset.__setitem__.

I don't really know what to do about this, other than have a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?

The text was updated successfully, but these errors were encountered:

TomNicholas · 2021-09-04T03:21:49Z

#40 fixes 2/3 of these possible name collisions via better checks, but the last one I still don't know how to fix:

However if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without it's wrapping DataTree class ever having a chance to intervene.

TomNicholas · 2021-09-15T16:26:36Z

@shoyer here is a short code example to demonstrate the problem, should work with most recent version of datatree (and xarray):

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: from datatree import DataNode

In [4]: dt = DataNode('root', data=xr.Dataset(), children=[DataNode('group')])

In [5]: print(dt)
DataNode('root')
│   Dimensions:  ()
│   Data variables:
│       *empty*
└── DataNode('group')

Now we are going to do the modification that I want to prevent

In [6]: dt.ds['group'] = np.array(0)

In [7]: print(dt)
DataNode('root')
│   Dimensions:  ()
│   Data variables:
│       group    int64 0
└── DataNode('group')

In [8]: dt['group']
Out[8]: 
<xarray.DataArray 'group' ()>
array(0)

The problem is that at In [7]: we have an ambiguous situation, with a name collision between the data variable called 'group' and the child node called 'group'. In In [8]: shows that currently the ambiguity resolves to the data variable.

TomNicholas added bug Something isn't working help wanted Extra attention is needed labels Sep 3, 2021

TomNicholas mentioned this issue Sep 4, 2021

Name collisions #40

Merged

TomNicholas added the design question label Sep 5, 2021

jhamman mentioned this issue Sep 21, 2021

Store variables in DataTree instead of storing Dataset #2

Closed

TomNicholas mentioned this issue Sep 22, 2021

Integrate variables into DataTree #41

Merged

5 tasks

TomNicholas mentioned this issue Nov 9, 2021

[Experimental] Refactor Dataset to store variables in a manifest pydata/xarray#5961

Closed

4 tasks

TomNicholas mentioned this issue Dec 19, 2021

Type protocol for internal variable mapping pydata/xarray#6086

Closed

3 tasks

TomNicholas mentioned this issue Apr 28, 2022

Proposed API and design for .ds data access #80

Closed

TomNicholas mentioned this issue May 24, 2022

Replace .ds with immutable DatasetView #99

Merged

5 tasks

TomNicholas closed this as completed in #41 Jun 16, 2022

shaprann mentioned this issue Jan 26, 2023

Why are tree nodes and DataArrays treated equally by __getitem__() and __setitem__()? #211

Closed

TomNicholas mentioned this issue Jan 30, 2024

Tree linkage + dataset in one object nonideal for UI tree view #303

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Name collisions between Dataset variables and child tree nodes #38

Name collisions between Dataset variables and child tree nodes #38

TomNicholas commented Sep 3, 2021 •

edited

Loading

TomNicholas commented Sep 4, 2021

TomNicholas commented Sep 15, 2021

Name collisions between Dataset variables and child tree nodes #38

Name collisions between Dataset variables and child tree nodes #38

Comments

TomNicholas commented Sep 3, 2021 • edited Loading

TomNicholas commented Sep 4, 2021

TomNicholas commented Sep 15, 2021

TomNicholas commented Sep 3, 2021 •

edited

Loading