Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Name collisions between Dataset variables and child tree nodes #38

Closed
TomNicholas opened this issue Sep 3, 2021 · 2 comments · Fixed by #41 or #99
Closed

Name collisions between Dataset variables and child tree nodes #38

TomNicholas opened this issue Sep 3, 2021 · 2 comments · Fixed by #41 or #99
Labels
bug Something isn't working design question help wanted Extra attention is needed

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Sep 3, 2021

I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__ becomes ambiguous.


See this example:

In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))

In [4]: child = DataNode('a', data=None, parent=dt)

In [5]: print(dt)
DataNode('root')
│   Dimensions:  (a: 1)
│   Coordinates:
│     * a        (a) int64 0Data variables:
│       b        int64 1
└── DataNode('a')

In [6]: dt['a']
Out[6]: 
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
  * a        (a) int64 0

In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[],data=None)

Here print(dt) shows that dt is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netcdf).

Furthermore, when choosing an item via DataTree.__getitem__ it merrily picks out the DataArray even though this is an ambiguous situation and I might have intended to pick out the child node 'a' instead.

The node is still accessible via .get_node, but only because .get_node is inherited from TreeNode, which has no concept of data variables.

Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:

In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)

~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
    162         # This approach was inspired by xarray.Dataset._construct_direct()
    163         obj = object.__new__(cls)
--> 164         obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
    165         obj.ds = data
    166         return obj

~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
     13     obj.name = name
     14 
---> 15     obj.parent = parent
     16     if children:
     17         obj.children = children

~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
    133             self.__check_loop(value)
    134             self.__detach(parent)
--> 135             self.__attach(value)
    136 
    137     def __check_loop(self, node):

~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
    157     def __attach(self, parent):
    158         if parent is not None:
--> 159             self._pre_attach(parent)
    160             parentchildren = parent.__children_or_empty
    161             assert not any(child is self for child in parentchildren), "Tree is corrupt."  # pragma: no cover

~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
     84         """
     85         if self.name in list(c.name for c in parent.children):
---> 86             raise KeyError(
     87                 f"parent {parent.name} already has a child named {self.name}"
     88             )

KeyError: 'parent root already has a child named a'

To prevent this we need better checks on assignment between variables and children. For example TreeNode.set_node(key, new_child) currently checks for any existing children with name key, but it also needs to check for any variables in the dataset with name key. (That's not too hard to implement, it could be done by overloading set_node on DataTree to check against variables as well as children, for example.)

What is more difficult is if a child with name key exists, but the user tries to assign a variable with name key to the wrapped dataset. If the user does this via node.ds.assign(key=new_da) then that's manageable - in that case assign() has a return value, which they need to assign to the node via node.ds = node.ds.assign(key=new_da). We could check for name conflicts with children in the .ds property setter method.

However if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without it's wrapping DataTree class ever having a chance to intervene. A similar issue with node[key] = new_da is preventable by improving checking in DataTree.__setitem__, but I don't know how we can prevent this happening when all that is being called is Dataset.__setitem__.

I don't really know what to do about this, other than have a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?

@TomNicholas TomNicholas added bug Something isn't working help wanted Extra attention is needed labels Sep 3, 2021
@TomNicholas
Copy link
Member Author

#40 fixes 2/3 of these possible name collisions via better checks, but the last one I still don't know how to fix:

However if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without it's wrapping DataTree class ever having a chance to intervene.

@TomNicholas
Copy link
Member Author

@shoyer here is a short code example to demonstrate the problem, should work with most recent version of datatree (and xarray):

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: from datatree import DataNode

In [4]: dt = DataNode('root', data=xr.Dataset(), children=[DataNode('group')])

In [5]: print(dt)
DataNode('root')
│   Dimensions:  ()
│   Data variables:
│       *empty*
└── DataNode('group')

Now we are going to do the modification that I want to prevent

In [6]: dt.ds['group'] = np.array(0)

In [7]: print(dt)
DataNode('root')
│   Dimensions:  ()
│   Data variables:
│       group    int64 0
└── DataNode('group')

In [8]: dt['group']
Out[8]: 
<xarray.DataArray 'group' ()>
array(0)

The problem is that at In [7]: we have an ambiguous situation, with a name collision between the data variable called 'group' and the child node called 'group'. In In [8]: shows that currently the ambiguity resolves to the data variable.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working design question help wanted Extra attention is needed
Projects
None yet
1 participant