This repository has been archived by the owner on Oct 24, 2024. It is now read-only.
Name collisions between Dataset variables and child tree nodes #38
TomNicholas added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Sep 3, 2021
#40 fixes 2/3 of these possible name collisions via better checks, but the last one I still don't know how to fix:
@shoyer here is a short code example to demonstrate the problem; it should work with the most recent versions of datatree and xarray:

In [1]: import numpy as np
In [2]: import xarray as xr
In [3]: from datatree import DataNode
In [4]: dt = DataNode('root', data=xr.Dataset(), children=[DataNode('group')])
In [5]: print(dt)
DataNode('root')
│ Dimensions: ()
│ Data variables:
│ *empty*
└── DataNode('group')

Now we are going to do the modification that I want to prevent:

In [6]: dt.ds['group'] = np.array(0)
In [7]: print(dt)
DataNode('root')
│ Dimensions: ()
│ Data variables:
│ group int64 0
└── DataNode('group')
In [8]: dt['group']
Out[8]:
<xarray.DataArray 'group' ()>
array(0)

The problem is that at
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means `__getitem__` becomes ambiguous. See this example:
Here `print(dt)` shows that `dt` is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in netCDF).

Furthermore, when choosing an item via `DataTree.__getitem__` it merrily picks out the DataArray even though this is an ambiguous situation and I might have intended to pick out the child node `'a'` instead.

The node is still accessible via `.get_node`, but only because `.get_node` is inherited from `TreeNode`, which has no concept of data variables.

Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
To prevent this we need better checks on assignment between variables and children. For example, `TreeNode.set_node(key, new_child)` currently checks for any existing children with name `key`, but it also needs to check for any variables in the dataset with name `key`. (That's not too hard to implement; it could be done by overloading `set_node` on `DataTree` to check against variables as well as children, for example.)

What is more difficult is if a child with name `key` exists, but the user tries to assign a variable with name `key` to the wrapped dataset. If the user does this via `node.ds.assign(key=new_da)` then that's manageable: in that case `assign()` has a return value, which they need to assign to the node via `node.ds = node.ds.assign(key=new_da)`. We could check for name conflicts with children in the `.ds` property setter method.

However, if the user adds a variable via `node.ds[key] = new_da` then I think `node.ds` will be updated in-place without its wrapping `DataTree` class ever having a chance to intervene. A similar issue with `node[key] = new_da` is preventable by improving checking in `DataTree.__setitem__`, but I don't know how we can prevent this happening when all that is being called is `Dataset.__setitem__`.

I don't really know what to do about this, other than have a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?
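To make the three cases concrete, here is a minimal toy sketch in plain Python. It is not datatree's code: `ToyTreeNode` and `ToyDataNode` are hypothetical stand-ins, and a plain `dict` plays the role of the wrapped `xarray.Dataset`. It shows that an overloaded `set_node` and a checking `.ds` setter catch two of the collisions, while in-place item assignment on the wrapped object goes straight past the node.

```python
class ToyTreeNode:
    """Hypothetical stand-in for TreeNode: it only knows about children."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def set_node(self, key, new_child):
        if key in self.children:
            raise KeyError(f"child {key!r} already exists")
        self.children[key] = new_child


class ToyDataNode(ToyTreeNode):
    """Hypothetical stand-in for DataTree; a dict stands in for the Dataset."""

    def __init__(self, name, data=None):
        super().__init__(name)
        self._ds = {}
        self.ds = data if data is not None else {}

    def set_node(self, key, new_child):
        # Extra check: a child may not share a name with a variable.
        if key in self._ds:
            raise KeyError(f"variable {key!r} already exists")
        super().set_node(key, new_child)

    @property
    def ds(self):
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Reassigning the whole dataset passes through here, so we can check.
        clashes = set(new_ds) & set(self.children)
        if clashes:
            raise KeyError(f"names clash with children: {sorted(clashes)}")
        self._ds = new_ds


root = ToyDataNode("root")
root.set_node("group", ToyDataNode("group"))

# Analogue of node.ds = node.ds.assign(group=...): the setter intervenes.
try:
    root.ds = {**root.ds, "group": 0}
    caught_by_setter = False
except KeyError:
    caught_by_setter = True

# Analogue of node.ds['group'] = ...: only dict.__setitem__ runs,
# so the node never gets a chance to object.
root.ds["group"] = 0
collision_slipped_through = "group" in root.ds and "group" in root.children
```

In this toy version the last assignment succeeds silently, which is the same shape as the `node.ds[key] = new_da` problem described above: with simple composition, the wrapper only sees operations that are routed through its own methods or properties.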