Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataTree.assign_coords and similar should only set node coords #9472

Open
TomNicholas opened this issue Sep 10, 2024 · 9 comments
Open

DataTree.assign_coords and similar should only set node coords #9472

TomNicholas opened this issue Sep 10, 2024 · 9 comments
Labels
API design bug topic-DataTree Related to the implementation of a DataTree class

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Sep 10, 2024

What is your issue?

Currently DataTree.assign_coords tries to call Dataset.assign_coords on every node in the subtree. This was always a little weird, but now that we have implemented coordinate inheritance (#9077) it's really silly, because we only need to assign to the root node and the coordinates will be accessible on every child.

There's also some kind of bug right now too:

In [6]: ds = Dataset(
   ...:     data_vars={
   ...:         "foo": (["x", "y"], np.random.randn(2, 3)),
   ...:     },
   ...:     coords={
   ...:         "x": ("x", np.array([-1, -2], "int64")),
   ...:         "y": ("y", np.array([0, 1, 2], "int64")),
   ...:         "a": ("x", np.array([4, 5], "int64")),
   ...:         "b": np.int64(-10),
   ...:     },
   ...: )
   ...: dt = DataTree(data=ds)
   ...: dt["child"] = DataTree()

In [7]: child = dt["child"]

In [8]: child.assign_coords({'c': 11})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 child.assign_coords({'c': 11})

File ~/Documents/Work/Code/xarray/xarray/core/datatree_mapping.py:193, in map_over_subtree.<locals>._map_over_subtree(*args, **kwargs)
    190     out_data_objects[node_of_first_tree.path] = results
    192 # Find out how many return values we received
--> 193 num_return_values = _check_all_return_values(out_data_objects)
    195 # Reconstruct 1+ subtrees from the dict of results, by filling in all nodes of all result trees
    196 original_root_path = first_tree.path

File ~/Documents/Work/Code/xarray/xarray/core/datatree_mapping.py:282, in _check_all_return_values(returned_objects)
    279 """Walk through all values returned by mapping func over subtrees, raising on any invalid or inconsistent types."""
    281 if all(r is None for r in returned_objects.values()):
--> 282     raise TypeError(
    283         "Called supplied function on all nodes but found a return value of None for"
    284         "all of them."
    285     )
    287 result_data_objects = [
    288     (path_to_node, r)
    289     for path_to_node, r in returned_objects.items()
    290     if r is not None
    291 ]
    293 if len(result_data_objects) == 1:
    294     # Only one node in the tree: no need to check consistency of results between nodes

TypeError: Called supplied function on all nodes but found a return value of None forall of them.

We should change assign_coords (and possibly any similar functions like set_coords to just operate on one node, and not try to map over the subtree.

This would also help make the coordinate-related behaviour of DataTree more similar to that of Dataset, which relates to #9203 (comment).

cc @shoyer

@TomNicholas TomNicholas added bug topic-DataTree Related to the implementation of a DataTree class API design labels Sep 10, 2024
@TomNicholas
Copy link
Member Author

What should .reset_coords do if you ask it to reset an inherited coordinate? Should it reset it to be a data variable on the (potentially distant) parent object or assign it to the current node (where it did not technically exist before)?

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

What should .reset_coords do if you ask it to reset an inherited coordinate? Should it reset it to be a data variable on the (potentially distant) parent object or assign it to the current node (where it did not technically exist before)?

Probably best to error in such cases rather than making an arbitrary/unpredictable choice!

@dcherian
Copy link
Contributor

now that we have implemented coordinate inheritance (#9077)

assign_coords can take a callable though, so it can yield a different result at each node. (I have used this before).

When I used datatree, I generally did want to apply a number of operations at each node and felt that it was too much typing to always write out .map_over_subtree(lambda....).

Perhaps we should add a new namespace for operations that map over subtrees datatree.subtrees.assign_coords(..., min_depth=2, max_depth=4) and min_depth and max_depth control the shallowest and deepest levels at which to apply the operation.

(it's been a while so I might have the terminology mixed up. apologies if that is the case)

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 10, 2024

assign_coords can take a callable though, so it can yield a different result at each node. (I have used this before).

That's quite a niche use of that method... But I think it is an example of a more general problem.

felt that it was too much typing to always write out .map_over_subtree(lambda...).

Yeah I agree that is annoying.

Perhaps we should add a new namespace for operations that map over subtrees

This would be a reasonable solution to the fundamental ambiguity of whether or not a datatree method acts locally or over the whole subtree. It would also make the inheritance structure of DataTree a lot simpler, as all this mapping would occur on accessor methods rather than inherited methods - I probably would no longer need this weird pattern.

EDIT: It would mean that dt.<method> would be basically the same as dt.dataset.<method> except that the former returned DataTree and the latter returned Dataset.

(it's been a while so I might have the terminology mixed up. apologies if that is the case)

The mapping namespace would need to be disambiguated from the existing .subtree property though (which returns a list of all the descendant nodes). It would be nice to use .map, but that already has a meaning on Dataset: "map over variables in the dataset".

min_depth and max_depth control the shallowest and deepest levels at which to apply the operation.

I'm not a huge fan of adding these parameters to every single method - I would prefer to either use a syntax like

dt.subtrees(max_depth=...).assign_coords(...)

or just to leave these out for now.

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 10, 2024

Perhaps we should add a new namespace for operations that map over subtrees

Okay concrete suggestion:

dt.subtree returns Iterator[DataTree] as it does now.

Methods on the DataTree object act only on that node's data, so that

dt2 = dt.assign_coords()

and

dt2 = dt.copy()
dt2.ds = dt.ds.assign_coords()

leave dt2 and dt in the same state (i.e. as DataTree with new coordinates assigned only to one node).

Methods on DataTree which map over nodes are accessible via .map_over_subtree:

dt.map_over_subtree.assign_coords()

which could maybe be extended to accept arguments:

dt.map_over_subtree(max_depth=3).assign_coords()

This is incompatible with the current behaviour of the map_over_subtree method:

dt.map_over_subtree(lambda: ds)

so we add a .pipe method to the namespace to cover that use case instead:

dt.map_over_subtree.pipe(lambda: ds)

The map_over_subtree top-level decorator is separate, and remains unchanged:

@map_over_subtree
def func(ds):
    ...

though it could be extended to accept arguments later if necessary:

@map_over_subtree(max_depth=2)
def func(ds):
    ...

Pros:

  1. Always disambiguates between methods that act on local node data and methods that map over the entire subtree,
  2. Makes it obvious which methods are not implemented as mappings over the tree (i.e. dt.map_over_subtree.groupby currently raises an AttributeError),
  3. Opens the door to DataTree and Dataset sharing a base class with a lot of common methods,
  4. Follows the general method-chaining pattern we have with groupby etc.
  5. Is compatible with the existing .map and .pipe methods also being defined on DataTree.

Cons:

  1. Can be very wordy, especially if method-chaining operations over entire subtrees, e.g. dt.map_over_subtree.assign_coords().map_over_subtree.mean()

What do people think? @dcherian @shoyer

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

I agree that assign_coords should only set node coordinates.

I'm not sure that this should be a general principle for all DataTree operations. Operations like arithmetic and aggregations seem perfectly well-defined acting on all nodes independently.

@TomNicholas
Copy link
Member Author

I agree, but my proposal above is attempting to provide intuitive API to do both types of operation. Otherwise it's unclear to users (including myself!) when and why .assign_coords acts locally but .mean doesn't.

@shoyer
Copy link
Member

shoyer commented Sep 10, 2024

I think of assign and assign_coords as sugar for assignment via __setitem__ and coords.__setitem__.

The rule could be:

  1. Methods that move around or reorganize data (i.e., Dataset-only methods like assign) are defined relative to the entire DataTree
  2. Methods that manipulate data (i.e., DataArray methods like mean or sel) are mapped over DataArray/Dataset entries, without duplicating coordinates.

These are essentially the same rules used by Dataset for wrapping DataArray.

@TomNicholas
Copy link
Member Author

In that case that's mostly what I already had, with the exception of these methods, which presumably should instead act over one node.

https://xarray-datatree.readthedocs.io/en/latest/api.html#datatree-contents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design bug topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

No branches or pull requests

3 participants