Add zip_subtrees for paired iteration over DataTrees #9623
This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

I have also added a minimal implementation of breadth-first search with an explicit queue, replacing the current recursion-based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

In [1]: import xarray as xr

In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

In [3]: %timeit _ = list(tree.subtree)
# on main
87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# with this branch
55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In contrast to `equals`, `identical` now also checks that any inherited variables are inherited on both objects. However, they do not need to be inherited from the same source. This aligns the behavior of `identical` with the DataTree `__repr__`. I've also removed the `from_root` argument from `equals` and `identical`. If a user wants to compare trees from their roots, a better (simpler) interface is to call these methods on the `.root` properties. I would also like to remove the `strict_names` argument, but that will require switching to use the new `zip_subtrees` (pydata#9623) first.
I made a pass at re-implementing
Only some minor typing remarks, the rest looks good!
        assert result == expected

    def test_different_order(self):
        first: NamedNode = NamedNode(
Why are these additional type hints required?
Can mypy not resolve it?
I think this should be fixed by changing the class definition to NamedNode(TreeNode[Tree])
mypy does seem to struggle here -- it raises an error about missing type annotations.
I have not been able to precisely reproduce the mypy setup from CI on my local machine, so I'm going to save this existing issue for someone else to look into.
The only issue I can see is the one I mentioned before: the generic type is not passed to the parent class. But of course, we can leave this open for now.
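For readers following the typing discussion, here is a minimal sketch of what "passing the generic type to the parent class" looks like. The classes below are hypothetical stand-ins that only borrow the names TreeNode, NamedNode and Tree from this thread; they are not xarray's definitions, and the sketch makes no claim about what mypy reports in CI.

```python
from typing import Dict, Generic, Optional, TypeVar

# Hypothetical stand-ins, not xarray's real classes.
Tree = TypeVar("Tree", bound="TreeNode")


class TreeNode(Generic[Tree]):
    """Base node whose children all share the type parameter Tree."""

    def __init__(self, children: Optional[Dict[str, Tree]] = None) -> None:
        self.children: Dict[str, Tree] = dict(children or {})


class NamedNode(TreeNode[Tree]):
    """Subclass that forwards the type parameter to its generic parent."""

    def __init__(
        self,
        name: Optional[str] = None,
        children: Optional[Dict[str, Tree]] = None,
    ) -> None:
        super().__init__(children)
        self.name = name


# With the parameter forwarded, the intent is that a type checker can work
# out the node type here without an explicit annotation on `first`.
first = NamedNode("a", children={"b": NamedNode("b")})
child_name = first.children["b"].name
```

Whether this removes the need for the explicit `first: NamedNode = ...` annotations in the tests would still have to be checked against the project's mypy configuration.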
        # iteration early
        yield active_nodes

        first_node = active_nodes[0]
Note that, theoretically, it is possible to pass no arguments to this function. Then trees, and here active_nodes, is an empty tuple.
Maybe add something like this at the start:
    if len(trees) < 2:
        yield trees
        return
Good point, I added an error to catch this case.
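As an illustration of this kind of guard, here is a small sketch. The function name zip_nodes and the error message are assumptions made for the example, not the code merged in this PR. One thing worth knowing: because the function is a generator, the check only runs when iteration starts, not when the function is called.

```python
def zip_nodes(*trees):
    """Hypothetical generator that rejects an empty argument list."""
    if not trees:
        raise TypeError("must pass at least one tree object")
    yield trees


gen = zip_nodes()  # no error yet: the generator body has not started running

try:
    next(gen)  # the guard executes on the first step of iteration
except TypeError as err:
    message = str(err)
```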
"so the result does not depend on the order in which child nodes are defined."
I'm not sure I understand this. Surely zip_subtrees is zipping nodes according to the order they appear in .children, which is the order they are defined?
        # https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode
        queue = collections.deque([self])
        while queue:
            node = queue.popleft()
            yield node
            queue.extend(node.children.values())
Replacing the entire iterators.py file with 6 lines is so clever it's almost rude 🤣
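To see the traversal quoted above run on its own, here is a self-contained sketch. The Node class is a hypothetical stand-in with only a name and a children dict; it is not xarray's TreeNode, and only the queue loop mirrors the hunk in this review.

```python
import collections
from typing import Dict, Iterator, Optional


class Node:
    """Hypothetical minimal tree node used only for this illustration."""

    def __init__(self, name: str, children: Optional[Dict[str, "Node"]] = None):
        self.name = name
        # dicts preserve insertion order, so children iterate as defined
        self.children: Dict[str, "Node"] = dict(children or {})

    @property
    def subtree(self) -> Iterator["Node"]:
        """Yield this node and all descendants in breadth-first order."""
        queue = collections.deque([self])
        while queue:
            node = queue.popleft()  # take the oldest pending node
            yield node
            queue.extend(node.children.values())  # visit its children later


tree = Node("root", {"a": Node("a", {"c": Node("c")}), "b": Node("b")})
names = [node.name for node in tree.subtree]
print(names)  # ['root', 'a', 'b', 'c']
```

Both children of the root come out before the grandchild "c", which is what distinguishes breadth-first from depth-first order.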
Let me restate this: zip_subtrees allows for zipping together multiple DataTree objects even if child nodes on different trees are defined in different orders, as long as the sets of each node's children match.
And when you say "match" you mean the set of names of the children on tree A match the set of names of the children on tree B?
Exactly, matching is based on relative path from each root.
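The matching rule described in this exchange can be sketched as follows. This is an illustration under assumptions, not the merged implementation: Node, zip_subtrees_sketch and the error messages are hypothetical, and the sketch only takes from the thread the ideas of a breadth-first queue, pairing children by name, and yielding before checking so a caller can stop early.

```python
import collections
from typing import Dict, Iterator, Optional, Tuple


class Node:
    """Hypothetical minimal node; not xarray's DataTree."""

    def __init__(self, name: str, children: Optional[Dict[str, "Node"]] = None):
        self.name = name
        self.children: Dict[str, "Node"] = dict(children or {})


def zip_subtrees_sketch(*trees: Node) -> Iterator[Tuple[Node, ...]]:
    """Yield tuples of nodes that sit at the same relative path in each tree."""
    if not trees:
        raise TypeError("must pass at least one tree")
    queue = collections.deque([trees])
    while queue:
        active_nodes = queue.popleft()
        # yield first, so a caller that stops here never sees a later mismatch
        yield active_nodes
        first_node = active_nodes[0]
        # dict key views compare like sets, so definition order is ignored
        if any(n.children.keys() != first_node.children.keys() for n in active_nodes[1:]):
            raise ValueError("children at this node do not match")
        # pair children by name; traversal order follows the first tree
        for child_name in first_node.children:
            queue.append(tuple(n.children[child_name] for n in active_nodes))


a = Node("a", {"x": Node("a/x"), "y": Node("a/y")})
b = Node("b", {"y": Node("b/y"), "x": Node("b/x")})  # same names, other order
pairs = [(m.name, n.name) for m, n in zip_subtrees_sketch(a, b)]
print(pairs)  # [('a', 'b'), ('a/x', 'b/x'), ('a/y', 'b/y')]
```

A plain zip over the two children dicts would instead pair "a/x" with "b/y", which is the order dependence this PR sets out to remove.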
So I think this will basically create a breaking change relative to how

    dt1 = DataTree.from_dict({'a': ..., 'a/b': ..., 'a/c': ...})
    dt2 = DataTree.from_dict({'e': ..., 'e/f': ..., 'e/g': ...})
    dt1 * dt2  # would return a tree with names following dt1

I was never really sure if that generality was actually necessary though. Your new definition of

Therefore if we're relaxing this then I think there is no longer any need to think of the data model of datatree as being an ordered set of children, as no mapping behaviour will depend on that order any longer.

I think it's okay to change this behaviour, including getting rid of the

The last sentence of this section of the docs will need changing, and the change would merit an entry in the migration guide.
Now that I think I understand the behaviour change here, the implementation is great! I really like the idea of zip_subtrees as a primitive.
I'll do this in the next PR when I migrate over the