Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zip_subtrees for paired iteration over DataTrees #9623

Merged
merged 5 commits into from
Oct 16, 2024

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Oct 15, 2024

This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

I have also added a minimal implementation of breadth-first-search with an explicit queue the current recursion based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

In [1]: import xarray as xr

In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

In [3]: %timeit _ = list(tree.subtree)
# on main
87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# with this branch
55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  • Tests added

This should be used for implementing DataTree arithmetic inside
map_over_datasets, so the result does not depend on the order in which
child nodes are defined.

I have also added a minimal implementation of breadth-first-search with
an explicit queue the current recursion based solution in
xarray.core.iterators (which has been removed). The new implementation
is also slightly faster in my microbenchmark:

    In [1]: import xarray as xr

    In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

    In [3]: %timeit _ = list(tree.subtree)
    # on main
    87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    # with this branch
    55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
@shoyer shoyer requested a review from TomNicholas October 15, 2024 06:49
shoyer added a commit to shoyer/xarray that referenced this pull request Oct 15, 2024
In contrast to `equals`, `identical` now also checks that any
inherited variables are inherited on both objects. However, they do
not need to be inherited from the same source. This aligns the
behavior of `identical` with the DataTree `__repr__`.

I've also removed the `from_root` argument from `equals` and `identical`.
If a user wants to compare trees from their roots, a better (simpler)
inference is to simply call these methods on the `.root` properties.
I would also like to remove the `strict_names` argument, but that will
require switching to use the new `zip_subtrees` (pydata#9623) first.
@shoyer
Copy link
Member Author

shoyer commented Oct 15, 2024

I made a pass at re-implementing map_over_datasets using zip_subtree in shoyer#2

Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some minor typing remarks, the rest looks good!

xarray/tests/test_treenode.py Outdated Show resolved Hide resolved
xarray/tests/test_treenode.py Outdated Show resolved Hide resolved
xarray/tests/test_treenode.py Outdated Show resolved Hide resolved
assert result == expected

def test_different_order(self):
first: NamedNode = NamedNode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these additional type hints required?
Can mypy not resolve it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be fixed by changing the class definition to NamedNode(TreeNode[Tree])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy does seem to struggle here -- it raises an error about missing type annotations.

I have not been able to precisely reproduce the mypy setup from CI on my local machine, so I'm going to save this existing issue for someone else to look into.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only see the issue of not passing the generic type to the parent class like I said before. But ofc, we can keep this open by now.

xarray/core/treenode.py Outdated Show resolved Hide resolved
# iteration early
yield active_nodes

first_node = active_nodes[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, that theoretically it is possible to pass no arguments to this function. Then trees and here active_nodes is an empty tuple.

Maybe add something like this at the start:

if len(trees) < 2:
    yield trees
    return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I added an error to catch this case.

@headtr1ck headtr1ck added the topic-DataTree Related to the implementation of a DataTree class label Oct 15, 2024
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the result does not depend on the order in which child nodes are defined.

I'm not sure I understand this. Surely zip_subtrees is zipping nodes according to the order they appear in .children, which is the order they are defined?

Comment on lines +405 to +410
# https://en.wikipedia.org/wiki/Breadth-first_search#Pseudocode
queue = collections.deque([self])
while queue:
node = queue.popleft()
yield node
queue.extend(node.children.values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing the entire iterators.py file with 6 lines is so clever it's almost rude 🤣

@shoyer
Copy link
Member Author

shoyer commented Oct 15, 2024

so the result does not depend on the order in which child nodes are defined.

I'm not sure I understand this. Surely zip_subtrees is zipping nodes according to the order they appear in .children, which is the order they are defined?

Let me restate this: zip_subtrees allows for zipping together multiple DataTree objects even if child nodes on different trees are defined in different orders, as long as the sets of each node's children match.

@TomNicholas
Copy link
Member

And when you say "match" you mean the set of names of the children on tree A match the set of names of the children on tree B?

@shoyer
Copy link
Member Author

shoyer commented Oct 15, 2024

And when you say "match" you mean the set of names of the children on tree A match the set of names of the children on tree B?

Exactly, matching is based on relative path from each root.

@TomNicholas
Copy link
Member

matching is based on relative path from each root.

So I think this will basically create a breaking change relative to how map_over_subtree (and hence all arithmetic) used to work in xarray-contrib/datatree. In the old model the definition of isomorphic was such that you could actually multiply two trees with the same structure but differently-named nodes, e.g.

dt1 = DataTree.from_dict({'a': ..., 'a/b': ..., 'a/c': ...})
dt2 = DataTree.from_dict({'e': ..., 'e/f': ..., 'e/g': ...})

dt1 * dt2  # would return a tree with names following dt1

I was never really sure if that generality was actually necessary though.

Your new definition of isomorphic is sort of like having strict_names=True always, except that the as long as the set of names match then their names will be used to determine their corresponding nodes in the other tree, regardless of order.

Therefore if we're relaxing this then I think there is no longer any need to think of the data model of datatree as being an ordered set of children, as no mapping behaviour will depend on that order any longer.

I think it's okay to change this behaviour, including getting rid of the strict_names kwarg. After all I have never once needed the generality above! I just wanted to really spell out the implications to make sure I and anyone reading is following.

The last sentence of this section of the docs will need changing, and the change would merit an entry in the migration guide.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think I understand the behaviour change here, the implementation is great! I really like the idea of zip_subtrees as a primitive.

@shoyer
Copy link
Member Author

shoyer commented Oct 16, 2024

The last sentence of this section of the docs will need changing, and the change would merit an entry in the migration guide.

I'll do this in the next PR when I migrate over the map_over_datasets stuff.

@shoyer shoyer merged commit 0c1d02e into pydata:main Oct 16, 2024
29 checks passed
shoyer added a commit that referenced this pull request Oct 16, 2024
* Updates to DataTree.equals and DataTree.identical

In contrast to `equals`, `identical` now also checks that any
inherited variables are inherited on both objects. However, they do
not need to be inherited from the same source. This aligns the
behavior of `identical` with the DataTree `__repr__`.

I've also removed the `from_root` argument from `equals` and `identical`.
If a user wants to compare trees from their roots, a better (simpler)
inference is to simply call these methods on the `.root` properties.
I would also like to remove the `strict_names` argument, but that will
require switching to use the new `zip_subtrees` (#9623) first.

* More efficient check for inherited coordinates
dcherian added a commit to TomAugspurger/xarray that referenced this pull request Oct 21, 2024
* main:
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 22, 2024
* main: (63 commits)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants