Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate datatreee assertions/extensions/formatting #8967

Merged

Conversation

owenlittlejohns
Copy link
Contributor

This PR continues the overall work of migrating DataTree into xarray.

  • xarray/core/datatree_render.py is the renamed version of xarray/datatree_/datatree/render.py.
  • xarray/core/extensions.py now contains functionality from xarray/datatree_/datatree/extensions.py.
  • xarray/core/formatting.py now contains functionality from xarray/datatree_/datatree/formatting.py.
  • xarray/tests/test_datatree.py now contains tests from xarray/datatree_/datatree/tests/test_dataset_api.py.
  • xarray/testing/assertions.py now contains functionality from /xarray/datatree_/datatree/testing.py.

I had also meant to get to common.py and what's left of io.py, but I've got a hefty couple of days of meetings ahead, so I wanted to get this progress into PR before that happens. @flamingbear or I can follow up with the remaining things in a separate PR. (Also this PR is already getting a little big, so maybe it's already got enough in it)

  • Contributes to migration step for miscellaneous modules in Track merging datatree into xarray #8572
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@@ -71,37 +71,6 @@ def check_isomorphic(
raise TreeIsomorphismError("DataTree objects are not isomorphic:\n" + diff)


def diff_treestructure(a: DataTree, b: DataTree, require_names_equal: bool) -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this into xarray/core/formatting.py to avoid a circular dependency issue.

`node`
:any:`NodeMixin` object.
It is up to the user to assemble these parts to a whole.
>>> from xarray import Dataset
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples in this documentation string are a bit shorter than the originals from anytree. That's because using node.children.values() gets a ValuesView which isn't compatible with iterables like reversed that alter the order of the items in the iterable.

@@ -1,6 +1,6 @@
import textwrap

from xarray import Dataset
from xarray.core.dataset import Dataset
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing another circular dependency issue. @flamingbear - just FYI, for when you are tweaking ops.py.

@@ -1,3 +1,4 @@
# TODO: Add assert_isomorphic when making DataTree API public
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed we didn't want to surface assert_isomorphic until everything else was public. Does that sound good to you @TomNicholas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there was an issue collecting things we need to do to put a final bow on things, but I'm not finding it. Should we add it to #8572? or is that overkill?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find a dedicated issue for that either. Yes lets' just make an explicit list under Expose datatree API publicly. on #8572 (I'll do that now)

@@ -81,12 +134,19 @@ def assert_equal(a, b):
assert a.equals(b), formatting.diff_dataset_repr(a, b, "equals")
elif isinstance(a, Coordinates):
assert a.equals(b), formatting.diff_coords_repr(a, b, "equals")
elif isinstance(a, DataTree):
from_root = kwargs.get("from_root", True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would people be more comfortable if I was doing some type checking here for from_root being a bool?

@@ -5,9 +5,14 @@
import pytest

import xarray as xr

# TODO: Remove imports in favour of xr.DataTree etc, once part of public API
Copy link
Contributor Author

@owenlittlejohns owenlittlejohns Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also hoping that once we can use xr.DataTree that all the type annotations that mypy has insisted on in the tests might be able to be removed (e.g., dt: DataTree = DataTree())

assert_identical(elders, expected)


class TestDSMethodInheritance:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This felt like a good place for adding these tests. But I could move them to another module if that's a preference (just probably not under the name test_dataset_api.py, which might be confusing outside of a specifically datatree oriented context).

"""
Two DataTrees are considered isomorphic if every node has the same number of children.

Nothing about the data in each node is checked.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Nothing about the data in each node is checked.
Nothing about the data or attrs in each node is checked.

Comment on lines 106 to 109
Arrays with NaN in the same location are considered equal. DataTrees are
equal if they have isomorphic node structures, with matching node names,
and if they have matching variables and coordinates, all of which are
equal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition here of equal/allclose/identical is "isomorphic, and also every set of corresponding nodes in the trees are equal/allclose/identical". It would be good to reword this to indicate that assert_equal for trees is essentially just assert_equal for Dataset but mapped over the trees.

Comment on lines 117 to 120
**kwargs : dict
Additional keyword arguments passed on to DataTree.equals. Specifically,
from_root, which defaults True, and will check the whole tree above
the given node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why assert_equal has **kwargs at all - they don't seem to be used?

assert_allclose accepts specific named tolerance kwargs, shouldn't we just add from_root explicitly to all of these functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misunderstood the conversation in our tag-up today, but it sounded like **kwargs was a potentially preferred option to explicitly stating from_root, where from_root is only used by assert_equal or assert_identical when dealing with a DataTree. But happy to explicitly name from_root as an optional argument if that seems better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgetting about from_root for a moment, I just don't understand why xarray's existing assert_equal method already accepts **kwargs then does nothing with them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow here - I added **kwargs to both assert_equal and assert_identical in this PR. Then inside the condition for isinstance(a, DataTree) I try to grab from_root from those kwargs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I'm being stupid, ignore the comment about it already being there (it wasn't).

Why are you adding **kwargs though? Did we decide to do that in the meeting? I thought all we decided was that assert_* would need a from_root arg, even if that didn't do anything for non-DataTree inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@keewis keewis Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been suggesting that... the idea was that having it for non-DataTree objects might be confusing, but I guess **kwargs makes tab-completion less useful (but other than that there should not be any difference).

Edit: see the meeting notes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad for somehow missing that suggestion initially.

Still, I feel like hiding a kwarg away is more confusing than just showing it and saying it is only used if two DataTree objects are passed... For example apply_ufunc has arguments that are only used if the input is of a certain type.

Another thing we could do here is use typing.overloads to ensure that passing from_root when a and b are not both DataTrees would fail static type checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I was just thinking about using typing.overloads - I'll have a go at that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



@overload
def assert_equal(a, b): ...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried to specify all the individual overloads (e.g., assert_equal(a: Dataset, b: Dataset), etc). That led to issues with the return values from DataTree.__getitem__, which are: DataTree | DataArray.

Hopefully this hits enough of the spot, though.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Apr 24, 2024
Copy link
Member

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the discussion and the resulting code, looks good to me now. Thanks.

f"Left and right {type(a).__name__} objects are not {_compat_to_str(compat)}"
]

# TODO check root parents?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO check root parents?

This should be covered by the parent assertions having a from_root kwarg.

Comment on lines 600 to 604
def test_print_datatree(self, simple_datatree):
dt = simple_datatree
print(dt)

# TODO work out how to test something complex like this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a placeholder for something better. All this does right now is check an error is raised. But we might not need more tests than the ones already in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

372dcbe

I just pulled this comment, but let me know if you actually wanted me to pull any of the tests, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just pull this test, as we are already testing that the __str__ method doesn't error in other tests.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of very small comments about removing outdated comments / tests, after that I will merge!

@TomNicholas TomNicholas changed the title Das 2067 datatree migration continues Migrate datatreee assertions/extensions/formatting Apr 26, 2024
@TomNicholas TomNicholas enabled auto-merge (squash) April 26, 2024 17:10
@TomNicholas TomNicholas merged commit 6d62719 into pydata:main Apr 26, 2024
30 checks passed
@owenlittlejohns owenlittlejohns deleted the DAS-2067-datatree-migration-continues branch April 26, 2024 17:38
@TomNicholas TomNicholas mentioned this pull request May 2, 2024
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

4 participants