Re-implement map_over_datasets #2

shoyer · 2024-10-15T15:49:34Z

The main changes:

It is implemented using zip_subtrees, which means it should properly
handle DataTrees where the nodes are defined in a different order.
For simplicity, I removed handling of **kwargs, in order to preserve
some flexibility for adding keyword arugments.
I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.

* test unary op * implement and generate unary ops * test for unary op with inherited coordinates * re-enable arithmetic tests * implementation for binary ops * test ds * dt commutativity * ensure other types defer to DataTree, thus fixing pydata#9365 * test for inplace binary op * pseudocode implementation of inplace binary op, and xfail test * remove some unneeded type: ignore comments * return type should be DataTree * type datatree ops as accepting dataset-compatible types too * use same type hinting hack as Dataset does for __eq__ not being same as Mapping * ignore return type * add some methods to api docs * don't try to import DataTree.astype in API docs * test to check that single-node trees aren't broadcast * return NotImplemented * remove pseudocode for inplace binary ops * map_over_subtree -> map_over_datasets

* sketch of migration guide * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * whatsnew * add date * spell out API changes in more detail * details on backends integration * explain alignment and open_groups * explain coordinate inheritance * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * re-trigger CI * remove bullet about map_over_subtree * Markdown formatting for important warning block Co-authored-by: Matt Savoie <github@flamingbear.com> * Reorder changes in order of importance Co-authored-by: Matt Savoie <github@flamingbear.com> * Clearer wording on setting relationships Co-authored-by: Matt Savoie <github@flamingbear.com> * remove "technically" Co-authored-by: Matt Savoie <github@flamingbear.com> --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Matt Savoie <github@flamingbear.com>

The main changes: - It is implemented using zip_subtrees, which means it should properly handle DataTrees where the nodes are defined in a different order. - For simplicity, I removed handling of `**kwargs`, in order to preserve some flexibility for adding keyword arugments. - I removed automatic skipping of empty nodes, because there are almost assuredly cases where that would make sense. This could be restored with a option keyword arugment.

As mentioned in pydata#2157, the docstring of `Dataset.groupby` does not reflect deprecation of squeeze (as the docstring of `DataArray.groupby` does) and states an incorrect default value.

* Add inherit=False option to DataTree.copy() This PR adds a inherit=False option to DataTree.copy, so users can decide if they want to inherit coordinates from parents or not when creating a subtree. The default behavior is `inherit=True`, which is a breaking change from the current behavior where parent coordinates are dropped (which I believe should be considered a bug). * fix typing * add migration guide note * ignore typing error

* Bug fixes for DataTree indexing and aggregation My implementation of indexing and aggregation was incorrect on child nodes, re-creating the child nodes from the root. There was also another bug when indexing inherited coordinates that meant formerly inherited coordinates were incorrectly dropped from results. * disable broken test

@headtr1ck

as suggested by @headtr1ck in pydata#9628 (comment)

TomNicholas · 2024-10-15T20:58:41Z

I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

TomNicholas

It is implemented using zip_subtrees, which means it should properly
handle DataTrees where the nodes are defined in a different order.

👍

For simplicity, I removed handling of **kwargs, in order to preserve
some flexibility for adding keyword arguments.

So again the idea here is that that is more similar to apply_ufunc?

I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

The issue I was thinking of was xarray-contrib/datatree#262, but actually that would also be handled by pydata#9588.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.

👍 The existing signature was motivated by me wanting to decorate inherited methods, but we don't really need that now. Consistency with apply_ufunc is a good point though. I agree that map_over_datasets(func, *args) might be better - that's also more similar to how the method version DataTree.map_over_datasets currently works, and similar to Dataset.map.

TomNicholas · 2024-10-16T02:37:02Z

xarray/core/datatree_mapping.py

+                    node_dataset_args.insert(i, arg)
+
+            path = (
+                "/"


Are you sure that's intended? I think this might effectively orphan the result (possibly similar to pydata#9626 (review)).

* type hints for datatree ops tests * type hints for datatree aggregations tests * type hints for datatree indexing tests * type hint a lot more tests * more type hints

* Add zip_subtrees for paired iteration over DataTrees This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined. I have also added a minimal implementation of breadth-first-search with an explicit queue the current recursion based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark: In [1]: import xarray as xr In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)}) In [3]: %timeit _ = list(tree.subtree) # on main 87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) # with this branch 55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) * fix pytype error * Tweaks per review

TomNicholas and others added 3 commits October 15, 2024 14:14

shoyer force-pushed the zip_subtree_map branch from 8c224b2 to 739573a Compare October 15, 2024 15:59

shoyer mentioned this pull request Oct 15, 2024

Add zip_subtrees for paired iteration over DataTrees pydata/xarray#9623

Merged

1 task

Sibgatulin and others added 4 commits October 15, 2024 09:05

docs(groupby): mention deprecation of squeeze kwarg (pydata#9625)

c3dabe1

As mentioned in pydata#2157, the docstring of `Dataset.groupby` does not reflect deprecation of squeeze (as the docstring of `DataArray.groupby` does) and states an incorrect default value.

Add missing memo argument to DataTree.__deepcopy__ (pydata#9631)

aafc278

as suggested by @headtr1ck in pydata#9628 (comment)

TomNicholas reviewed Oct 16, 2024

View reviewed changes

TomNicholas and others added 3 commits October 16, 2024 08:42

Type check datatree tests (pydata#9632)

88a95cf

* type hints for datatree ops tests * type hints for datatree aggregations tests * type hints for datatree indexing tests * type hint a lot more tests * more type hints

Merge branch 'main' into zip_subtree_map

bed8cba

shoyer mentioned this pull request Oct 16, 2024

Re-implement map_over_datasets using group_subtrees pydata/xarray#9636

Open

2 tasks

shoyer closed this Oct 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-implement map_over_datasets #2

Re-implement map_over_datasets #2

shoyer commented Oct 15, 2024 •

edited

Loading

TomNicholas commented Oct 15, 2024

TomNicholas left a comment

TomNicholas Oct 16, 2024

Re-implement map_over_datasets #2

Re-implement map_over_datasets #2

Conversation

shoyer commented Oct 15, 2024 • edited Loading

TomNicholas commented Oct 15, 2024

TomNicholas left a comment

Choose a reason for hiding this comment

TomNicholas Oct 16, 2024

Choose a reason for hiding this comment

shoyer commented Oct 15, 2024 •

edited

Loading