Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DataTree.coords.__setitem__ by adding DataTreeCoordinates class #9451

Merged
merged 50 commits into from
Sep 11, 2024

Conversation

TomNicholas
Copy link
Member

Doesn't work yet - pushing to get feedback on the approach.

cc @shoyer

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Sep 8, 2024
Comment on lines 812 to 813
# TODO should inherited coordinates be here? It would be very hard to allow updating them...
# But actually maybe the ChainMap approach would make this work okay??
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see an easy way to make assignment of coordinates to parent nodes work. I think it would be better just to have dt.coords constructed from inherited coords, but not allow setting coordinates further up the tree. I believe that's consistent with the behaviour we currently have for dt.__setitem__ anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! Here are some ideas

xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
Comment on lines 828 to 829
# TODO is there a potential bug here? What happens if a dim is only present on data variables?
return self._data.dims
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be fine. Dimensions on DatasetCoordinates also include dimensions that are only present on data variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? This (on main) looks wrong to me:

In [1]: import xarray as xr

In [2]: ds = xr.Dataset({'a': ('x', [0, 1])})

In [3]: ds
Out[3]: 
<xarray.Dataset> Size: 16B
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 16B 0 1

In [4]: ds.coords
Out[4]: 
Coordinates:
    *empty*

In [5]: ds.coords.dims
Out[5]: FrozenMappingWarningOnValuesAccess({'x': 2})

I mean the fact no-one has raised this before means it probably isn't of much consequence, but it does seem incorrect / misleading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised #9466 to track this

xarray/core/coordinates.py Outdated Show resolved Hide resolved
self, coords: dict[Hashable, Variable], indexes: Mapping[Any, Index]
) -> None:

# TODO I don't know how to update coordinates that live in parent nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is OK for this to be an error. The user can replace those coordinates on the parent nodes.

xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Show resolved Hide resolved
@TomNicholas

This comment was marked as outdated.

Comment on lines 891 to 896
def _drop_indexed_coords(self, coords_to_drop: set[Hashable]) -> None:
assert self._data.xindexes is not None
new_coords = drop_indexed_coords(coords_to_drop, self)
for name in self._data._node_coord_variables - new_coords._names:
del self._data._node_coord_variables[name]
self._data._node_indexes = dict(new_coords.xindexes)
Copy link
Member Author

@TomNicholas TomNicholas Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that this ._drop_indexed_coords method, and its equivalent in DatasetCoordinates, is never being called... I can't find any references to it, and if I raise in both methods then the whole test suite still passes locally.

@dcherian @benbovy you both touched this method last, do you know why it still exists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 9dc845a

@TomNicholas TomNicholas marked this pull request as ready for review September 9, 2024 17:40
# TODO should inherited coordinates be here? It would be very hard to allow updating them...
# But actually maybe the ChainMap approach would make this work okay??

_data: DataTree
Copy link
Member Author

@TomNicholas TomNicholas Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from Coordinates and using DataTree instead of DataWithCoords here both expose a few legitimate typing issues. Particularly:

Mostly we have got away with these discrepancies so far but inheritance here is probably going to mean I need to either ignore or cast all over the place?

cc @headtr1ck

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this wasn't very difficult to work around, it mostly just required adding a type: ignore[assignment] here :)

@TomNicholas
Copy link
Member Author

I think this is basically ready now @shoyer

xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/coordinates.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/tests/test_datatree.py Show resolved Hide resolved
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, thanks Tom!

@TomNicholas TomNicholas enabled auto-merge (squash) September 11, 2024 03:48
@TomNicholas TomNicholas merged commit 781877c into pydata:main Sep 11, 2024
27 checks passed
@TomNicholas TomNicholas deleted the datatree_coords_setitem branch September 11, 2024 04:11
dcherian added a commit to dcherian/xarray that referenced this pull request Sep 17, 2024
* main: (26 commits)
  Forbid modifying names of DataTree objects with parents (pydata#9494)
  DAS-2155 - Merge datatree documentation into main docs. (pydata#9033)
  Make illegal path-like variable names when constructing a DataTree from a Dataset (pydata#9378)
  Ensure TreeNode doesn't copy in-place (pydata#9482)
  `open_groups` for zarr backends (pydata#9469)
  Update pyproject.toml (pydata#9484)
  New whatsnew section (pydata#9483)
  Release notes for v2024.09.0 (pydata#9480)
  Fix `DataTree.coords.__setitem__` by adding `DataTreeCoordinates` class (pydata#9451)
  Rename DataTree's "ds" and "data" to "dataset" (pydata#9476)
  Update DataTree repr to indicate inheritance (pydata#9470)
  Bump pypa/gh-action-pypi-publish in the actions group (pydata#9460)
  Repo checker (pydata#9450)
  Add days_in_year and decimal_year to dt accessor (pydata#9105)
  remove parent argument from DataTree.__init__ (pydata#9465)
  Fix inheritance in DataTree.copy() (pydata#9457)
  Implement `DataTree.__delitem__` (pydata#9453)
  Add ASV for datatree.from_dict (pydata#9459)
  Make the first argument in DataTree.from_dict positional only (pydata#9446)
  Fix typos across the code, doc and comments (pydata#9443)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Sep 17, 2024
* main: (29 commits)
  Release notes for v2024.09.0 (pydata#9480)
  Fix `DataTree.coords.__setitem__` by adding `DataTreeCoordinates` class (pydata#9451)
  Rename DataTree's "ds" and "data" to "dataset" (pydata#9476)
  Update DataTree repr to indicate inheritance (pydata#9470)
  Bump pypa/gh-action-pypi-publish in the actions group (pydata#9460)
  Repo checker (pydata#9450)
  Add days_in_year and decimal_year to dt accessor (pydata#9105)
  remove parent argument from DataTree.__init__ (pydata#9465)
  Fix inheritance in DataTree.copy() (pydata#9457)
  Implement `DataTree.__delitem__` (pydata#9453)
  Add ASV for datatree.from_dict (pydata#9459)
  Make the first argument in DataTree.from_dict positional only (pydata#9446)
  Fix typos across the code, doc and comments (pydata#9443)
  DataTree should not be "Generic" (pydata#9445)
  Disallow passing a DataArray as data into the DataTree constructor (pydata#9444)
  Support additional dtypes in `resample` (pydata#9413)
  Shallow copy parent and children in DataTree constructor (pydata#9297)
  Bump minimum versions for dependencies (pydata#9434)
  Always include at least one category in random test data (pydata#9436)
  Avoid deep-copy when constructing groupby codes (pydata#9429)
  ...
hollymandel pushed a commit to hollymandel/xarray that referenced this pull request Sep 23, 2024
…ss (pydata#9451)

* add a DataTreeCoordinates class

* passing read-only properties tests

* tests for modifying in-place

* WIP making the modification test pass

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get to the delete tests

* test

* improve error message

* implement delitem

* test KeyError

* subclass Coordinates instead of DatasetCoordinates

* use Frozen(self._data._coord_variables)

* Simplify when to raise KeyError

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* correct bug in suggestion

* Update xarray/core/coordinates.py

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* simplify _update_coords by creating new node data first

* update indexes correctly

* passes test

* update ._drop_indexed_coords

* some mypy fixes

* remove the apparently-unused _drop_indexed_coords method

* fix import error

* test that Dataset and DataArray constructors can handle being passed a DataTreeCoordinates object

* test dt.coords can be passed to DataTree constructor

* improve readability of inline comment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* initial tests with inherited coords

* ignore typeerror indicating dodgy inheritance

* try to avoid Unbound type error

* cast return value correctly

* cehck that .coords works with inherited coords

* fix data->dataset

* fix return type of __getitem__

* Use .dataset instead of .to_dataset()

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* _check_alignment -> check_alignment

* remove dict comprehension

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* KeyError message formatting

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* keep generic types for .dims and .sizes

* test verifying you cant delete inherited coord

* fix mypy complaint

* type hint  as accepting  objects

* update note about .dims returning all dims

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

DataTree.coords.__setitem__ is broken
2 participants