Datatree alignment docs #9501

Merged on Oct 13, 2024 (41 commits)
Changes from 19 commits

Commits
ae71437
remove too-long underline
TomNicholas Sep 15, 2024
928767a
draft section on data alignment
TomNicholas Sep 15, 2024
1adb945
fixes
TomNicholas Sep 15, 2024
ae1bcfd
draft section on coordinate inheritance
TomNicholas Sep 15, 2024
f025371
various improvements
TomNicholas Sep 15, 2024
7549ee9
more improvements
TomNicholas Sep 15, 2024
b631697
link from other page
TomNicholas Sep 15, 2024
02bf96b
align call include all 3 datasets
TomNicholas Sep 15, 2024
152d74a
link back to use cases
TomNicholas Sep 15, 2024
57b7f06
clarification
TomNicholas Sep 15, 2024
d3ac1a7
small improvements
TomNicholas Sep 15, 2024
adf7579
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 23, 2024
d73dd8a
remove TODO after #9532
TomNicholas Sep 23, 2024
d779e22
add todo about #9475
TomNicholas Sep 23, 2024
3c9ad55
correct xr.align example call
TomNicholas Sep 23, 2024
5a4309a
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Sep 23, 2024
4cee745
add links to netCDF4 documentation
TomNicholas Sep 23, 2024
4c030d8
Consistent voice
TomNicholas Sep 23, 2024
09385fd
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 26, 2024
35ab311
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 6, 2024
6db4a0b
keep indexes in lat lon selection to dodge #9475
TomNicholas Oct 6, 2024
22f2726
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Oct 6, 2024
e879dbb
unpack generator properly
TomNicholas Oct 6, 2024
401c6b0
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 10, 2024
118e802
ideas for next section
TomNicholas Oct 10, 2024
c129eb1
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 11, 2024
b245bdd
briefly summarize what alignment means
TomNicholas Oct 12, 2024
9b8fc9b
clarify that it's the data in each node that was previously unrelated
TomNicholas Oct 12, 2024
b6385ce
fix incorrect indentation of code block
TomNicholas Oct 12, 2024
d2918bb
display the tree with redundant coordinates again
TomNicholas Oct 12, 2024
6cab6f8
remove content about non-inherited coords for a follow-up PR
TomNicholas Oct 12, 2024
af5c6b7
remove todo
TomNicholas Oct 12, 2024
00105a4
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
d49c2de
remove todo now that aggregations are re-implemented
TomNicholas Oct 13, 2024
a3d5223
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
44b14ef
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
44bcf6c
remove link to (unmerged) migration guide
TomNicholas Oct 13, 2024
ee78160
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Oct 13, 2024
ea99430
remove todo about improving error message
TomNicholas Oct 13, 2024
64bb8ba
correct statement in data-structures docs
TomNicholas Oct 13, 2024
82a70a0
fix internal link
TomNicholas Oct 13, 2024
1 change: 1 addition & 0 deletions doc/user-guide/data-structures.rst
@@ -800,6 +800,7 @@ included by default unless you exclude them with the ``inherited`` flag:

    dt2["/weather/temperature"].to_dataset(inherited=False)

For more examples and further discussion see :ref:`alignment-and-coordinate-inheritance`.

.. _coordinates:

166 changes: 165 additions & 1 deletion doc/user-guide/hierarchical-data.rst
@@ -1,7 +1,7 @@
.. _hierarchical-data:

Hierarchical data
==============================
=================

.. ipython:: python
    :suppress:

@@ -15,6 +15,8 @@ Hierarchical data

    %xmode minimal

.. _why:

Why Hierarchical Data?
----------------------

@@ -644,3 +646,165 @@ We could use this feature to quickly calculate the electrical power in our signa

    power = currents * voltages
    power

.. _alignment-and-coordinate-inheritance:

Alignment and Coordinate Inheritance
------------------------------------

.. _data-alignment:

Data Alignment
~~~~~~~~~~~~~~
Comment on lines +657 to +658

Member Author: TODO: add comment about open_groups being useful if your data doesn't align

Member: this is the only note I have on open_groups, it probably deserves more. https://github.com/pydata/xarray/blob/main/doc/getting-started-guide/quick-overview.rst?plain=1#L284

Member Author: gonna prioritize merging this and improving documentation for open_groups later

The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes. Exact alignment means that shared dimensions must have the same length, and indexes along those dimensions must be equal.

.. note::
    If you were a previous user of the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to!
    In that package the data model was that the data stored in each node was completely unrelated. The data model is now slightly stricter.
Contributor: Possible nit (feel free to ignore): Would it be clearer to say the information (or specifically Dataset object) contained on each node was unrelated?

Member Author: Yeah that's a great point, it would definitely be both more clear and more accurate to say that instead.

Member Author: clarified in 9b8fc9b

    This allows us to provide features like :ref:`coordinate-inheritance`. See the migration guide for more details on the differences.

To demonstrate, let's first generate some example datasets which are not aligned with one another:

.. ipython:: python

    # (drop the attributes just to make the printed representation shorter)
    ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

    ds_daily = ds.resample(time="D").mean("time")
    ds_weekly = ds.resample(time="W").mean("time")
    ds_monthly = ds.resample(time="ME").mean("time")

These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.

.. ipython:: python

    ds_daily.sizes
    ds_weekly.sizes
    ds_monthly.sizes

We cannot store these variables unchanged on a single :py:class:`~xarray.Dataset` object, because they do not exactly align:
Member Author: I guess it would be more correct to say that we cannot store them unchanged.


.. ipython:: python
    :okexcept:

    xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
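
The same failure can be reproduced without downloading the tutorial dataset. The following is a minimal sketch using synthetic data; the variable name ``air`` and the 60-day range are arbitrary choices for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the tutorial data: one value per day for 60 days.
time = pd.date_range("2000-01-01", periods=60, freq="D")
ds_daily = xr.Dataset({"air": ("time", np.arange(60.0))}, coords={"time": time})

# Weekly means have far fewer time steps than the daily data.
ds_weekly = ds_daily.resample(time="W").mean()

try:
    xr.align(ds_daily, ds_weekly, join="exact")
    aligned = True
except ValueError as err:
    # join="exact" refuses to reindex, so mismatched indexes raise an error.
    aligned = False
    print(f"alignment failed: {type(err).__name__}")
```

Catching ``ValueError`` is a deliberately broad choice here, since the precise exception subclass may differ between xarray versions.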

But we :ref:`previously said <why>` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:

.. ipython:: python
    :okexcept:

    xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})

(TODO: Looks like this error message could be improved by including information about which sizes are not equal.)

This is because DataTree checks that data in child nodes align exactly with their parents.

.. note::
    This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions <https://www.unidata.ucar.edu/software/netcdf/workshops/2007/groups-types/Introduction.html>`_, as in netCDF-4 files dimensions are `visible to all child groups <https://docs.unidata.ucar.edu/netcdf-c/current/groups.html>`_.

This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds:
Contributor: Before getting to this statement, I had added a comment saying we should make it clear that the alignment check ensures alignment with all ancestors, not just the immediate parent. But this covers it nicely!


.. code:: python

    xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")

To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendants of one another, e.g. organize them as siblings.

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
    )
    dt

Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group.

This is a useful way to organise our data because we can still operate on all the groups at once.
For example we can extract all three timeseries at a specific lat-lon location:

.. ipython:: python

    dt.sel(lat=75, lon=300)

or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:

.. ipython:: python

    dt.std(dim="time")

.. _coordinate-inheritance:

Coordinate Inheritance
~~~~~~~~~~~~~~~~~~~~~~

Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.
Contributor: LINK OR DISPLAY AGAIN

I'm tempted to say display it again after this paragraph.

Member Author: done in d2918bb


We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.

.. note::
    This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package.

Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {
            "/": ds.drop_dims("time"),
            "daily": ds_daily.drop_vars(["lat", "lon"]),
            "weekly": ds_weekly.drop_vars(["lat", "lon"]),
            "monthly": ds_monthly.drop_vars(["lat", "lon"]),
        }
    )
    dt

This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates.
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.

We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:

.. ipython:: python

    dt.daily.coords
    dt["daily/lat"]

(TODO: the repr of ``dt.coords`` should display which coordinates are inherited)

As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.

If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:

.. ipython:: python

    print(dt["/daily"])

This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.

We can also still perform all the same operations on the whole tree:

.. ipython:: python
    :okexcept:

    dt.sel(lat=75, lon=300)

    dt.std(dim="time")

(TODO: The first one repeats coordinates in the result due to https://github.com/pydata/xarray/issues/9475)

(TODO: The second one fails due to https://github.com/pydata/xarray/issues/8949)

.. _overriding-inherited-coordinates:

Overriding Inherited Coordinates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can override inherited coordinates with newly-defined ones, as long as those newly-defined coordinates also align with the parent nodes.

EXAMPLE OF THIS? WOULD IT MAKE MORE SENSE TO USE DIFFERENT DATA TO DEMONSTRATE THIS?

EXAMPLE OF INHERITING FROM A GRANDPARENT?

EXPLAIN DEDUPLICATION?
Contributor: Is the plan to include these points in this PR, or merge what is here (maybe with this commented out) and then add more content later?

Member Author: I was going to do it in this PR, but given that everyone seems to be happy with what's here already, and this is a natural break point, perhaps I will just merge this for now.

Member Author: A follow-up issue could be to "document the subtleties of coordinate inheritance"

Member Author: (I had hoped to add these bits before you reviewed it @owenlittlejohns )

Member Author: Removed that content for use in a future PR in 6cab6f8