Datatree alignment docs #9501

Merged
merged 41 commits into from
Oct 13, 2024
ae71437
remove too-long underline
TomNicholas Sep 15, 2024
928767a
draft section on data alignment
TomNicholas Sep 15, 2024
1adb945
fixes
TomNicholas Sep 15, 2024
ae1bcfd
draft section on coordinate inheritance
TomNicholas Sep 15, 2024
f025371
various improvements
TomNicholas Sep 15, 2024
7549ee9
more improvements
TomNicholas Sep 15, 2024
b631697
link from other page
TomNicholas Sep 15, 2024
02bf96b
align call include all 3 datasets
TomNicholas Sep 15, 2024
152d74a
link back to use cases
TomNicholas Sep 15, 2024
57b7f06
clarification
TomNicholas Sep 15, 2024
d3ac1a7
small improvements
TomNicholas Sep 15, 2024
adf7579
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 23, 2024
d73dd8a
remove TODO after #9532
TomNicholas Sep 23, 2024
d779e22
add todo about #9475
TomNicholas Sep 23, 2024
3c9ad55
correct xr.align example call
TomNicholas Sep 23, 2024
5a4309a
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Sep 23, 2024
4cee745
add links to netCDF4 documentation
TomNicholas Sep 23, 2024
4c030d8
Consistent voice
TomNicholas Sep 23, 2024
09385fd
Merge branch 'main' into datatree_alignment_docs
TomNicholas Sep 26, 2024
35ab311
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 6, 2024
6db4a0b
keep indexes in lat lon selection to dodge #9475
TomNicholas Oct 6, 2024
22f2726
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Oct 6, 2024
e879dbb
unpack generator properly
TomNicholas Oct 6, 2024
401c6b0
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 10, 2024
118e802
ideas for next section
TomNicholas Oct 10, 2024
c129eb1
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 11, 2024
b245bdd
briefly summarize what alignment means
TomNicholas Oct 12, 2024
9b8fc9b
clarify that it's the data in each node that was previously unrelated
TomNicholas Oct 12, 2024
b6385ce
fix incorrect indentation of code block
TomNicholas Oct 12, 2024
d2918bb
display the tree with redundant coordinates again
TomNicholas Oct 12, 2024
6cab6f8
remove content about non-inherited coords for a follow-up PR
TomNicholas Oct 12, 2024
af5c6b7
remove todo
TomNicholas Oct 12, 2024
00105a4
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
d49c2de
remove todo now that aggregations are re-implemented
TomNicholas Oct 13, 2024
a3d5223
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
44b14ef
Merge branch 'main' into datatree_alignment_docs
TomNicholas Oct 13, 2024
44bcf6c
remove link to (unmerged) migration guide
TomNicholas Oct 13, 2024
ee78160
Merge branch 'datatree_alignment_docs' of https://github.com/TomNicho…
TomNicholas Oct 13, 2024
ea99430
remove todo about improving error message
TomNicholas Oct 13, 2024
64bb8ba
correct statement in data-structures docs
TomNicholas Oct 13, 2024
82a70a0
fix internal link
TomNicholas Oct 13, 2024
3 changes: 2 additions & 1 deletion doc/user-guide/data-structures.rst
Expand Up @@ -771,7 +771,7 @@ Here there are four different coordinate variables, which apply to variables in
``station`` is used only for ``weather`` variables
``lat`` and ``lon`` are only used for ``satellite`` images

Coordinate variables are inherited to descendent nodes, which means that
Coordinate variables are inherited to descendent nodes, which is only possible because
variables at different levels of a hierarchical DataTree are always
aligned. Placing the ``time`` variable at the root node automatically indicates
that it applies to all descendent nodes. Similarly, ``station`` is in the base
Expand Down Expand Up @@ -800,6 +800,7 @@ included by default unless you exclude them with the ``inherit`` flag:

    dt2["/weather/temperature"].to_dataset(inherit=False)

For more examples and further discussion see :ref:`alignment and coordinate inheritance <hierarchical-data.alignment-and-coordinate-inheritance>`.

.. _coordinates:

Expand Down
151 changes: 149 additions & 2 deletions doc/user-guide/hierarchical-data.rst
@@ -1,7 +1,7 @@
.. _hierarchical-data:
.. _userguide.hierarchical-data:

Hierarchical data
==============================
=================

.. ipython:: python
    :suppress:
Expand All @@ -15,6 +15,8 @@ Hierarchical data

    %xmode minimal

.. _why:

Why Hierarchical Data?
----------------------

Expand Down Expand Up @@ -644,3 +646,148 @@ We could use this feature to quickly calculate the electrical power in our signa

    power = currents * voltages
    power

.. _hierarchical-data.alignment-and-coordinate-inheritance:

Alignment and Coordinate Inheritance
------------------------------------

.. _data-alignment:

Data Alignment
~~~~~~~~~~~~~~

Comment on lines +657 to +658

Member Author: TODO: add comment about open_groups being useful if your data doesn't align

Member: this is the only note I have on open_groups, it probably deserves more. https://github.com/pydata/xarray/blob/main/doc/getting-started-guide/quick-overview.rst?plain=1#L284

Member Author: gonna prioritize merging this and improving documentation for open_groups later

The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes.
Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.
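
As a small illustration of what exact alignment requires, the following sketch compares a few made-up datasets (``a``, ``b``, ``c`` and ``d`` are illustrative names, not part of the examples in this guide). It assumes only that :py:func:`~xarray.align` raises a ``ValueError`` (or a subclass of it) when ``join="exact"`` cannot be satisfied. Matching lengths alone are not enough: the index labels must match too.

```python
import xarray as xr

# Illustrative datasets, each with a single indexed dimension "x".
a = xr.Dataset(coords={"x": [0, 1, 2]})
b = xr.Dataset(coords={"x": [0, 1, 2]})  # same length, same labels
c = xr.Dataset(coords={"x": [10, 11, 12]})  # same length, different labels
d = xr.Dataset(coords={"x": [0, 1]})  # different length

# Equal lengths and equal indexes: exact alignment succeeds.
xr.align(a, b, join="exact")

# Anything else is rejected when join="exact".
rejected = []
for name, other in {"c": c, "d": d}.items():
    try:
        xr.align(a, other, join="exact")
    except ValueError:
        rejected.append(name)

print(rejected)
```

Both ``c`` and ``d`` end up in ``rejected``: the first because its labels differ, the second because its length differs.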

.. note::
    If you were a previous user of the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to!
    In that package the data model was that the data stored in each node actually was completely unrelated. The data model is now slightly stricter.
    This allows us to provide features like :ref:`coordinate-inheritance`.

To demonstrate, let's first generate some example datasets which are not aligned with one another:

.. ipython:: python

    # (drop the attributes just to make the printed representation shorter)
    ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

    ds_daily = ds.resample(time="D").mean("time")
    ds_weekly = ds.resample(time="W").mean("time")
    ds_monthly = ds.resample(time="ME").mean("time")

These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.

.. ipython:: python

    ds_daily.sizes
    ds_weekly.sizes
    ds_monthly.sizes

We cannot store these variables unchanged on a single :py:class:`~xarray.Dataset` object, because they do not align exactly along the ``time`` dimension:

Member Author: I guess it would be more correct to say that we cannot store them unchanged.

.. ipython:: python
    :okexcept:

    xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")

But we :ref:`previously said <why>` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:

.. ipython:: python
    :okexcept:

    xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})

This is because DataTree checks that data in child nodes align exactly with their parents.

.. note::
    This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions <https://www.unidata.ucar.edu/software/netcdf/workshops/2007/groups-types/Introduction.html>`_, as in netCDF-4 files dimensions are `visible to all child groups <https://docs.unidata.ucar.edu/netcdf-c/current/groups.html>`_.

This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds:

Contributor: Before getting to this statement, I had added a comment saying we should make it clear that the alignment check ensures alignment with all ancestors, not just the immediate parent. But this covers it nicely!

.. code:: python

    xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")

To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendents of one another, e.g. organize them as siblings.

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
    )
    dt

Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group.

This is a useful way to organise our data because we can still operate on all the groups at once.
For example we can extract all three timeseries at a specific lat-lon location:

.. ipython:: python

    dt.sel(lat=75, lon=300)

or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:

.. ipython:: python

    dt.std(dim="time")

.. _coordinate-inheritance:

Coordinate Inheritance
~~~~~~~~~~~~~~~~~~~~~~

Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.

.. ipython:: python

    dt

We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.

.. note::
    This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package.

Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:

.. ipython:: python

    dt = xr.DataTree.from_dict(
        {
            "/": ds.drop_dims("time"),
            "daily": ds_daily.drop_vars(["lat", "lon"]),
            "weekly": ds_weekly.drop_vars(["lat", "lon"]),
            "monthly": ds_monthly.drop_vars(["lat", "lon"]),
        }
    )
    dt

This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates.
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.

We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:

.. ipython:: python

    dt.daily.coords
    dt["daily/lat"]

As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.

If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:

.. ipython:: python

    print(dt["/daily"])

This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.

We can also still perform all the same operations on the whole tree:

.. ipython:: python

    dt.sel(lat=[75], lon=[300])

    dt.std(dim="time")