Commit

Expose "Coordinates" as part of Xarray's public API (#7368)
* add indexes argument to Dataset.__init__

* make indexes arg public for DataArray.__init__

* Indexes constructor updates

- easily create an empty Indexes collection
- check consistency between indexes and variables

* use the generic Mapping[Any, Index] for indexes

* add wrap_pandas_multiindex function

* do not create default indexes when not desired

* fix Dataset dimensions

TODO: check indexes shapes / dims for DataArray

* copy the coordinate variables of passed indexes

* DataArray: check dimensions/shape of index coords

* docstrings tweaks

* more Indexes safety

Since its constructor can now be used publicly.

Copy input mappings and check the type of input indexes.

* ensure input indexes are Xarray indexes

* add .assign_indexes() method

* add `IndexedCoordinates` subclass

+ add `IndexedCoordinates.from_pandas_multiindex` helper.

* rollback/update Dataset and DataArray constructors

Drop the `indexes` argument or keep it as private API.

When a `Coordinates` object is passed as `coords` argument, extract both
coordinate variables and indexes and add them to the new Dataset or
DataArray.

* update docstrings

* fix Dataset creation internal error

* add IndexedCoordinates.merge_coords

* drop IndexedCoordinates and reuse Coordinates

* update api docs

* make Coordinates init args optional

* docstrings updates

* convert to base variable when no index is given

* raise when an index is given with no variable

* skip create default indexes...

... When a Coordinates object is given to the Dataset constructor

* invariant checks: maybe skip IndexVariable checks

... when check_default_indexes is False.

* add Coordinates tests

* more Coordinates tests

* add Dataset constructor tests with Coordinates

* fix mypy

* assign_coords: do not create default indexes...

... when passing a Coordinates object

* support alignment of Coordinates

* clean-up

* fix failing test (dataarray coords not extracted)

* fix tests: prevent index conflicts

Do not extract multi-coordinate indexes from DataArray if they are
overwritten or dropped (dimension coordinate).

* add Coordinates.equals and Coordinates.identical

* more tests, docstrings, docs

* fix assert_* (Coordinates subclasses)

* review copy

* another few tests

* fix mypy

* update what's new

* do not copy indexes

May corrupt multi-coordinate indexes.

* add Coordinates fastpath constructor

* fix sphinx directive

* re-add coord indexes in merge (dataset constructor)

This re-enables the optimization in deep_align that skips
alignment for any alignable (DataArray) in a dict that
matches an index key.

* create coords with default idx: try a cleaner impl

Coordinate variables and indexes extracted from DataArrays should be
merged more properly.

* some useful comments for later

* xr.merge: add support for Coordinates objects

* allow skip align for object(s) in merge_core

This fixes the decrease in performance observed in Dataset creation
benchmarks.

When creating a new Dataset, the variables and indexes in `Coordinates`
should already be aligned together so it doesn't need to go through the
complex alignment logic once again. `Coordinates` indexes are still used
to align data variables.

* fix mypy

* what's new tweaks

* align Coordinates callbacks: don't reindex data vars

* fix Coordinates._overwrite_indexes callback

mypy was rightfully complaining. This callback is called from Aligner
only, which passes the first two arguments and ignores the rest.

* remove merge_coords

* futurewarning: pass multi-index via data vars

* review comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix circular imports

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typing: add Alignable protocol class

* try fixing mypy error (Self redefinition)

* remove Coordinate alias of Variable

Much water has flowed under the bridge since it has been renamed.

* fix groupby test

* doc: remove merge_coords in api reference

* doc: improve docstrings and glossary

* use Self type annotation in Coordinate class

* better comment

* fix Self undefined error with python < 3.11

Pyright displays an info message "Self is not valid in this context", but
most importantly this should avoid runtime errors with python < 3.11.

---------

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
4 people authored Jul 21, 2023
1 parent efa2863 commit 4441f99
Showing 21 changed files with 1,103 additions and 277 deletions.
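
The constructor behaviour described in the commit message (a `Coordinates` object passed as `coords` contributes both its variables and its indexes, and no default index is created for it) can be sketched as follows. This is a minimal illustration, assuming an xarray release that includes this change; the variable names and values are made up, and an empty `indexes` mapping is passed explicitly because the default of that argument is not stated on this page.

```python
import numpy as np
import xarray as xr

# Wrap a dimension coordinate in a Coordinates object with no index at all.
# The explicit empty mapping is an assumption-free way to request "no indexes".
coords = xr.Coordinates({"x": ("x", np.arange(3))}, indexes={})

# Passing the Coordinates object as `coords` should keep "x" as a coordinate
# while skipping the automatic creation of a pandas index for it.
ds_no_index = xr.Dataset({"v": ("x", [10.0, 20.0, 30.0])}, coords=coords)

# For comparison: a plain mapping still gets a default index for "x".
ds_default = xr.Dataset({"v": ("x", [10.0, 20.0, 30.0])}, coords={"x": np.arange(3)})

print("x" in ds_no_index.coords, "x" in ds_no_index.xindexes)
print("x" in ds_default.coords, "x" in ds_default.xindexes)
```

Without an index, `x` still labels the dimension but cannot be used for label-based selection or alignment.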
48 changes: 38 additions & 10 deletions doc/api-hidden.rst
@@ -9,17 +9,40 @@
.. autosummary::
:toctree: generated/

Coordinates.from_pandas_multiindex
Coordinates.get
Coordinates.items
Coordinates.keys
Coordinates.values
Coordinates.dims
Coordinates.dtypes
Coordinates.variables
Coordinates.xindexes
Coordinates.indexes
Coordinates.to_dataset
Coordinates.to_index
Coordinates.update
Coordinates.merge
Coordinates.copy
Coordinates.equals
Coordinates.identical

core.coordinates.DatasetCoordinates.get
core.coordinates.DatasetCoordinates.items
core.coordinates.DatasetCoordinates.keys
core.coordinates.DatasetCoordinates.merge
core.coordinates.DatasetCoordinates.to_dataset
core.coordinates.DatasetCoordinates.to_index
core.coordinates.DatasetCoordinates.update
core.coordinates.DatasetCoordinates.values
core.coordinates.DatasetCoordinates.dims
core.coordinates.DatasetCoordinates.indexes
core.coordinates.DatasetCoordinates.dtypes
core.coordinates.DatasetCoordinates.variables
core.coordinates.DatasetCoordinates.xindexes
core.coordinates.DatasetCoordinates.indexes
core.coordinates.DatasetCoordinates.to_dataset
core.coordinates.DatasetCoordinates.to_index
core.coordinates.DatasetCoordinates.update
core.coordinates.DatasetCoordinates.merge
core.coordinates.DataArrayCoordinates.copy
core.coordinates.DatasetCoordinates.equals
core.coordinates.DatasetCoordinates.identical

core.rolling.DatasetCoarsen.boundary
core.rolling.DatasetCoarsen.coord_func
@@ -47,14 +70,19 @@
core.coordinates.DataArrayCoordinates.get
core.coordinates.DataArrayCoordinates.items
core.coordinates.DataArrayCoordinates.keys
core.coordinates.DataArrayCoordinates.merge
core.coordinates.DataArrayCoordinates.to_dataset
core.coordinates.DataArrayCoordinates.to_index
core.coordinates.DataArrayCoordinates.update
core.coordinates.DataArrayCoordinates.values
core.coordinates.DataArrayCoordinates.dims
core.coordinates.DataArrayCoordinates.indexes
core.coordinates.DataArrayCoordinates.dtypes
core.coordinates.DataArrayCoordinates.variables
core.coordinates.DataArrayCoordinates.xindexes
core.coordinates.DataArrayCoordinates.indexes
core.coordinates.DataArrayCoordinates.to_dataset
core.coordinates.DataArrayCoordinates.to_index
core.coordinates.DataArrayCoordinates.update
core.coordinates.DataArrayCoordinates.merge
core.coordinates.DataArrayCoordinates.copy
core.coordinates.DataArrayCoordinates.equals
core.coordinates.DataArrayCoordinates.identical

core.rolling.DataArrayCoarsen.boundary
core.rolling.DataArrayCoarsen.coord_func
1 change: 1 addition & 0 deletions doc/api.rst
@@ -1085,6 +1085,7 @@ Advanced API
.. autosummary::
:toctree: generated/

Coordinates
Dataset.variables
DataArray.variable
Variable
69 changes: 44 additions & 25 deletions doc/user-guide/terminology.rst
@@ -54,23 +54,22 @@ complete examples, please consult the relevant documentation.*
Coordinate
An array that labels a dimension or set of dimensions of another
``DataArray``. In the usual one-dimensional case, the coordinate array's
values can loosely be thought of as tick labels along a dimension. There
are two types of coordinate arrays: *dimension coordinates* and
*non-dimension coordinates* (see below). A coordinate named ``x`` can be
retrieved from ``arr.coords[x]``. A ``DataArray`` can have more
coordinates than dimensions because a single dimension can be labeled by
multiple coordinate arrays. However, only one coordinate array can be a
assigned as a particular dimension's dimension coordinate array. As a
values can loosely be thought of as tick labels along a dimension. We
distinguish :term:`Dimension coordinate` vs. :term:`Non-dimension
coordinate` and :term:`Indexed coordinate` vs. :term:`Non-indexed
coordinate`. A coordinate named ``x`` can be retrieved from
``arr.coords[x]``. A ``DataArray`` can have more coordinates than
dimensions because a single dimension can be labeled by multiple
coordinate arrays. However, only one coordinate array can be a assigned
as a particular dimension's dimension coordinate array. As a
consequence, ``len(arr.dims) <= len(arr.coords)`` in general.

Dimension coordinate
A one-dimensional coordinate array assigned to ``arr`` with both a name
and dimension name in ``arr.dims``. Dimension coordinates are used for
label-based indexing and alignment, like the index found on a
:py:class:`pandas.DataFrame` or :py:class:`pandas.Series`. In fact,
dimension coordinates use :py:class:`pandas.Index` objects under the
hood for efficient computation. Dimension coordinates are marked by
``*`` when printing a ``DataArray`` or ``Dataset``.
and dimension name in ``arr.dims``. Usually (but not always), a
dimension coordinate is also an :term:`Indexed coordinate` so that it can
be used for label-based indexing and alignment, like the index found on
a :py:class:`pandas.DataFrame` or :py:class:`pandas.Series`.

Non-dimension coordinate
A coordinate array assigned to ``arr`` with a name in ``arr.coords`` but
@@ -79,20 +78,40 @@ complete examples, please consult the relevant documentation.*
example, multidimensional coordinates are often used in geoscience
datasets when :doc:`the data's physical coordinates (such as latitude
and longitude) differ from their logical coordinates
<../examples/multidimensional-coords>`. However, non-dimension coordinates
are not indexed, and any operation on non-dimension coordinates that
leverages indexing will fail. Printing ``arr.coords`` will print all of
``arr``'s coordinate names, with the corresponding dimension(s) in
parentheses. For example, ``coord_name (dim_name) 1 2 3 ...``.
<../examples/multidimensional-coords>`. Printing ``arr.coords`` will
print all of ``arr``'s coordinate names, with the corresponding
dimension(s) in parentheses. For example, ``coord_name (dim_name) 1 2 3
...``.

Indexed coordinate
A coordinate which has an associated :term:`Index`. Generally this means
that the coordinate labels can be used for indexing (selection) and/or
alignment. An indexed coordinate may have one or more arbitrary
dimensions although in most cases it is also a :term:`Dimension
coordinate`. It may or may not be grouped with other indexed coordinates
depending on whether they share the same index. Indexed coordinates are
marked by ``*`` when printing a ``DataArray`` or ``Dataset``.

Non-indexed coordinate
A coordinate which has no associated :term:`Index`. It may still
represent fixed labels along one or more dimensions but it cannot be
used for label-based indexing and alignment.

Index
An *index* is a data structure optimized for efficient selecting and
slicing of an associated array. Xarray creates indexes for dimension
coordinates so that operations along dimensions are fast, while
non-dimension coordinates are not indexed. Under the hood, indexes are
implemented as :py:class:`pandas.Index` objects. The index associated
with dimension name ``x`` can be retrieved by ``arr.indexes[x]``. By
construction, ``len(arr.dims) == len(arr.indexes)``
An *index* is a data structure optimized for efficient data selection
and alignment within a discrete or continuous space that is defined by
coordinate labels (unless it is a functional index). By default, Xarray
creates a :py:class:`~xarray.indexes.PandasIndex` object (i.e., a
:py:class:`pandas.Index` wrapper) for each :term:`Dimension coordinate`.
For more advanced use cases (e.g., staggered or irregular grids,
geospatial indexes), Xarray also accepts any instance of a specialized
:py:class:`~xarray.indexes.Index` subclass that is associated to one or
more arbitrary coordinates. The index associated with the coordinate
``x`` can be retrieved by ``arr.xindexes[x]`` (or ``arr.indexes["x"]``
if the index is convertible to a :py:class:`pandas.Index` object). If
two coordinates ``x`` and ``y`` share the same index,
``arr.xindexes[x]`` and ``arr.xindexes[y]`` both return the same
:py:class:`~xarray.indexes.Index` object.

name
The names of dimensions, coordinates, DataArray objects and data
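
To make the glossary's distinction between indexed and non-indexed coordinates concrete, here is a small sketch. The array values and the coordinate name `label` are illustrative assumptions, not part of the commit. It relies on the default behaviour described above, where a `PandasIndex` is created for the dimension coordinate only.

```python
import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

da = xr.DataArray(
    [1.0, 2.0, 3.0],
    dims="x",
    coords={
        "x": [10, 20, 30],                 # dimension coordinate, indexed by default
        "label": ("x", ["a", "b", "c"]),   # non-dimension coordinate, no index
    },
)

# `xindexes` returns Xarray index objects; `indexes` returns pandas objects.
x_xindex = da.xindexes["x"]
x_pdindex = da.indexes["x"]
print(type(x_xindex).__name__, type(x_pdindex).__name__)

# "label" has no associated index, so it is a non-indexed coordinate.
print("label" in da.xindexes)

# Label-based selection works through the index on "x".
selected = da.sel(x=20).item()
```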
14 changes: 14 additions & 0 deletions doc/whats-new.rst
@@ -22,6 +22,20 @@ v2023.07.1 (unreleased)
New Features
~~~~~~~~~~~~

- :py:class:`Coordinates` can now be constructed independently of any Dataset or
DataArray (it is also returned by the :py:attr:`Dataset.coords` and
:py:attr:`DataArray.coords` properties). ``Coordinates`` objects are useful for
passing both coordinate variables and indexes to new Dataset / DataArray objects,
e.g., via their constructor or via :py:meth:`Dataset.assign_coords`. We may also
wrap coordinate variables in a ``Coordinates`` object in order to skip
the automatic creation of (pandas) indexes for dimension coordinates.
The :py:class:`Coordinates.from_pandas_multiindex` constructor may be used to
create coordinates directly from a :py:class:`pandas.MultiIndex` object (it is
preferred over passing it directly as coordinate data, which may be deprecated soon).
Like Dataset and DataArray objects, ``Coordinates`` objects may now be used in
:py:func:`align` and :py:func:`merge`.
(:issue:`6392`, :pull:`7368`).
By `Benoît Bovy <https://github.com/benbovy>`_.
- Visually group together coordinates with the same indexes in the index section of the text repr (:pull:`7225`).
By `Justus Magin <https://github.com/keewis>`_.
- Allow creating Xarray objects where a multidimensional variable shares its name
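
The `Coordinates.from_pandas_multiindex` entry above can be illustrated with a short sketch. The level names, the dimension name `z` and the data values are hypothetical; the example assumes an xarray release that includes this change.

```python
import pandas as pd
import xarray as xr

# Build a two-level pandas MultiIndex (illustrative labels).
midx = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1]], names=("letter", "number")
)

# Wrap it explicitly instead of passing it directly as coordinate data.
coords = xr.Coordinates.from_pandas_multiindex(midx, "z")

# Assign the coordinates (variables and index together) to an existing Dataset.
ds = xr.Dataset({"v": ("z", [1.0, 2.0, 3.0, 4.0])})
ds_midx = ds.assign_coords(coords)

coord_names = sorted(ds_midx.coords)
# The level coordinates share a single Xarray index object.
shared = ds_midx.xindexes["letter"] is ds_midx.xindexes["number"]
print(coord_names, shared)
```

Passing the wrapped object carries the multi-index along with its coordinate variables, so it is not rebuilt from the coordinate data.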
4 changes: 3 additions & 1 deletion xarray/__init__.py
@@ -26,6 +26,7 @@
where,
)
from xarray.core.concat import concat
from xarray.core.coordinates import Coordinates
from xarray.core.dataarray import DataArray
from xarray.core.dataset import Dataset
from xarray.core.extensions import (
@@ -37,7 +38,7 @@
from xarray.core.merge import Context, MergeError, merge
from xarray.core.options import get_options, set_options
from xarray.core.parallel import map_blocks
from xarray.core.variable import Coordinate, IndexVariable, Variable, as_variable
from xarray.core.variable import IndexVariable, Variable, as_variable
from xarray.util.print_versions import show_versions

try:
@@ -100,6 +101,7 @@
"CFTimeIndex",
"Context",
"Coordinate",
"Coordinates",
"DataArray",
"Dataset",
"Index",