add page on internal design
TomNicholas committed Jul 17, 2023
1 parent a47ff4e commit 198f67b
Showing 3 changed files with 142 additions and 35 deletions.
8 changes: 4 additions & 4 deletions doc/internals/index.rst
@@ -1,6 +1,6 @@
.. _internals:

xarray Internals
Xarray Internals
================

Xarray builds upon two of the foundational libraries of the scientific Python
@@ -11,15 +11,15 @@ compiled code to :ref:`optional dependencies<installing>`.
The pages in this section are intended for:

* Contributors to xarray who wish to better understand some of the internals,
* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.
* Developers from other fields who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
* Developers of other packages who wish to interface xarray with their existing tools, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.


.. toctree::
   :maxdepth: 2
   :hidden:

   variable-objects
   internal-design
   duck-arrays-integration
   chunked-arrays
   extending-xarray
138 changes: 138 additions & 0 deletions doc/internals/internal-design.rst
@@ -0,0 +1,138 @@
.. _internal design:

Internal Design
===============

This page gives an overview of the internal design of xarray.

In total, the Xarray project defines four key data structures.
In order of increasing complexity, they are:

- :py:class:`xarray.Variable`,
- :py:class:`xarray.DataArray`,
- :py:class:`xarray.Dataset`,
- :py:class:`datatree.DataTree`.

The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`,
but :py:class:`~xarray.Variable` is the fundamental object internally,
and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`.

.. note::

    Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API,
    and to merge the `xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package into xarray's main repository.

Internally, private :ref:`lazy indexing classes <internal design.lazy indexing>` are used to avoid loading more data than necessary,
and flexible index classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups.


.. _internal design.data structures:

Data Structures
---------------

The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior,
whereas this section explains how xarray's data structure classes actually work internally.


.. _internal design.data structures.variable:

Variable Objects
~~~~~~~~~~~~~~~~

The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
which is used as the basic building block behind xarray's
:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A
:py:class:`~xarray.Variable` consists of:

- ``dims``: A tuple of dimension names.
- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing
  the Variable's data. It must have the same number of dimensions as the length
  of ``dims``.
- ``attrs``: An ordered dictionary of metadata associated with this array. By
  convention, xarray's built-in operations never use this metadata.
- ``encoding``: Another ordered dictionary used to store information about how
  this variable's data is represented on disk. See :ref:`io.encoding` for more
  details.

:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use
of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.

However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
include coordinate labels along each axis.
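The points above can be illustrated with a minimal sketch. It assumes NumPy and a recent xarray release are installed; the dimension names and values are arbitrary choices for illustration, not anything prescribed by xarray.

```python
import numpy as np
import xarray as xr

# A 2-D Variable: dimension names plus an array of matching rank.
var = xr.Variable(
    dims=("x", "y"),
    data=np.arange(6.0).reshape(2, 3),
    attrs={"units": "m"},  # free-form metadata, illustrative only
)

# Reduce over a named dimension rather than a positional axis.
mean_over_y = var.mean(dim="y")

# Broadcasting matches dimensions by name, not by position.
row = xr.Variable(dims=("y",), data=np.array([10.0, 20.0, 30.0]))
col = xr.Variable(dims=("x",), data=np.array([100.0, 200.0]))
outer = row + col

print(mean_over_y.dims, mean_over_y.values)
print(dict(outer.sizes))
```

Note that nothing here carries coordinate labels: ``var`` knows that its first dimension is called ``x``, but not which physical values lie along it.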

:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled
data, it is mostly intended for advanced uses, such as in xarray itself, for
writing new backends, or when creating custom indexes.
You can access the variable objects that correspond to xarray objects via the (read-only)
:py:attr:`Dataset.variables <xarray.Dataset.variables>` and
:py:attr:`DataArray.variable <xarray.DataArray.variable>` attributes.
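As a rough illustration of those two attributes, the following sketch builds a small ``DataArray``, converts it to a ``Dataset``, and inspects the ``Variable`` objects underneath. The name ``temperature`` and the coordinate values are made up for the example.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="temperature",  # hypothetical name, used as the Dataset key below
)

# The single Variable holding the DataArray's own data.
data_var = da.variable

# A Dataset exposes every Variable (data and coordinate) in one mapping.
ds = da.to_dataset()
all_vars = ds.variables

print(type(data_var).__name__)
print(sorted(all_vars))
```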


.. _internal design.dataarray:

DataArray Objects
~~~~~~~~~~~~~~~~~

The simplest data structure used by most users is :py:class:`~xarray.DataArray`.
A :py:class:`~xarray.DataArray` is a composite object consisting of multiple
:py:class:`~xarray.core.variable.Variable` objects which store related data.

A single :py:class:`~xarray.Variable` is referred to as the "data variable", and is stored under the :py:attr:`~xarray.DataArray.variable` attribute.
A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``,
all of which are implemented by forwarding on to the underlying ``Variable`` object.
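The forwarding can be observed directly. This is only a sketch with illustrative values; it shows that a change made through the ``DataArray`` is visible on the underlying ``Variable``.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0), dims="x", attrs={"units": "K"})

# Each property on the DataArray mirrors the one on its data variable.
same_dims = da.dims == da.variable.dims
same_data = np.array_equal(da.data, da.variable.data)
same_encoding = da.encoding == da.variable.encoding

# Metadata written via the DataArray ends up on the Variable.
da.attrs["long_name"] = "air temperature"
forwarded = da.variable.attrs["long_name"]

print(same_dims, same_data, same_encoding, forwarded)
```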

In addition, a :py:class:`~xarray.DataArray` holds further ``Variable`` objects in a dict under the private ``_coords`` attribute,
each of which is referred to as a "coordinate variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``,
and each dim has a specific length. This means that the full :py:attr:`~xarray.DataArray.sizes` of the dataarray can be represented by a dictionary mapping dimension names to integer sizes.
The underlying data variable has exactly these sizes, and each attached coordinate variable has sizes that are a subset of them.
Another way of saying this is that all coordinate variables must be "alignable" with the data variable.
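A small sketch of this constraint, using only the public ``coords`` and ``sizes`` attributes rather than the private ``_coords`` dict. The coordinate names (``label``, ``scalar``, ``z``) are invented for the example.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={
        "x": [10, 20],                   # 1-D coordinate along x
        "label": ("y", ["a", "b", "c"]), # 1-D coordinate along y
        "scalar": 0,                     # 0-D coordinate, dims == ()
    },
)

sizes = dict(da.sizes)

# Every coordinate's dims are a subset of the data variable's dims.
all_subsets = all(set(c.dims) <= set(da.dims) for c in da.coords.values())

# A coordinate along a dimension the data lacks is rejected.
try:
    xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), coords={"z": ("z", [1, 2])})
    rejected = False
except ValueError:
    rejected = True

print(sizes, all_subsets, rejected)
```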

When a coordinate is accessed by the user (e.g. via the dict-like :py:meth:`~xarray.DataArray.__getitem__` syntax),
a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned.
This is why most users never see the ``Variable`` class underlying each coordinate variable: it is always promoted to a ``DataArray`` before returning.
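To make the promotion concrete, here is a hedged sketch with invented coordinate names. Selecting the ``label`` coordinate returns a ``DataArray``, and the ``x`` coordinate is left behind because its dimension is not compatible with the result.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20], "label": ("y", ["a", "b", "c"])},
)

# Dict-like access to a coordinate yields a DataArray, not a Variable.
label = da["label"]

# The bare Variable is still reachable, one level down.
label_var = label.variable

print(type(label).__name__, type(label_var).__name__)
print(list(label.coords))
```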

Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute.
Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space
(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``.
Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.isel` method) does not require consulting an ``Index`` object.

Finally a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data
variable but is stored on the wrapping ``DataArray`` class.
The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset`
(e.g. via :py:meth:`~xarray.DataArray.to_dataset`).
Note that the underlying :py:class:`~xarray.Variable` objects are all unnamed, so they can always be referred to uniquely via a
dict-like mapping.
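A brief sketch of how the name is used during promotion. The names ``temperature`` and ``pressure`` are placeholders chosen for the example.

```python
import xarray as xr

named = xr.DataArray([1.0, 2.0], dims="x", name="temperature")
unnamed = xr.DataArray([1.0, 2.0], dims="x")

# The DataArray's name becomes the key of the data variable in the Dataset.
ds_named = named.to_dataset()

# An unnamed DataArray needs a name supplied at promotion time.
ds_unnamed = unnamed.to_dataset(name="pressure")

print(list(ds_named.data_vars), list(ds_unnamed.data_vars))
```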

.. _internal design.dataset:

Dataset Objects
~~~~~~~~~~~~~~~

The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables.
Internally all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are
specified by storing their names in a private ``_coord_names`` set.

The dataset's ``dims`` are the set of all dims present across any variable, but (as in a ``DataArray``) coordinate
variables cannot have a dimension that is not present on any data variable.

When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible
coordinates before returning.
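The following sketch shows this layout through the public API only. The variable names and shapes are illustrative assumptions.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "temperature": (("x", "y"), np.zeros((2, 3))),
        "pressure": (("x",), np.ones(2)),
    },
    coords={"x": [10, 20], "y": [1, 2, 3]},
)

# One mapping holds data variables and coordinate variables together.
all_names = sorted(ds.variables)

# Coordinates are distinguished from data variables by name.
coord_names = set(ds.coords)
data_names = set(ds.data_vars)

# Accessing a variable builds a DataArray with only the compatible coordinates.
pressure = ds["pressure"]

print(all_names, dict(ds.sizes))
print(type(pressure).__name__, list(pressure.coords))
```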

.. _internal design.subclassing:

.. note::

    The way that selecting a variable from a ``DataArray`` or ``Dataset`` actually involves internally wrapping the
    ``Variable`` object back up into a ``DataArray``/``Dataset`` is the primary reason :ref:`we recommend against subclassing <internals.accessors.composition>`
    Xarray objects. The main problem it creates is that we currently cannot easily guarantee that for example selecting
    a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead
    of just an :py:class:`xarray.DataArray`. See `GH issue <https://github.com/pydata/xarray/issues/3980>`_ for more details.

.. _internal design.lazy indexing:

Lazy Indexing Classes
---------------------

TODO
31 changes: 0 additions & 31 deletions doc/internals/variable-objects.rst

This file was deleted.
