diff --git a/doc/internals/index.rst b/doc/internals/index.rst index 666f8fd2343..ad67037b2f0 100644 --- a/doc/internals/index.rst +++ b/doc/internals/index.rst @@ -1,6 +1,6 @@ .. _internals: -xarray Internals +Xarray Internals ================ Xarray builds upon two of the foundational libraries of the scientific Python @@ -11,15 +11,15 @@ compiled code to :ref:`optional dependencies`. The pages in this section are intended for: * Contributors to xarray who wish to better understand some of the internals, -* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users, -* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type. +* Developers from other fields who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users, +* Developers of other packages who wish to interface xarray with their existing tools, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type. .. toctree:: :maxdepth: 2 :hidden: - variable-objects + internal-design duck-arrays-integration chunked-arrays extending-xarray diff --git a/doc/internals/internal-design.rst b/doc/internals/internal-design.rst new file mode 100644 index 00000000000..a11d6b3cfd7 --- /dev/null +++ b/doc/internals/internal-design.rst @@ -0,0 +1,138 @@ +.. _internal design: + +Internal Design +=============== + +This page gives an overview of the internal design of xarray. + +In totality, the Xarray project defines 4 key data structures. +In order of increasing complexity, they are: + +- :py:class:`xarray.Variable`, +- :py:class:`xarray.DataArray`, +- :py:class:`xarray.Dataset`, +- :py:class:`datatree.DataTree`. + +The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`, +but :py:class:`~xarray.Variable` is the fundamental object internally, +and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`. + +.. note:: + + Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API, + and to merge the `xarray-datatree `_ package into xarray's main repository. + +Internally private :ref:`lazy indexing classes ` are used to avoid loading more data than necessary, +and flexible indexes classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups. + + +.. _internal design.data structures: + +Data Structures +--------------- + +The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior, +whereas this section explains how xarray's data structure classes actually work internally. + + +.. _internal design.data structures.variable: + +Variable Objects +~~~~~~~~~~~~~~~~ + +The core internal data structure in xarray is the :py:class:`~xarray.Variable`, +which is used as the basic building block behind xarray's +:py:class:`~xarray.Dataset`, :py:class:`~xarray.DataArray` types. A +:py:class:`~xarray.Variable` consists of: + +- ``dims``: A tuple of dimension names. +- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing + the Variable's data. It must have the same number of dimensions as the length + of ``dims``. +- ``attrs``: An ordered dictionary of metadata associated with this array. By + convention, xarray's built-in operations never use this metadata. +- ``encoding``: Another ordered dictionary used to store information about how + these variable's data is represented on disk. See :ref:`io.encoding` for more + details. + +:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use +of named dimensions. For example, it uses ``dim`` in preference to an ``axis`` +argument for methods like ``mean``, and supports :ref:`compute.broadcasting`. + +However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not +include coordinate labels along each axis. + +:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled +data, it is mostly intended for advanced uses, such as in xarray itself, for +writing new backends, or when creating custom indexes. +You can access the variable objects that correspond to xarray objects via the (readonly) +:py:attr:`Dataset.variables ` and +:py:attr:`DataArray.variable ` attributes. + + +.. _internal design.dataarray: + +DataArray Objects +~~~~~~~~~~~~~~~~~ + +The simplest data structure used by most users is :py:class:`~xarray.DataArray`. +A :py:class:`~xarray.DataArray` is a composite object consisting of multiple +:py:class:`~xarray.core.variable.Variable` objects which store related data. + +A single :py:class:`~xarray.core.Variable` is referred to as the "data variable", and stored under the :py:attr:`~xarray.DataArray.variable`` attribute. +A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``, +all of which are implemented by forwarding on to the underlying ``Variable`` object. + +In addition, a :py:class:`~xarray.DataArray` stores additional ``Variable`` objects stored in a dict under the private ``_coords`` attribute, +each of which is referred to as a "Coordinate Variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``, +and each dim has a specific length. This means that the full :py:attr:`~xarray.DataArray.size` of the dataarray can be represented by a dictionary mapping dimension names to integer sizes. +The underlying data variable has this exact same size, and the attached coordinate variables have sizes which are some subset of the size of the data variable. +Another way of saying this is that all coordinate variables must be "alignable" with the data variable. + +When a coordinate is accessed by the user (e.g. via the dict-like :py:class:`~xarray.DataArray.__getitem__` syntax), +then a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned. +This is why most users never see the ``Variable`` class underlying each coordinate variable - it is always promoted to a ``DataArray`` before returning. + +Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute. +Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space +(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``. +Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.sel` method) does not require consulting an ``Index`` object. + +Finally a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data +variable but is stored on the wrapping ``DataArray`` class. +The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset` +(e.g. via :py:meth:`~xarray.DataArray.to_dataset`). +Note that the underlying :py:class:`~xarray.core.Variable` objects are all unnamed, so they can always be referred to uniquely via a +dict-like mapping. + +.. _internal design.dataset: + +Dataset Objects +~~~~~~~~~~~~~~~ + +The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables. +Internally all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are +specified by storing their names in a private ``_coord_names`` dict. + +The dataset's ``dims`` are the set of all dims present across any variable, but (similar to in dataarrays) coordinate +variables cannot have a dimension that is not present on any data variable. + +When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible +coordinates before returning. + +.. _internal design.subclassing: + +.. note:: + + The way that selecting a variable from a ``DataArray`` or ``Dataset`` actually involves internally wrapping the + ``Variable`` object back up into a ``DataArray``/``Dataset`` is the primary reason :ref:`we recommend against subclassing ` + Xarray objects. The main problem it creates is that we currently cannot easily guarantee that for example selecting + a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead + of just an :py:class:`xarray.DataArray`. See `GH issue `_ for more details. + +.. _internal design.lazy indexing: + +Lazy Indexing Classes +--------------------- + +TODO diff --git a/doc/internals/variable-objects.rst b/doc/internals/variable-objects.rst deleted file mode 100644 index 6ae3c2f7e6d..00000000000 --- a/doc/internals/variable-objects.rst +++ /dev/null @@ -1,31 +0,0 @@ -Variable objects -================ - -The core internal data structure in xarray is the :py:class:`~xarray.Variable`, -which is used as the basic building block behind xarray's -:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A -``Variable`` consists of: - -- ``dims``: A tuple of dimension names. -- ``data``: The N-dimensional array (typically, a NumPy or Dask array) storing - the Variable's data. It must have the same number of dimensions as the length - of ``dims``. -- ``attrs``: An ordered dictionary of metadata associated with this array. By - convention, xarray's built-in operations never use this metadata. -- ``encoding``: Another ordered dictionary used to store information about how - these variable's data is represented on disk. See :ref:`io.encoding` for more - details. - -``Variable`` has an interface similar to NumPy arrays, but extended to make use -of named dimensions. For example, it uses ``dim`` in preference to an ``axis`` -argument for methods like ``mean``, and supports :ref:`compute.broadcasting`. - -However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not -include coordinate labels along each axis. - -``Variable`` is public API, but because of its incomplete support for labeled -data, it is mostly intended for advanced uses, such as in xarray itself or for -writing new backends. You can access the variable objects that correspond to -xarray objects via the (readonly) :py:attr:`Dataset.variables -` and -:py:attr:`DataArray.variable ` attributes.