add new backend api documentation #4810

Merged · 58 commits · Mar 8, 2021
Changes from 8 commits
(58 commits by aurghs and alexamici, Jan 14 – Mar 8, 2021)
197 changes: 197 additions & 0 deletions doc/internals.rst
@@ -231,3 +231,200 @@ re-open it directly with Zarr:
zgroup = zarr.open("rasm.zarr")
print(zgroup.tree())
dict(zgroup["Tair"].attrs)


How to add a new backend
------------------------------------

Adding a new backend for read support to Xarray is easy, and does not require
you to integrate any code into Xarray; all you need to do is follow these
steps:

- Create a class that inherits from :py:class:`~xarray.backends.common.BackendEntrypoint`
- Implement the method ``open_dataset`` that returns an instance of :py:class:`~xarray.Dataset`
- Declare such a class as an external plugin in your ``setup.py``.

Your ``BackendEntrypoint`` sub-class is the primary interface with Xarray, and
it should implement the following attributes and functions:

- ``open_dataset`` (mandatory)
- ``open_dataset_parameters`` (optional)
- ``guess_can_open`` (optional)

These are detailed below.
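Before going through each piece, here is a minimal sketch of what a
``BackendEntrypoint`` subclass could look like (``MyBackendEntrypoint`` and
the ``.my_format`` extension are invented for illustration):

```python
from xarray.backends.common import BackendEntrypoint


class MyBackendEntrypoint(BackendEntrypoint):
    """Hypothetical entrypoint for a made-up ``.my_format`` file type."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # open filename_or_obj and build an xarray.Dataset here
        raise NotImplementedError

    def guess_can_open(self, filename_or_obj):
        # claim only files with our extension
        return str(filename_or_obj).endswith(".my_format")
```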

open_dataset
++++++++++++

Inputs
^^^^^^

The backend ``open_dataset`` method takes as input one argument
(``filename_or_obj``) and one keyword argument (``drop_variables``):

- ``filename_or_obj``: can be a string containing a relative path or an instance of ``pathlib.Path``.
- ``drop_variables``: can be ``None`` or an iterable containing the variable names to be dropped when reading the data.

If it makes sense for your backend, your ``open_dataset`` method should
implement in its interface the following boolean keyword arguments, called
**decoders**, which default to ``None``:

- ``mask_and_scale=None``
- ``decode_times=None``
- ``decode_timedelta=None``
- ``use_cftime=None``
- ``concat_characters=None``
- ``decode_coords=None``

These keyword arguments are explicitly defined in the
:py:func:`~xarray.open_dataset` signature. Xarray will pass them to the
backend only if the user explicitly sets a value different from ``None``.
Your backend can also take as input a set of backend-specific keyword
arguments. All these keyword arguments can be passed to
:py:func:`~xarray.open_dataset`, grouped either via the ``backend_kwargs``
parameter or explicitly using the syntax ``**kwargs``.

Output
^^^^^^
The output of the backend ``open_dataset`` shall be an instance of
:py:class:`~xarray.Dataset` that implements the additional method ``close``,
used by Xarray to ensure the related files are eventually closed.

If you don't want to support lazy loading, then the :py:class:`~xarray.Dataset`
shall only contain :py:class:`numpy.ndarray` data and your work is almost done.

open_dataset_parameters
+++++++++++++++++++++++
``open_dataset_parameters`` is the list of backend ``open_dataset`` parameters.
It is not a mandatory parameter, and if the backend does not provide it
explicitly, Xarray creates a list of them automatically by inspecting the
backend signature.

Xarray uses ``open_dataset_parameters`` only when it needs to select
the **decoders** supported by the backend.

If ``open_dataset_parameters`` is not defined, but ``**kwargs`` and ``*args``
appear in the signature, Xarray raises an error.
On the other hand, if the backend provides ``open_dataset_parameters``,
then ``**kwargs`` and ``*args`` can be used in the signature.

However, this practice is discouraged unless there is a good reason for using
``**kwargs`` or ``*args``.

guess_can_open
++++++++++++++
``guess_can_open`` is used to identify the proper engine to open your data
file automatically in case the engine is not specified explicitly. If you are
not interested in supporting this feature, you can skip this step since
:py:class:`~xarray.backends.common.BackendEntrypoint` already provides a
default :py:meth:`~xarray.backends.common.BackendEntrypoint.guess_can_open`
that always returns ``False``.

Backend ``guess_can_open`` takes as input the ``filename_or_obj`` parameter of
:py:func:`~xarray.open_dataset`, and returns a boolean.
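A typical implementation checks the file extension; a sketch (the
``.my_format`` extension is invented, and non-path inputs are declined):

```python
import os


def guess_can_open(self, filename_or_obj):
    # hypothetical: claim only files with our extension
    try:
        _, ext = os.path.splitext(filename_or_obj)
    except TypeError:
        # filename_or_obj is not path-like (e.g. an open file object)
        return False
    return ext in {".my_format"}
```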


How to register a backend
+++++++++++++++++++++++++++

Define in your ``setup.py`` (or ``setup.cfg``) a new entrypoint with:

- group: ``xarray.backends``
- name: the name to be passed to :py:func:`~xarray.open_dataset` as ``engine``
- object reference: the reference of the class that you have implemented.

See https://packaging.python.org/specifications/entry-points/#data-model
for more information.
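For instance, assuming a hypothetical ``MyBackendEntrypoint`` class living in
a package ``my_package``, the ``setup.cfg`` entry could look like:

.. code-block:: cfg

    [options.entry_points]
    xarray.backends =
        my_engine = my_package.my_module:MyBackendEntrypoint

after which ``xr.open_dataset(path, engine="my_engine")`` selects your backend.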

How to support Lazy Loading
+++++++++++++++++++++++++++
If you want to make your backend effective with big datasets, then you should
support lazy loading.
Basically, you shall replace the :py:class:`numpy.ndarray` inside the
variables with a custom class:

.. code-block:: python

    backend_array = YourBackendArray()
    data = indexing.LazilyIndexedArray(backend_array)
    variable = Variable(..., data, ...)

where ``YourBackendArray`` is a class that inherits from
:py:class:`~xarray.backends.common.BackendArray` and
:py:class:`~xarray.core.indexing.LazilyIndexedArray` is an Xarray
class that wraps an array to make basic and outer indexing lazy.

BackendArray
^^^^^^^^^^^^
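This section is still a stub; as a hedged sketch of the expected shape (class
name and indexing support level invented for illustration), a ``BackendArray``
subclass wires ``__getitem__`` through
:py:func:`~xarray.core.indexing.explicit_indexing_adapter`:

```python
import numpy as np

from xarray.backends import BackendArray
from xarray.core import indexing


class MyBackendArray(BackendArray):
    """Hypothetical lazy array backed by a made-up on-disk format."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __getitem__(self, key):
        # translate any Xarray indexer into the basic indexing we support
        return indexing.explicit_indexing_adapter(
            key,
            self.shape,
            indexing.IndexingSupport.BASIC,
            self._raw_indexing_method,
        )

    def _raw_indexing_method(self, key):
        # hypothetical: a real backend would read only the region ``key``
        return np.zeros(self.shape, dtype=self.dtype)[key]
```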

CachingFileManager
^^^^^^^^^^^^^^^^^^
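Also a stub here; ``CachingFileManager`` helps backends keep a bounded cache
of open file handles. A minimal sketch, assuming only the
``xarray.backends.CachingFileManager`` API (the temporary file exists just to
make the example self-contained):

```python
import os
import tempfile

from xarray.backends import CachingFileManager

# create an empty throwaway file so the sketch is runnable
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# the manager re-opens the file transparently if it was evicted from the cache
manager = CachingFileManager(open, path, mode="r")
with manager.acquire_context() as f:
    content = f.read()
manager.close()
```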


Dask chunking
+++++++++++++
The backend is not directly involved in `Dask <http://dask.pydata.org/>`__
chunking, since it is managed internally by Xarray. However, the backend can
define the preferred chunk size inside the variable's encoding,
``var.encoding["preferred_chunks"]``.
The ``preferred_chunks`` may be useful to improve performance with lazy loading.
``preferred_chunks`` shall be a dictionary specifying the chunk size per
dimension, like ``{"dim1": 1000, "dim2": 2000}`` or
``{"dim1": [1000, 100], "dim2": [2000, 2000, 2000]}``.

The ``preferred_chunks`` is used by Xarray to define the chunk size in some
special cases:

- if ``chunks`` along a dimension is ``None`` or not defined
- if ``chunks`` is ``"auto"``

In the first case Xarray uses the chunk sizes specified in
``preferred_chunks``.
In the second case Xarray accommodates ideal chunk sizes, preserving if
possible the ``preferred_chunks``. The ideal chunk size is computed using
:py:func:`dask.array.core.normalize_chunks`, setting
``previous_chunks=preferred_chunks``.


Decoders
++++++++
The decoders implement specific operations to transform data from on-disk
representation to Xarray representation.

A classic example is the "time" variable decoding operation. In NetCDF, the
elements of the "time" variable are stored as integers, and the unit contains
an origin (for example: "seconds since 1970-1-1"). In this case, Xarray
transforms the pair integer-unit into :py:class:`numpy.datetime64` values.

The standard decoders implemented in Xarray are:

- ``strings.CharacterArrayCoder()``
- ``strings.EncodedStringCoder()``
- ``variables.UnsignedIntegerCoder()``
- ``variables.CFMaskCoder()``
- ``variables.CFScaleOffsetCoder()``
- ``times.CFTimedeltaCoder()``
- ``times.CFDatetimeCoder()``
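For instance, ``variables.CFMaskCoder`` turns a fill value into ``NaN`` on
decoding; a small sketch using a toy variable (the data and ``_FillValue``
are invented):

```python
import numpy as np
import xarray as xr
from xarray.coding.variables import CFMaskCoder

# a toy on-disk variable: -999 marks missing values
raw = xr.Variable(
    ("x",), np.array([1.0, 2.0, -999.0]), attrs={"_FillValue": -999.0}
)
decoded = CFMaskCoder().decode(raw)
```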

Some of the transformations can be common to more than one backend, so before
implementing a new decoder, be sure Xarray does not already implement it.

The backends can reuse Xarray's decoders, either instantiating the decoders
directly or using the higher-level function
:py:func:`~xarray.conventions.decode_cf_variables` that groups Xarray decoders.

In some cases, the transformation to apply strongly depends on the on-disk
data format. Therefore, you may need to implement your own decoder.

An example of such a case is when you have to deal with the time format of a
GRIB file. The GRIB format is very different from the NetCDF one: in GRIB, the
time is stored in two attributes, ``dataDate`` and ``dataTime``, as strings.
Therefore, it is not possible to reuse the Xarray time decoder, and
implementing a new one is mandatory.
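Such a decoder might look like the following sketch (``decode_grib_time`` and
its input conventions are invented for illustration):

```python
import numpy as np


def decode_grib_time(data_date, data_time):
    """Hypothetical decoder for GRIB-style time attributes.

    ``data_date`` is like "20210308" and ``data_time`` like "1200";
    leading zeros may be dropped for early hours, e.g. "600" for 06:00.
    """
    hhmm = str(data_time).zfill(4)
    iso = (
        f"{data_date[:4]}-{data_date[4:6]}-{data_date[6:8]}"
        f"T{hhmm[:2]}:{hhmm[2:]}"
    )
    return np.datetime64(iso)
```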

Decoders can be activated or deactivated using the boolean keywords of the
:py:func:`~xarray.open_dataset` signature: ``mask_and_scale``,
``decode_times``, ``decode_timedelta``, ``use_cftime``,
``concat_characters``, ``decode_coords``.

Such keywords are passed to the backend only if the user sets a value
different from ``None``. Note that the backend does not necessarily have to
implement all the decoders, but it shall declare in its ``open_dataset``
interface only the boolean keywords related to the supported decoders. The
backend shall implement the activation and deactivation of the supported
decoders.