Expose apply_ufunc as public API and add documentation #1619

Merged: 7 commits, Oct 20, 2017
1 change: 1 addition & 0 deletions doc/api.rst
@@ -14,6 +14,7 @@ Top-level functions
.. autosummary::
:toctree: generated/

apply_ufunc
align
broadcast
concat
104 changes: 99 additions & 5 deletions doc/computation.rst
@@ -303,15 +303,21 @@ Datasets support most of the same methods found on data arrays:
ds.mean(dim='x')
abs(ds)

Unfortunately, we currently do not support NumPy ufuncs for datasets [1]_.
Member: currently

:py:meth:`~xarray.Dataset.apply` works around this
limitation by applying the given function to each variable in the dataset:

.. ipython:: python

ds.apply(np.sin)

You can also use the wrapped functions in the ``xarray.ufuncs`` module:

.. ipython:: python

import xarray.ufuncs as xu
xu.sin(ds)

Datasets also use looping over variables for *broadcasting* in binary
arithmetic. You can do arithmetic between any ``DataArray`` and a dataset:

@@ -329,5 +335,93 @@ Arithmetic between two datasets matches data variables of the same name:
Similarly to index based alignment, the result has the intersection of all
matching data variables.
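To make the intersection behavior concrete, here is a minimal sketch; the dataset names ``ds1``/``ds2`` and their contents are invented for illustration:

```python
import numpy as np
import xarray as xr

# Two datasets sharing one data variable name ("a"); "b" and "c" appear
# in only one dataset each.
ds1 = xr.Dataset({'a': ('x', [1.0, 2.0]), 'b': ('x', [3.0, 4.0])})
ds2 = xr.Dataset({'a': ('x', [10.0, 20.0]), 'c': ('x', [5.0, 6.0])})

# Binary arithmetic matches data variables by name: only the
# intersection ("a") appears in the result.
result = ds1 + ds2
print(list(result.data_vars))
print(result['a'].values)
```

Variables without a match (``b`` and ``c`` here) are dropped rather than filled with missing values.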

.. [1] This was previously due to a limitation of NumPy, but with NumPy 1.13
we should be able to support this by leveraging ``__array_ufunc__``
(:issue:`1617`).

.. _computation.wrapping-custom:

Wrapping custom computation
===========================

It doesn't always make sense to do computation directly with xarray objects:

- When working with small arrays (less than ~1e7 elements), applying an
  operation with xarray can be significantly slower. Keeping track of labels
  and ensuring their consistency adds overhead, and xarray's high-level
  label-based APIs remove low-level control over the implementation of
  operations. Also, xarray's core itself is not especially fast, because it's
  written in Python rather than a compiled language like C.
- Even if speed doesn't matter, it can be important to wrap existing code, or
  to support alternative interfaces that don't use xarray objects.

Collaborator: Is the point on speed a distraction? If an array is that small, the absolute difference in speed is still very small, so it really only makes a difference if you're doing those operations in a loop.

Member (author): Agreed. I had "in the inner loop" in mind when I wrote this, but I see now that that never made it into the text. Let me know if this latest update seems more reasonable to you, or if I'm still pushing too hard on performance considerations.

For these reasons, it is often well-advised to write low-level routines that
work with NumPy arrays, and to wrap these routines to work with xarray objects.
However, adding support for labels on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` can be a bit of a chore.

To make this easier, xarray supplies the :py:func:`~xarray.apply_ufunc` helper
function, designed for wrapping functions that support broadcasting and
vectorization on unlabeled arrays in the style of a NumPy
`universal function <https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html>`_ ("ufunc" for short).
``apply_ufunc`` takes care of everything needed for an idiomatic xarray wrapper,
including alignment, broadcasting, looping over ``Dataset`` variables (if
needed), and merging of coordinates. In fact, many internal xarray
functions/methods are written using ``apply_ufunc``.

Simple functions that act independently on each value should work without
any additional arguments:

.. ipython:: python

squared_error = lambda x, y: (x - y) ** 2
arr1 = xr.DataArray([0, 1, 2, 3], dims='x')
xr.apply_ufunc(squared_error, arr1, 1)
Member: xr.apply_ufunc


For more complex operations that consider some array values collectively,
it's important to understand the idea of "core dimensions" from NumPy's
`generalized ufuncs <http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html>`_. Core dimensions are defined as dimensions
that should *not* be broadcast over. Usually, they correspond to the fundamental
dimensions over which an operation is defined, e.g., the summed axis in
``np.sum``. A good clue that core dimensions are needed is the presence of an
``axis`` argument on the corresponding NumPy function.
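The correspondence between a core dimension and an ``axis`` argument can be seen in plain NumPy; the array shape below is made up for illustration:

```python
import numpy as np

# A stack of three 4-vectors: the last axis is the "core" dimension over
# which the norm is defined; the first axis is simply looped/broadcast over.
x = np.arange(12.0).reshape(3, 4)

# Telling np.linalg.norm which axis is the core dimension reduces over it,
# leaving one norm per vector in the stack.
norms = np.linalg.norm(x, axis=-1)
print(norms.shape)
```

This is exactly the reduction pattern that ``apply_ufunc`` automates by moving named core dimensions to the end.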

With ``apply_ufunc``, core dimensions are recognized by name, and then moved to
the last dimension of any input arguments before applying the given function.
This means that for functions that accept an ``axis`` argument, you usually need
to set ``axis=-1``. As an example, here is how we would wrap
:py:func:`numpy.linalg.norm` to calculate the vector norm:

.. code-block:: python

def vector_norm(x, dim, ord=None):
return xr.apply_ufunc(np.linalg.norm, x,
input_core_dims=[[dim]],
kwargs={'ord': ord, 'axis': -1})

.. ipython:: python
:suppress:

def vector_norm(x, dim, ord=None):
return xr.apply_ufunc(np.linalg.norm, x,
input_core_dims=[[dim]],
kwargs={'ord': ord, 'axis': -1})

.. ipython:: python

vector_norm(arr1, dim='x')

Because ``apply_ufunc`` follows a standard convention for ufuncs, it plays
nicely with tools for building vectorized functions, like
:func:`numpy.broadcast_arrays` and :func:`numpy.vectorize`. For high-performance
needs, consider using Numba's `vectorize and guvectorize <http://numba.pydata.org/numba-doc/dev/user/vectorize.html>`_.
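As a sketch of the :func:`numpy.vectorize` interplay, a scalar-only function (the ``scalar_clip`` helper here is hypothetical) can be vectorized and then wrapped with ``apply_ufunc``:

```python
import numpy as np
import xarray as xr

# A function defined only for scalars; np.vectorize turns it into a
# broadcasting, ufunc-like callable that apply_ufunc can wrap directly.
def scalar_clip(value, low, high):
    return min(max(value, low), high)

vectorized_clip = np.vectorize(scalar_clip)

arr = xr.DataArray([-2.0, 0.5, 3.0], dims='x')
clipped = xr.apply_ufunc(vectorized_clip, arr, 0.0, 1.0)
print(clipped.values)  # values clipped into [0, 1]
```

Because ``np.vectorize`` loops in Python, this gains convenience rather than speed; Numba's decorators compile the same pattern.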

In addition to wrapping functions, ``apply_ufunc`` can automatically parallelize
many functions when using dask by setting ``dask='parallelized'``. This is
illustrated in a separate example.

.. TODO: add link!
Member: I'm thinking the recipes would be a good place for this: http://xarray.pydata.org/en/stable/auto_gallery/index.html

Member (author): Yes, if I can come up with a figure!


:py:func:`~xarray.apply_ufunc` also supports some advanced options for
controlling alignment of variables and the form of the result. See the
docstring for full details and more examples.
7 changes: 6 additions & 1 deletion doc/whats-new.rst
@@ -75,6 +75,11 @@ Backward Incompatible Changes
Enhancements
~~~~~~~~~~~~

- New helper function :py:func:`~xarray.apply_ufunc` for wrapping functions
written to work on NumPy arrays to support labels on xarray objects.
``apply_ufunc`` also supports automatic parallelization for many functions
with dask. See :ref:`computation.wrapping-custom` for details.

Member: Don't forget:

(:issue:`XXXX`). By `Stephan Hoyer <https://github.com/shoyer>`_.

- Support for ``pathlib.Path`` objects added to
:py:func:`~xarray.open_dataset`, :py:func:`~xarray.open_mfdataset`,
:py:func:`~xarray.to_netcdf`, and :py:func:`~xarray.save_mfdataset`
@@ -232,7 +237,7 @@ Bug fixes
The previous behavior unintentionally caused additional tests to be skipped
(:issue:`1531`). By `Joe Hamman <https://github.com/jhamman>`_.

- Fix pynio backend for upcoming release of pynio with python3 support
(:issue:`1611`). By `Ben Hillman <https://github.com/brhillman>`_.

.. _whats-new.0.9.6:
2 changes: 1 addition & 1 deletion xarray/__init__.py
@@ -6,7 +6,7 @@
from .core.alignment import align, broadcast, broadcast_arrays
from .core.common import full_like, zeros_like, ones_like
from .core.combine import concat, auto_combine
from .core.computation import where
from .core.computation import apply_ufunc, where
from .core.extensions import (register_dataarray_accessor,
register_dataset_accessor)
from .core.variable import as_variable, Variable, IndexVariable, Coordinate