forked from pydata/xarray

Commit

Merge remote-tracking branch 'upstream/master' into fix-sparse
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
dcherian committed May 25, 2020
2 parents ba7b47a + d1f7cb8 commit c9a4205
Showing 18 changed files with 598 additions and 50 deletions.
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -29,6 +29,8 @@ Top-level functions
full_like
zeros_like
ones_like
cov
corr
dot
polyval
map_blocks
40 changes: 39 additions & 1 deletion doc/plotting.rst
@@ -13,7 +13,7 @@ labels can also be used to easily create informative plots.
xarray's plotting capabilities are centered around
:py:class:`DataArray` objects.
To plot :py:class:`Dataset` objects
-simply access the relevant DataArrays, ie ``dset['var1']``.
+simply access the relevant DataArrays, i.e. ``dset['var1']``.
Dataset specific plotting routines are also available (see :ref:`plot-dataset`).
Here we focus mostly on arrays 2d or larger. If your data fits
nicely into a pandas DataFrame then you're better off using one of the more
@@ -209,6 +209,44 @@ entire figure (as for matplotlib's ``figsize`` argument).

.. _plotting.multiplelines:

=========================
Determine x-axis values
=========================

By default, dimension coordinates are used for the x-axis (here, the time coordinates).
However, you can also use non-dimension coordinates, MultiIndex levels, and dimensions
without coordinates along the x-axis. To illustrate this, let's calculate a 'decimal day' (epoch)
from the time and assign it as a non-dimension coordinate:

.. ipython:: python

    decimal_day = (air1d.time - air1d.time[0]) / pd.Timedelta('1d')
    air1d_multi = air1d.assign_coords(decimal_day=("time", decimal_day))
    air1d_multi

To use ``'decimal_day'`` as the x coordinate, it must be specified explicitly:

.. ipython:: python

    air1d_multi.plot(x="decimal_day")

After creating a new MultiIndex named ``'date'`` from ``'time'`` and ``'decimal_day'``,
it is also possible to use a MultiIndex level as the x-axis:

.. ipython:: python

    air1d_multi = air1d_multi.set_index(date=("time", "decimal_day"))
    air1d_multi.plot(x="decimal_day")

Finally, if a dataset does not have any coordinates, the data points are simply enumerated
along the x-axis:

.. ipython:: python

    air1d_multi = air1d_multi.drop("date")
    air1d_multi.plot()

The same applies to the 2D plots below.
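As a quick illustration of the 2D case, here is a minimal sketch (an editor's addition, not part of this commit) that puts a non-dimension coordinate on the x-axis of a 2D plot; it assumes the ``air_temperature`` tutorial dataset can be downloaded:

```python
# Sketch (not part of this commit): a non-dimension coordinate on the
# x-axis of a 2D plot; assumes the tutorial dataset is downloadable.
import pandas as pd
import xarray as xr

air = xr.tutorial.open_dataset("air_temperature").air
decimal_day = (air.time - air.time[0]) / pd.Timedelta("1d")
air = air.assign_coords(decimal_day=("time", decimal_day))

# Fix one longitude to get a 2D (time, lat) slice, then plot it with
# 'decimal_day' instead of 'time' on the x-axis.
air.isel(lon=0).plot(x="decimal_day", y="lat")
```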

====================================================
Multiple lines showing variation along a dimension
====================================================
19 changes: 18 additions & 1 deletion doc/whats-new.rst
@@ -34,8 +34,21 @@ Breaking changes
(:pull:`3274`)
By `Elliott Sales de Andrade <https://github.com/QuLogic>`_

Enhancements
~~~~~~~~~~~~
- Performance improvement of :py:meth:`DataArray.interp` and :py:meth:`Dataset.interp`.
  For orthogonal linear and nearest-neighbor interpolation, 1D interpolation is performed
  sequentially rather than interpolating in multidimensional space. (:issue:`2223`)
By `Keisuke Fujii <https://github.com/fujiisoup>`_.

New Features
~~~~~~~~~~~~

- ``chunks='auto'`` is now supported in the ``chunks`` argument of
:py:meth:`Dataset.chunk`. (:issue:`4055`)
  By `Andrew Williams <https://github.com/AndrewWilliams3142>`_.
- Added :py:func:`xarray.cov` and :py:func:`xarray.corr` (:issue:`3784`, :pull:`3550`, :pull:`4089`).
By `Andrew Williams <https://github.com/AndrewWilliams3142>`_ and `Robin Beer <https://github.com/r-beer>`_.
- Added :py:meth:`DataArray.polyfit` and :py:func:`xarray.polyval` for fitting polynomials. (:issue:`3349`)
By `Pascal Bourgault <https://github.com/aulemahal>`_.
- Control over attributes of result in :py:func:`merge`, :py:func:`concat`,
@@ -63,6 +76,8 @@ New Features
By `Stephan Hoyer <https://github.com/shoyer>`_.
- Allow plotting of boolean arrays. (:pull:`3766`)
  By `Marek Jacob <https://github.com/MeraX>`_.
- Enable using MultiIndex levels as coordinates in 1D and 2D plots (:issue:`3927`).
By `Mathias Hauser <https://github.com/mathause>`_.
- A ``days_in_month`` accessor for :py:class:`xarray.CFTimeIndex`, analogous to
the ``days_in_month`` accessor for a :py:class:`pandas.DatetimeIndex`, which
returns the days in the month each datetime in the index. Now days in month
@@ -121,6 +136,8 @@ Bug fixes
- Fix bug in time parsing failing to fall back to cftime. This was causing time
variables with a time unit of `'msecs'` to fail to parse. (:pull:`3998`)
By `Ryan May <https://github.com/dopplershift>`_.
- Fix weighted mean when passing boolean weights (:issue:`4074`).
By `Mathias Hauser <https://github.com/mathause>`_.
- Fix html repr in untrusted notebooks: fallback to plain text repr. (:pull:`4053`)
By `Benoit Bovy <https://github.com/benbovy>`_.

@@ -188,7 +205,7 @@ New Features

- Weighted array reductions are now supported via the new :py:meth:`DataArray.weighted`
and :py:meth:`Dataset.weighted` methods. See :ref:`comput.weighted`. (:issue:`422`, :pull:`2922`).
-  By `Mathias Hauser <https://github.com/mathause>`_
+  By `Mathias Hauser <https://github.com/mathause>`_.
- The new jupyter notebook repr (``Dataset._repr_html_`` and
``DataArray._repr_html_``) (introduced in 0.14.1) is now on by default. To
disable, use ``xarray.set_options(display_style="text")``.
4 changes: 3 additions & 1 deletion xarray/__init__.py
@@ -17,7 +17,7 @@
from .core.alignment import align, broadcast
from .core.combine import auto_combine, combine_by_coords, combine_nested
from .core.common import ALL_DIMS, full_like, ones_like, zeros_like
-from .core.computation import apply_ufunc, dot, polyval, where
+from .core.computation import apply_ufunc, corr, cov, dot, polyval, where
from .core.concat import concat
from .core.dataarray import DataArray
from .core.dataset import Dataset
@@ -54,6 +54,8 @@
"concat",
"decode_cf",
"dot",
"cov",
"corr",
"full_like",
"load_dataarray",
"load_dataset",
180 changes: 179 additions & 1 deletion xarray/core/computation.py
@@ -24,7 +24,7 @@
import numpy as np

from . import dtypes, duck_array_ops, utils
-from .alignment import deep_align
+from .alignment import align, deep_align
from .merge import merge_coordinates_without_align
from .options import OPTIONS
from .pycompat import dask_array_type
@@ -1069,6 +1069,184 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None, ddof=1):
"""
Compute covariance between two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray object
Array to compute.
da_b: DataArray object
Array to compute.
dim : str, optional
        The dimension along which the covariance will be computed.
ddof: int, optional
        If ddof=1, covariance is normalized by N-1, giving an unbiased estimate;
        else normalization is by N.

Returns
-------
covariance: DataArray

See also
--------
pandas.Series.cov: corresponding pandas function
xr.corr: respective function to calculate correlation

Examples
--------
>>> da_a = DataArray(np.array([[1, 2, 3], [0.1, 0.2, 0.3], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_a
<xarray.DataArray (space: 3, time: 3)>
array([[1. , 2. , 3. ],
[0.1, 0.2, 0.3],
[3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    >>> da_b = DataArray(np.array([[0.2, 0.4, 0.6], [15, 10, 5], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_b
<xarray.DataArray (space: 3, time: 3)>
array([[ 0.2, 0.4, 0.6],
[15. , 10. , 5. ],
[ 3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>> xr.cov(da_a, da_b)
<xarray.DataArray ()>
array(-3.53055556)
>>> xr.cov(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([ 0.2, -0.5, 1.69333333])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""
from .dataarray import DataArray

if any(not isinstance(arr, DataArray) for arr in [da_a, da_b]):
raise TypeError(
"Only xr.DataArray is supported."
"Given {}.".format([type(arr) for arr in [da_a, da_b]])
)

return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="cov")


def corr(da_a, da_b, dim=None):
"""
Compute the Pearson correlation coefficient between
two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray object
Array to compute.
da_b: DataArray object
Array to compute.
    dim : str, optional
        The dimension along which the correlation will be computed.

Returns
-------
correlation: DataArray

See also
--------
pandas.Series.corr: corresponding pandas function
xr.cov: underlying covariance function

Examples
--------
>>> da_a = DataArray(np.array([[1, 2, 3], [0.1, 0.2, 0.3], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_a
<xarray.DataArray (space: 3, time: 3)>
array([[1. , 2. , 3. ],
[0.1, 0.2, 0.3],
[3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    >>> da_b = DataArray(np.array([[0.2, 0.4, 0.6], [15, 10, 5], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_b
<xarray.DataArray (space: 3, time: 3)>
array([[ 0.2, 0.4, 0.6],
[15. , 10. , 5. ],
[ 3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>> xr.corr(da_a, da_b)
<xarray.DataArray ()>
array(-0.57087777)
>>> xr.corr(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([ 1., -1., 1.])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""
from .dataarray import DataArray

if any(not isinstance(arr, DataArray) for arr in [da_a, da_b]):
raise TypeError(
"Only xr.DataArray is supported."
"Given {}.".format([type(arr) for arr in [da_a, da_b]])
)

return _cov_corr(da_a, da_b, dim=dim, method="corr")


def _cov_corr(da_a, da_b, dim=None, ddof=0, method=None):
"""
    Internal method for xr.cov() and xr.corr(), so we only have to
    sanitize the input arrays once and don't repeat code.
"""
    # 1. Align the two arrays along their common dimensions (inner join)
da_a, da_b = align(da_a, da_b, join="inner", copy=False)

# 2. Ignore the nans
valid_values = da_a.notnull() & da_b.notnull()

if not valid_values.all():
da_a = da_a.where(valid_values)
da_b = da_b.where(valid_values)

valid_count = valid_values.sum(dim) - ddof

    # 3. Demean along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)

# 4. Compute covariance along the given dim
# N.B. `skipna=False` is required or there is a bug when computing
# auto-covariance. E.g. Try xr.cov(da,da) for
# da = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
cov = (demeaned_da_a * demeaned_da_b).sum(dim=dim, skipna=False) / (valid_count)

if method == "cov":
return cov

else:
# compute std + corr
da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)
corr = cov / (da_a_std * da_b_std)
return corr


def dot(*arrays, dims=None, **kwargs):
"""Generalized dot product for xarray objects. Like np.einsum, but
provides a simpler interface based on array dimensions.
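To see ``xr.cov`` and ``xr.corr`` end to end, here is a short sanity check (an editor's sketch, not part of this commit) comparing them with their NumPy counterparts on arbitrary random data:

```python
# Sketch (not part of this commit): cross-check xr.cov / xr.corr against
# NumPy along one shared dimension, using arbitrary random data.
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
da_a = xr.DataArray(rng.random((3, 5)), dims=("space", "time"))
da_b = xr.DataArray(rng.random((3, 5)), dims=("space", "time"))

# Covariance along "time" (ddof=1 by default, as in pandas).
cov = xr.cov(da_a, da_b, dim="time")
expected = np.cov(da_a.isel(space=0), da_b.isel(space=0))[0, 1]
np.testing.assert_allclose(cov.isel(space=0), expected)

# Pearson correlation is normalization-independent, so it matches np.corrcoef.
corr = xr.corr(da_a, da_b, dim="time")
expected = np.corrcoef(da_a.isel(space=0), da_b.isel(space=0))[0, 1]
np.testing.assert_allclose(corr.isel(space=0), expected)
```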
9 changes: 6 additions & 3 deletions xarray/core/dataset.py
@@ -1707,7 +1707,10 @@ def chunks(self) -> Mapping[Hashable, Tuple[int, ...]]:
def chunk(
self,
chunks: Union[
-            None, Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]
+            None,
+            Number,
+            str,
+            Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]],
] = None,
name_prefix: str = "xarray-",
token: str = None,
@@ -1725,7 +1728,7 @@ def chunk(
Parameters
----------
-        chunks : int or mapping, optional
+        chunks : int, 'auto' or mapping, optional
Chunk sizes along each dimension, e.g., ``5`` or
``{'x': 5, 'y': 5}``.
name_prefix : str, optional
@@ -1742,7 +1745,7 @@ def chunk(
"""
from dask.base import tokenize

-        if isinstance(chunks, Number):
+        if isinstance(chunks, (Number, str)):
chunks = dict.fromkeys(self.dims, chunks)

if chunks is not None:
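A brief usage sketch for the new ``chunks='auto'`` support added above (an editor's addition, not part of this commit; assumes dask is installed):

```python
# Sketch (not part of this commit): with the change above, a bare string
# is broadcast to every dimension, so dask picks the chunk sizes.
import numpy as np
import xarray as xr

ds = xr.Dataset({"t": (("x", "y"), np.zeros((1000, 1000)))})

auto = ds.chunk("auto")  # same as ds.chunk({"x": "auto", "y": "auto"})
print(auto.chunks)       # chunk sizes chosen by dask

# Explicit sizes and "auto" can be mixed per dimension.
mixed = ds.chunk({"x": 100, "y": "auto"})
print(mixed.chunks)
```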
15 changes: 14 additions & 1 deletion xarray/core/missing.py
@@ -619,6 +619,19 @@ def interp(var, indexes_coords, method, **kwargs):
# default behavior
kwargs["bounds_error"] = kwargs.get("bounds_error", False)

    # check if the interpolation can be done in an orthogonal manner
if (
len(indexes_coords) > 1
and method in ["linear", "nearest"]
and all(dest[1].ndim == 1 for dest in indexes_coords.values())
and len(set([d[1].dims[0] for d in indexes_coords.values()]))
== len(indexes_coords)
):
# interpolate sequentially
for dim, dest in indexes_coords.items():
var = interp(var, {dim: dest}, method, **kwargs)
return var

# target dimensions
dims = list(indexes_coords)
x, new_x = zip(*[indexes_coords[d] for d in dims])
@@ -659,7 +672,7 @@ def interp_func(var, x, new_x, method, kwargs):
New coordinates. Should not contain NaN.
method: string
{'linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic'} for
-        1-dimensional itnterpolation.
+        1-dimensional interpolation.
{'linear', 'nearest'} for multidimensional interpolation
**kwargs:
Optional keyword arguments to be passed to scipy.interpolator
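The fast path added above applies when every target coordinate is 1D and lives on its own dimension; in that case sequential 1D interpolation is mathematically identical to interpolating in the full space. A sketch (an editor's addition, not part of this commit; requires scipy):

```python
# Sketch (not part of this commit): for orthogonal linear interpolation,
# one joint call equals two sequential 1D calls (requires scipy).
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("x", "y"),
    coords={"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 2.0, 3.0]},
)

joint = da.interp(x=[0.5, 1.5], y=[0.25, 2.75], method="linear")
seq = da.interp(x=[0.5, 1.5], method="linear").interp(y=[0.25, 2.75], method="linear")

np.testing.assert_allclose(joint.values, seq.values)
```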
9 changes: 8 additions & 1 deletion xarray/core/weighted.py
@@ -142,7 +142,7 @@ def _sum_of_weights(
# we need to mask data values that are nan; else the weights are wrong
mask = da.notnull()

-        sum_of_weights = self._reduce(mask, self.weights, dim=dim, skipna=False)
+        # bool -> int, because ``xr.dot([True, True], [True, True])`` -> True
+        # (and not 2); GH4074
+        if self.weights.dtype == bool:
+            sum_of_weights = self._reduce(
+                mask, self.weights.astype(int), dim=dim, skipna=False
+            )
+        else:
+            sum_of_weights = self._reduce(mask, self.weights, dim=dim, skipna=False)

# 0-weights are not valid
valid_weights = sum_of_weights != 0.0
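The GH4074 bug is easy to reproduce (an editor's sketch, not part of this commit): before the fix, ``xr.dot`` of two boolean arrays returned ``True`` rather than a count, so the weight sum collapsed to 1 and inflated the mean:

```python
# Sketch (not part of this commit): boolean weights now act as 0/1 weights.
import xarray as xr

data = xr.DataArray([1.0, 1.0])
weights = xr.DataArray([True, True])

# Before the fix: the sum of weights evaluated to True (treated as 1),
# so the "mean" came out as 2.0. With the fix it is the expected 1.0.
print(data.weighted(weights).mean())
```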
