forked from pydata/xarray

Commit

Merge remote-tracking branch 'upstream/master' into fix-sparse
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
dcherian committed May 25, 2020
2 parents ba7b47a + d1f7cb8 commit c9a4205
Showing 18 changed files with 598 additions and 50 deletions.
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -29,6 +29,8 @@ Top-level functions
full_like
zeros_like
ones_like
cov
corr
dot
polyval
map_blocks
40 changes: 39 additions & 1 deletion doc/plotting.rst
@@ -13,7 +13,7 @@ labels can also be used to easily create informative plots.
xarray's plotting capabilities are centered around
:py:class:`DataArray` objects.
To plot :py:class:`Dataset` objects
-simply access the relevant DataArrays, ie ``dset['var1']``.
+simply access the relevant DataArrays, i.e. ``dset['var1']``.
Dataset specific plotting routines are also available (see :ref:`plot-dataset`).
Here we focus mostly on arrays 2d or larger. If your data fits
nicely into a pandas DataFrame then you're better off using one of the more
@@ -209,6 +209,44 @@ entire figure (as for matplotlib's ``figsize`` argument).

.. _plotting.multiplelines:

=========================
Determine x-axis values
=========================

By default, dimension coordinates are used for the x-axis (here, the time coordinates).
However, you can also use non-dimension coordinates, MultiIndex levels, and dimensions
without coordinates along the x-axis. To illustrate this, let's calculate a 'decimal day' (epoch)
from the time and assign it as a non-dimension coordinate:

.. ipython:: python

    decimal_day = (air1d.time - air1d.time[0]) / pd.Timedelta('1d')
    air1d_multi = air1d.assign_coords(decimal_day=("time", decimal_day))
    air1d_multi

To use ``'decimal_day'`` as the x coordinate, it must be specified explicitly:

.. ipython:: python

    air1d_multi.plot(x="decimal_day")

After creating a new MultiIndex named ``'date'`` from ``'time'`` and ``'decimal_day'``,
it is also possible to use a MultiIndex level as the x-axis:

.. ipython:: python

    air1d_multi = air1d_multi.set_index(date=("time", "decimal_day"))
    air1d_multi.plot(x="decimal_day")

Finally, if a dataset does not have any coordinates, the data points are simply enumerated
along the x-axis:

.. ipython:: python

    air1d_multi = air1d_multi.drop("date")
    air1d_multi.plot()

The same applies to the 2D plots below.
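As a quick illustration of the 2D case, here is a minimal sketch (an editor's addition, not part of this commit) that puts a non-dimension coordinate on the x-axis of a 2D plot; it assumes the ``air_temperature`` tutorial dataset can be downloaded:

```python
# Sketch (not part of this commit): a non-dimension coordinate on the
# x-axis of a 2D plot; assumes the tutorial dataset is downloadable.
import pandas as pd
import xarray as xr

air = xr.tutorial.open_dataset("air_temperature").air
decimal_day = (air.time - air.time[0]) / pd.Timedelta("1d")
air = air.assign_coords(decimal_day=("time", decimal_day))

# Fix one longitude to get a 2D (time, lat) slice, then plot it with
# 'decimal_day' instead of 'time' on the x-axis.
air.isel(lon=0).plot(x="decimal_day", y="lat")
```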

====================================================
Multiple lines showing variation along a dimension
====================================================
19 changes: 18 additions & 1 deletion doc/whats-new.rst
@@ -34,8 +34,21 @@ Breaking changes
(:pull:`3274`)
By `Elliott Sales de Andrade <https://github.com/QuLogic>`_

Enhancements
~~~~~~~~~~~~
- Performance improvement of :py:meth:`DataArray.interp` and :py:meth:`Dataset.interp`.
  For orthogonal linear and nearest-neighbor interpolation, 1D interpolation is performed
  sequentially rather than interpolating in multidimensional space. (:issue:`2223`)
By `Keisuke Fujii <https://github.com/fujiisoup>`_.

New Features
~~~~~~~~~~~~

- ``chunks='auto'`` is now supported in the ``chunks`` argument of
:py:meth:`Dataset.chunk`. (:issue:`4055`)
  By `Andrew Williams <https://github.com/AndrewWilliams3142>`_.
- Added :py:func:`xarray.cov` and :py:func:`xarray.corr` (:issue:`3784`, :pull:`3550`, :pull:`4089`).
By `Andrew Williams <https://github.com/AndrewWilliams3142>`_ and `Robin Beer <https://github.com/r-beer>`_.
- Added :py:meth:`DataArray.polyfit` and :py:func:`xarray.polyval` for fitting polynomials. (:issue:`3349`)
By `Pascal Bourgault <https://github.com/aulemahal>`_.
- Control over attributes of result in :py:func:`merge`, :py:func:`concat`,
@@ -63,6 +76,8 @@ New Features
By `Stephan Hoyer <https://github.com/shoyer>`_.
- Allow plotting of boolean arrays. (:pull:`3766`)
  By `Marek Jacob <https://github.com/MeraX>`_.
- Enable using MultiIndex levels as coordinates in 1D and 2D plots (:issue:`3927`).
By `Mathias Hauser <https://github.com/mathause>`_.
- A ``days_in_month`` accessor for :py:class:`xarray.CFTimeIndex`, analogous to
the ``days_in_month`` accessor for a :py:class:`pandas.DatetimeIndex`, which
returns the days in the month each datetime in the index. Now days in month
@@ -121,6 +136,8 @@ Bug fixes
- Fix bug in time parsing failing to fall back to cftime. This was causing time
variables with a time unit of `'msecs'` to fail to parse. (:pull:`3998`)
By `Ryan May <https://github.com/dopplershift>`_.
- Fix weighted mean when passing boolean weights (:issue:`4074`).
By `Mathias Hauser <https://github.com/mathause>`_.
- Fix html repr in untrusted notebooks: fallback to plain text repr. (:pull:`4053`)
By `Benoit Bovy <https://github.com/benbovy>`_.

@@ -188,7 +205,7 @@ New Features

- Weighted array reductions are now supported via the new :py:meth:`DataArray.weighted`
and :py:meth:`Dataset.weighted` methods. See :ref:`comput.weighted`. (:issue:`422`, :pull:`2922`).
-  By `Mathias Hauser <https://github.com/mathause>`_
+  By `Mathias Hauser <https://github.com/mathause>`_.
- The new jupyter notebook repr (``Dataset._repr_html_`` and
``DataArray._repr_html_``) (introduced in 0.14.1) is now on by default. To
disable, use ``xarray.set_options(display_style="text")``.
4 changes: 3 additions & 1 deletion xarray/__init__.py
@@ -17,7 +17,7 @@
from .core.alignment import align, broadcast
from .core.combine import auto_combine, combine_by_coords, combine_nested
from .core.common import ALL_DIMS, full_like, ones_like, zeros_like
-from .core.computation import apply_ufunc, dot, polyval, where
+from .core.computation import apply_ufunc, corr, cov, dot, polyval, where
from .core.concat import concat
from .core.dataarray import DataArray
from .core.dataset import Dataset
@@ -54,6 +54,8 @@
"concat",
"decode_cf",
"dot",
"cov",
"corr",
"full_like",
"load_dataarray",
"load_dataset",
180 changes: 179 additions & 1 deletion xarray/core/computation.py
@@ -24,7 +24,7 @@
import numpy as np

from . import dtypes, duck_array_ops, utils
-from .alignment import deep_align
+from .alignment import align, deep_align
from .merge import merge_coordinates_without_align
from .options import OPTIONS
from .pycompat import dask_array_type
@@ -1069,6 +1069,184 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None, ddof=1):
"""
Compute covariance between two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray object
Array to compute.
da_b: DataArray object
Array to compute.
dim : str, optional
        The dimension along which the covariance will be computed.
ddof: int, optional
        If ddof=1, covariance is normalized by N-1, giving an unbiased estimate;
        else normalization is by N.

Returns
-------
covariance: DataArray

See also
--------
pandas.Series.cov: corresponding pandas function
xr.corr: respective function to calculate correlation

Examples
--------
>>> da_a = DataArray(np.array([[1, 2, 3], [0.1, 0.2, 0.3], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_a
<xarray.DataArray (space: 3, time: 3)>
array([[1. , 2. , 3. ],
[0.1, 0.2, 0.3],
[3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    >>> da_b = DataArray(np.array([[0.2, 0.4, 0.6], [15, 10, 5], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_b
<xarray.DataArray (space: 3, time: 3)>
array([[ 0.2, 0.4, 0.6],
[15. , 10. , 5. ],
[ 3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>> xr.cov(da_a, da_b)
<xarray.DataArray ()>
array(-3.53055556)
>>> xr.cov(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([ 0.2, -0.5, 1.69333333])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""
from .dataarray import DataArray

if any(not isinstance(arr, DataArray) for arr in [da_a, da_b]):
raise TypeError(
"Only xr.DataArray is supported."
"Given {}.".format([type(arr) for arr in [da_a, da_b]])
)

return _cov_corr(da_a, da_b, dim=dim, ddof=ddof, method="cov")


def corr(da_a, da_b, dim=None):
"""
Compute the Pearson correlation coefficient between
two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray object
Array to compute.
da_b: DataArray object
Array to compute.
    dim : str, optional
        The dimension along which the correlation will be computed.

Returns
-------
correlation: DataArray

See also
--------
pandas.Series.corr: corresponding pandas function
xr.cov: underlying covariance function

Examples
--------
>>> da_a = DataArray(np.array([[1, 2, 3], [0.1, 0.2, 0.3], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_a
<xarray.DataArray (space: 3, time: 3)>
array([[1. , 2. , 3. ],
[0.1, 0.2, 0.3],
[3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    >>> da_b = DataArray(np.array([[0.2, 0.4, 0.6], [15, 10, 5], [3.2, 0.6, 1.8]]),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=3))])
>>> da_b
<xarray.DataArray (space: 3, time: 3)>
array([[ 0.2, 0.4, 0.6],
[15. , 10. , 5. ],
[ 3.2, 0.6, 1.8]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
>>> xr.corr(da_a, da_b)
<xarray.DataArray ()>
array(-0.57087777)
>>> xr.corr(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([ 1., -1., 1.])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""
from .dataarray import DataArray

if any(not isinstance(arr, DataArray) for arr in [da_a, da_b]):
raise TypeError(
"Only xr.DataArray is supported."
"Given {}.".format([type(arr) for arr in [da_a, da_b]])
)

return _cov_corr(da_a, da_b, dim=dim, method="corr")


def _cov_corr(da_a, da_b, dim=None, ddof=0, method=None):
"""
    Internal method for xr.cov() and xr.corr(), so we only have to
    sanitize the input arrays once and don't repeat code.
"""
    # 1. Align the two arrays along their common dimensions (inner join)
da_a, da_b = align(da_a, da_b, join="inner", copy=False)

# 2. Ignore the nans
valid_values = da_a.notnull() & da_b.notnull()

if not valid_values.all():
da_a = da_a.where(valid_values)
da_b = da_b.where(valid_values)

valid_count = valid_values.sum(dim) - ddof

    # 3. Demean along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)

# 4. Compute covariance along the given dim
# N.B. `skipna=False` is required or there is a bug when computing
# auto-covariance. E.g. Try xr.cov(da,da) for
# da = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
cov = (demeaned_da_a * demeaned_da_b).sum(dim=dim, skipna=False) / (valid_count)

if method == "cov":
return cov

else:
# compute std + corr
da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)
corr = cov / (da_a_std * da_b_std)
return corr


def dot(*arrays, dims=None, **kwargs):
"""Generalized dot product for xarray objects. Like np.einsum, but
provides a simpler interface based on array dimensions.
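To see ``xr.cov`` and ``xr.corr`` end to end, here is a short sanity check (an editor's sketch, not part of this commit) comparing them with their NumPy counterparts on arbitrary random data:

```python
# Sketch (not part of this commit): cross-check xr.cov / xr.corr against
# NumPy along one shared dimension, using arbitrary random data.
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
da_a = xr.DataArray(rng.random((3, 5)), dims=("space", "time"))
da_b = xr.DataArray(rng.random((3, 5)), dims=("space", "time"))

# Covariance along "time" (ddof=1 by default, as in pandas).
cov = xr.cov(da_a, da_b, dim="time")
expected = np.cov(da_a.isel(space=0), da_b.isel(space=0))[0, 1]
np.testing.assert_allclose(cov.isel(space=0), expected)

# Pearson correlation is normalization-independent, so it matches np.corrcoef.
corr = xr.corr(da_a, da_b, dim="time")
expected = np.corrcoef(da_a.isel(space=0), da_b.isel(space=0))[0, 1]
np.testing.assert_allclose(corr.isel(space=0), expected)
```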
9 changes: 6 additions & 3 deletions xarray/core/dataset.py
@@ -1707,7 +1707,10 @@ def chunks(self) -> Mapping[Hashable, Tuple[int, ...]]:
def chunk(
self,
chunks: Union[
-            None, Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]
+            None,
+            Number,
+            str,
+            Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]],
] = None,
name_prefix: str = "xarray-",
token: str = None,
@@ -1725,7 +1728,7 @@ def chunk(
Parameters
----------
-        chunks : int or mapping, optional
+        chunks : int, 'auto' or mapping, optional
Chunk sizes along each dimension, e.g., ``5`` or
``{'x': 5, 'y': 5}``.
name_prefix : str, optional
@@ -1742,7 +1745,7 @@ def chunk(
"""
from dask.base import tokenize

-        if isinstance(chunks, Number):
+        if isinstance(chunks, (Number, str)):
chunks = dict.fromkeys(self.dims, chunks)

if chunks is not None:
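A brief usage sketch for the new ``chunks='auto'`` support added above (an editor's addition, not part of this commit; assumes dask is installed):

```python
# Sketch (not part of this commit): with the change above, a bare string
# is broadcast to every dimension, so dask picks the chunk sizes.
import numpy as np
import xarray as xr

ds = xr.Dataset({"t": (("x", "y"), np.zeros((1000, 1000)))})

auto = ds.chunk("auto")  # same as ds.chunk({"x": "auto", "y": "auto"})
print(auto.chunks)       # chunk sizes chosen by dask

# Explicit sizes and "auto" can be mixed per dimension.
mixed = ds.chunk({"x": 100, "y": "auto"})
print(mixed.chunks)
```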
15 changes: 14 additions & 1 deletion xarray/core/missing.py
@@ -619,6 +619,19 @@ def interp(var, indexes_coords, method, **kwargs):
# default behavior
kwargs["bounds_error"] = kwargs.get("bounds_error", False)

    # check if the interpolation can be done in an orthogonal manner
if (
len(indexes_coords) > 1
and method in ["linear", "nearest"]
and all(dest[1].ndim == 1 for dest in indexes_coords.values())
and len(set([d[1].dims[0] for d in indexes_coords.values()]))
== len(indexes_coords)
):
# interpolate sequentially
for dim, dest in indexes_coords.items():
var = interp(var, {dim: dest}, method, **kwargs)
return var

# target dimensions
dims = list(indexes_coords)
x, new_x = zip(*[indexes_coords[d] for d in dims])
@@ -659,7 +672,7 @@ def interp_func(var, x, new_x, method, kwargs):
New coordinates. Should not contain NaN.
method: string
{'linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic'} for
-        1-dimensional itnterpolation.
+        1-dimensional interpolation.
{'linear', 'nearest'} for multidimensional interpolation
**kwargs:
Optional keyword arguments to be passed to scipy.interpolator
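The fast path added above applies when every target coordinate is 1D and lives on its own dimension; in that case sequential 1D interpolation is mathematically identical to interpolating in the full space. A sketch (an editor's addition, not part of this commit; requires scipy):

```python
# Sketch (not part of this commit): for orthogonal linear interpolation,
# one joint call equals two sequential 1D calls (requires scipy).
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("x", "y"),
    coords={"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 2.0, 3.0]},
)

joint = da.interp(x=[0.5, 1.5], y=[0.25, 2.75], method="linear")
seq = da.interp(x=[0.5, 1.5], method="linear").interp(y=[0.25, 2.75], method="linear")

np.testing.assert_allclose(joint.values, seq.values)
```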
9 changes: 8 additions & 1 deletion xarray/core/weighted.py
@@ -142,7 +142,7 @@ def _sum_of_weights(
# we need to mask data values that are nan; else the weights are wrong
mask = da.notnull()

-        sum_of_weights = self._reduce(mask, self.weights, dim=dim, skipna=False)
+        # bool -> int, because ``xr.dot([True, True], [True, True])`` -> True
+        # (and not 2); GH4074
+        if self.weights.dtype == bool:
+            sum_of_weights = self._reduce(
+                mask, self.weights.astype(int), dim=dim, skipna=False
+            )
+        else:
+            sum_of_weights = self._reduce(mask, self.weights, dim=dim, skipna=False)

# 0-weights are not valid
valid_weights = sum_of_weights != 0.0
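The GH4074 bug is easy to reproduce (an editor's sketch, not part of this commit): before the fix, ``xr.dot`` of two boolean arrays returned ``True`` rather than a count, so the weight sum collapsed to 1 and inflated the mean:

```python
# Sketch (not part of this commit): boolean weights now act as 0/1 weights.
import xarray as xr

data = xr.DataArray([1.0, 1.0])
weights = xr.DataArray([True, True])

# Before the fix: the sum of weights evaluated to True (treated as 1),
# so the "mean" came out as 2.0. With the fix it is the expected 1.0.
print(data.weighted(weights).mean())
```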
