
Feature request: Compute cross-correlation (similar to pd.Series.corr()) of gridded data #1115

Closed
hrishikeshac opened this issue Nov 13, 2016 · 31 comments · Fixed by #2350

@hrishikeshac

hrishikeshac commented Nov 13, 2016

As an earth scientist regularly dealing with 3D data (time, latitude, longitude), I believe it would be great to be able to perform cross-correlation on DataArrays by specifying the axis. Its usage could look like: a.corr(b, axis=0). It would be even more useful if the two arrays did not need to have the same dimensions (e.g. 'b' could be a time series).

Currently, the only way I am aware of to compute this is to loop through each grid cell, convert the time series to a pd.Series, and compute the correlation pair by pair. This takes a long time.
I would also appreciate suggestions for a faster algorithm.
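
For concreteness, a minimal sketch of the loop-based approach described above (names are hypothetical; assumes a DataArray da with dims (time, lat, lon) and a pandas Series ts on the same time index):

import numpy as np
import xarray as xr

def looped_corr(da, ts):
    # correlate each grid cell's time series with ts, one pair at a time
    out = np.full((da.sizes['lat'], da.sizes['lon']), np.nan)
    for i in range(da.sizes['lat']):
        for j in range(da.sizes['lon']):
            out[i, j] = da[:, i, j].to_series().corr(ts)
    return xr.DataArray(out, dims=('lat', 'lon'),
                        coords={'lat': da['lat'], 'lon': da['lon']})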

@shoyer
Member

shoyer commented Nov 13, 2016

The first step here is to find a library that implements the desired functionality on pure NumPy arrays, ideally in a vectorized fashion. Then it should be pretty straightforward to wrap in xarray.
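
As a sketch of what such a wrapper could look like, assuming a pure-NumPy Pearson correlation along the last axis (pearson_np and pearson are hypothetical names):

import numpy as np
import xarray as xr

def pearson_np(x, y):
    # vectorized Pearson correlation along the last axis of two NumPy arrays
    xm = x - x.mean(axis=-1, keepdims=True)
    ym = y - y.mean(axis=-1, keepdims=True)
    return (xm * ym).mean(axis=-1) / (x.std(axis=-1) * y.std(axis=-1))

def pearson(x, y, dim):
    # xr.apply_ufunc handles broadcasting and moves `dim` to the last axis
    return xr.apply_ufunc(pearson_np, x, y, input_core_dims=[[dim], [dim]])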

@rabernat
Contributor

I agree this would be very useful. But it is also feature creep. There is an extremely wide range of such functions that could hypothetically be put into the xarray package (all of scipy.signal, for example). At some point the community should decide on the intended scope of xarray itself vs. packages built on top of xarray.

@serazing

I agree with @rabernat in the sense that it could be part of another package (e.g., for signal processing). This would also allow the computation of statistical tests to assess the significance of the correlation (which is useful, since correlations are often misinterpreted without such tests).
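
For reference, the standard test converts Pearson's r to a t-statistic with n - 2 degrees of freedom; a minimal sketch using scipy.stats (corr_pvalue is a hypothetical helper, not part of xarray):

import numpy as np
from scipy import stats

def corr_pvalue(r, n):
    # two-sided p-value for Pearson's r under the null of zero correlation
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(np.abs(t), df=n - 2)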

@shoyer
Member

shoyer commented Nov 14, 2016

That said, correlation coefficients are a pretty fundamental operation for working with data. I could see implementing a basic corr in xarray and referring to a separate signal processing package for more options in the docstring.

@rabernat
Contributor

To be clear, I am not saying that this does not belong in xarray.

I'm saying that we lack clear general guidelines for how to determine whether a particular function belongs in xarray. The criterion of a "pretty fundamental operation for working with data" is a good starting point. I would add:

  • used across a wide range of scientific disciplines
  • clear, unambiguous / uncontroversial definition
  • numpy implementation already exists

corr meets all of these criteria. Many others (e.g. interpolation, convolution, curve fitting) do as well. Expanding xarray beyond the numpy ufuncs opens the door to supporting these things. I'm just saying it should be a conscious, deliberate decision, given the limits on developer time.

Many of these things will be pretty trivial once .apply() is here. So perhaps it's not a big deal.

@fmaussion
Member

fmaussion commented Dec 12, 2016

I'll chime in here to ask a usage question: what is the recommended way to compute correlation maps with xarray? I.e., I have a DataArray of dims (time, lat, lon) and I'd like to correlate every single grid point with a timeseries of dim (time) to get a correlation map of dim (lat, lon). My current strategy is a wonderfully unpythonic double loop over lons and lats, and I wonder if there's a better way?
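
One vectorized alternative, sketched under the assumption of NaN-free data, with da of dims (time, lat, lon) and ts of dim (time); broadcasting does the gridding:

def corr_map(da, ts, dim='time'):
    # demean along time; multiplying broadcasts (time, lat, lon) against (time,)
    cov = ((da - da.mean(dim)) * (ts - ts.mean(dim))).mean(dim)
    return cov / (da.std(dim) * ts.std(dim))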

@hrishikeshac
Author

FYI @shoyer @fmaussion , I had to revisit the problem and ended up writing a function to compute vectorized cross-correlation, covariance, regression calculations (along with p-value and standard error) for xr.DataArrays. Essentially, I tried to mimic scipy.stats.linregress() but for multi-dimensional data, and included the ability to compute lagged relationships. Here's the function and its demonstration; please feel free to incorporate it in xarray if deemed useful: https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

@sebhahn

sebhahn commented Dec 6, 2017

@hrishikeshac I was just looking for a function doing a regression between two datasets (x, y, time), so thanks for your function! However, I'm still wondering whether there is a much faster C (or Cython) implementation for this kind of thing?

@max-sixty
Collaborator

I'm up for adding .corr to xarray

What do we want this to look like? It's a bit different from most xarray functions, which either return the same shape or reduce one dimension.

  • The basic case here would take an n x m array and return an m x m correlation matrix. We could easily wrap https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
  • Another case would be to take two similarly sized arrays (with the option of broadcasting) and return an array with one dimension reduced; for example, 200 x 10 and 200 would return a 10 array.
  • I need to think about how those extrapolate to multiple dimensions.

Should I start with the first case and then we can expand as needed?

@shoyer
Member

shoyer commented Aug 31, 2018

I tend to view the second case as a generalization of the first case. I would also hesitate to implement the n x m array -> m x m correlation matrix version because xarray doesn't handle repeated dimensions well.

I think the basic implementation of this looks quite similar to what I wrote here for calculating the Pearson correlation as a NumPy gufunc:
http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization

The main difference is that we might naturally want to support summing over multiple dimensions at once via the dim argument, e.g., something like:

# untested!
import xarray

def covariance(x, y, dim=None):
    return xarray.dot(x - x.mean(dim), y - y.mean(dim), dims=dim)

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
If you want to achieve the equivalent of np.corrcoef on an array with dimensions ('n', 'm') with this, you just write something like correlation(x, x.rename({'m': 'm2'}), dim='n').
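
A concrete sketch of that rename trick, assuming a corrected correlation such as the adjusted drafts further down the thread:

import numpy as np
import xarray as xr

x = xr.DataArray(np.random.randn(200, 10), dims=('n', 'm'))
corr_matrix = correlation(x, x.rename({'m': 'm2'}), dim='n')
# corr_matrix has dims ('m', 'm2') and matches
# np.corrcoef(x.values, rowvar=False) up to floating-point error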

@hrishikeshac
Author

Some time back I wrote a package based on xarray for this. I would be happy to be involved in implementing it in xarray as well, but I am new to contributing to such a large-scale project and it looks a bit intimidating!

@max-sixty
Collaborator

@hrishikeshac if you'd like to contribute, we can help you along - xarray is a v welcoming project!

And from mvstats it looks like you're already up to speed

Let us know

@hrishikeshac
Author

@max-sixty thanks!

Then I will start by testing @shoyer 's suggestion and mvstats for the basic implementation.

@max-sixty
Collaborator

Great! Ping me / the issues with any questions at all!

@max-sixty
Collaborator

max-sixty commented Nov 7, 2018

For posterity, I made a small adjustment to @shoyer 's draft:

# untested!
import xarray as xr

def covariance(x, y, dim=None):
    # need to ensure the dim lengths are the same - i.e. no auto-aligning
    # could use count - 1 for a sample covariance
    return xr.dot(x - x.mean(dim), y - y.mean(dim), dims=dim) / x.count(dim)

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))

@max-sixty
Collaborator

And one that handles NaNs:

# untested!
import xarray as xr

def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)

    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)

    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
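
A quick sanity check of this draft on NaN-free data (note that with NaNs present, x.std is not masked to the jointly valid values, so results can drift slightly from pandas):

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.randn(100), dims='time')
b = a + 0.5 * xr.DataArray(np.random.randn(100), dims='time')
assert np.allclose(correlation(a, b, dim='time'),
                   np.corrcoef(a.values, b.values)[0, 1])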

@rabernat
Contributor

Hey @hrishikeshac -- any progress on this? Need any help / advice from xarray devs?

@hrishikeshac
Author

Sorry for the radio silence. I will work on this next week. Thanks @max-sixty for the updates and @rabernat for reaching out; I will let you know if I need help.

Should we keep it simple, following @max-sixty, or should I also add functionality to handle lagged correlations?

@dcherian
Contributor

dcherian commented Dec 7, 2018

I think lagged correlations would be a useful feature.

@max-sixty
Collaborator

Yes, they'd be useful, but I'm not sure whether they should be on the same method. They're also fairly easy for a user to construct (call correlation on a .shift copy of the array), as sketched below.
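
A sketch of that construction (lagged_correlation is a hypothetical name, built on the correlation drafts above):

def lagged_correlation(x, y, lag, dim='time'):
    # shift y by `lag` steps along `dim`; .shift pads with NaN,
    # which the NaN-aware covariance above ignores
    return correlation(x, y.shift({dim: lag}), dim=dim)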

And increments are easy to build on! I'm the worst offender, but don't let completeness get in the way of incremental improvement

(OK, I'll go and finish the fill_value branch...)

@hrishikeshac
Author

Okay. I am writing the simultaneous (non-lagged) correlation and covariance functions in dataarray.py instead of dataset.py, following a pd.Series.corr-like signature: corr(self, other, dim).

@hrishikeshac
Author

hrishikeshac commented Jan 3, 2019

Okay, here's what I have come up with. I have tested it against two 1-D DataArrays, two N-D DataArrays, and one 1-D with one N-D DataArray, in all cases misaligned and with missing values.

Before going forward,

  1. What do you think of it? Any improvements?
  2. Steps 1 and 2 (broadcasting and ignoring common missing values) are identical in both cov() and corr(). Is there a better way to reduce the duplication while keeping both functions standalone?
def cov(self, other, dim = None):
    """Compute covariance between two DataArray objects along a shared dimension.

    Parameters
    ----------
    other: DataArray
        The other array with which the covariance will be computed
    dim: str, optional
        The dimension along which the covariance will be computed

    Returns
    -------
    covariance: DataArray
    """
    # 1. Broadcast the two arrays
    self, other     = xr.broadcast(self, other)
    
    # 2. Ignore the nans
    valid_values    = self.notnull() & other.notnull()
    self            = self.where(valid_values, drop=True)
    other           = other.where(valid_values, drop=True)
    valid_count     = valid_values.sum(dim)
    
    # 3. Demean along the given dim
    demeaned_self   = self - self.mean(dim = dim)
    demeaned_other  = other - other.mean(dim = dim)

    # 4. Compute covariance along the given dim
    if dim:
        axis = self.get_axis_num(dim = dim)
    else:
        axis = None
    cov             = np.sum(demeaned_self*demeaned_other, axis=axis)/(valid_count)
    
    return cov

def corr(self, other, dim = None):
    """Compute correlation between two DataArray objects along a shared dimension.

    Parameters
    ----------
    other: DataArray
        The other array with which the correlation will be computed
    dim: str, optional
        The dimension along which the correlation will be computed

    Returns
    -------
    correlation: DataArray
    """
    # 1. Broadcast the two arrays
    self, other     = xr.broadcast(self, other)
    
    # 2. Ignore the nans
    valid_values    = self.notnull() & other.notnull()
    self            = self.where(valid_values, drop=True)
    other           = other.where(valid_values, drop=True)
    
    # 3. Compute correlation based on standard deviations and cov()
    self_std        = self.std(dim=dim)
    other_std       = other.std(dim=dim)
    
    return cov(self, other, dim = dim)/(self_std*other_std)

For testing:

    import numpy as np
    import xarray as xr

    # self: Load demo data and trim its size
    ds  = xr.tutorial.load_dataset('air_temperature')
    air = ds.air[:18,...]
    # other: select misaligned data, and smooth it to dampen the correlation with self
    air_smooth = ds.air[2:20,...].rolling(time=3, center=True).mean()
    # A handy function to select an example grid point
    def select_pts(da):
        return da.sel(lat=45, lon=250)

    # Test #1: Misaligned 1-D DataArrays with missing values
    ts1 = select_pts(air.copy())
    ts2 = select_pts(air_smooth.copy())

    def pd_corr(ts1,ts2):
        """Ensure the ts are aligned and missing values ignored"""
        # ts1,ts2 = xr.align(ts1,ts2)
        valid_values = ts1.notnull() & ts2.notnull()

        ts1  = ts1.where(valid_values, drop = True)
        ts2  = ts2.where(valid_values, drop = True)

        return ts1.to_series().corr(ts2.to_series())

    expected = pd_corr(ts1, ts2)
    actual   = corr(ts1,ts2)
    np.allclose(expected, actual)

    # Test #2: Misaligned N-D DataArrays with missing values
    actual_ND = corr(air,air_smooth, dim = 'time')
    actual = select_pts(actual_ND)
    np.allclose(expected, actual)

    # Test #3: One 1-D DataArray and one N-D DataArray; misaligned, with missing values
    actual_ND = corr(air_smooth,ts1, dim = 'time')
    actual    = select_pts(actual_ND)
    np.allclose(actual, expected)

@max-sixty
Collaborator

@hrishikeshac that looks great! Well done for getting an MVP running.

Do you want to do a PR from this? Should be v close from here.

Others can comment there. I'd suggest we get something close to this in and iterate. One open question: how abstract do we want the dimensions to be? (Currently we can only pass one dimension in, which is fine, but potentially we could enable multiple.)

One nit: no need to use np.sum, which may cause issues with dask arrays; .sum will work fine.

@hrishikeshac
Author

PR done!
Changed np.sum() to dataarray.sum()

@patrickcgray

I see that this PR never made it through, and there is a somewhat similar PR finished here: #2350, though it doesn't do exactly what was proposed here. Is there a suggested approach for performing cross-correlation on multiple DataArrays?

@max-sixty
Collaborator

Would be great to get this in, if anyone wants to have a go. A small, focused PR would be a good start.

In the meantime you can use one of the solutions above...

@hrishikeshac
Author

Guys, sorry for dropping the ball on this one. I made some changes to the PR based on the feedback I got, but I couldn't figure out the tests. Would anyone like to take this over?

@r-beer

r-beer commented Nov 19, 2019

I am also highly interested in this function and in contributing to xarray in general!

If I understand correctly, #2350 and #2652 do not resolve this issue, do they?

How can I help finish these PRs?

@max-sixty
Collaborator

@r-beer would be great to finish this off! I think this would be a popular feature. You could take @hrishikeshac 's code (which is close!) and make the final changes.

@r-beer

r-beer commented Nov 19, 2019

> @r-beer would be great to finish this off! I think this would be a popular feature. You could take @hrishikeshac 's code (which is close!) and make the final changes.

OK, that means making #2652 pass, right?

I downloaded the respective branch from @hrishikeshac and ran the tests locally.

See respective discussion in #2652.

@max-sixty
Collaborator

@r-beer I checked back on this and realized I didn't reply to your question: yes re completing #2652, if you're up for giving this a push
