Feature request: Compute cross-correlation (similar to pd.Series.corr()) of gridded data #1115
Comments
The first step here is to find a library that implements the desired functionality on pure NumPy arrays, ideally in a vectorized fashion. Then it should be pretty straightforward to wrap in xarray.
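As a concrete illustration of the kind of vectorized NumPy building block meant here (a hypothetical sketch, not an existing library function — the name `pearson_correlation` is my own), a Pearson correlation along one axis needs only broadcasting:

```python
import numpy as np

def pearson_correlation(x, y, axis=-1):
    """Vectorized Pearson correlation of two equally shaped arrays
    along one axis, using plain NumPy broadcasting."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # demean along the reduction axis, keeping dims for broadcasting
    xm = x - x.mean(axis=axis, keepdims=True)
    ym = y - y.mean(axis=axis, keepdims=True)
    # population covariance divided by population stds; the ddof
    # factor cancels, so this equals the usual Pearson r
    cov = (xm * ym).mean(axis=axis)
    return cov / (x.std(axis=axis) * y.std(axis=axis))
```

A gufunc like this correlates a whole stack of series in one call; wrapping it for named dimensions would be the xarray-side work.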
I agree this would be very useful. But it is also feature creep. There is an extremely wide range of such functions that could hypothetically be put into the xarray package (all of scipy.signal, for example). At some point the community should decide what the intended scope of xarray itself is versus packages built on top of xarray.
I agree with @rabernat in the sense that it could be part of another package (e.g., signal processing). This would also allow the computation of statistical tests to assess the significance of the correlation (which is useful, since correlation may often be misinterpreted without statistical tests).
That said, correlation coefficients are a pretty fundamental operation for working with data. I could see implementing a basic […]
To be clear, I am not saying that this does not belong in xarray. I'm saying that we lack clear general guidelines for how to determine whether a particular function belongs in xarray. The criterion of a "pretty fundamental operation for working with data" is a good starting point. I would add: […]
Many of these things will be pretty trivial once […]
I'll chime in here to ask a usage question: what is the recommended way to compute correlation maps with xarray? I.e. I have a DataArray of dims […]
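One way to build such a correlation map without looping over grid cells (a hedged plain-NumPy sketch; `correlation_map` is a hypothetical helper, not part of xarray) is to demean along time and contract the series against the cube with `tensordot`:

```python
import numpy as np

def correlation_map(cube, series):
    """Correlate a (time, lat, lon) cube with a 1-D time series,
    returning a (lat, lon) map of Pearson coefficients."""
    cube = np.asarray(cube, dtype=float)
    series = np.asarray(series, dtype=float)
    n = series.shape[0]
    # anomalies relative to the time mean
    cube_anom = cube - cube.mean(axis=0)
    series_anom = series - series.mean()
    # contract over the time axis: result has shape (lat, lon)
    cov = np.tensordot(series_anom, cube_anom, axes=(0, 0)) / n
    return cov / (cube.std(axis=0) * series.std())
```

In xarray terms the same contraction is what `xarray.dot` over the time dimension does, with alignment handled for you.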
FYI @shoyer @fmaussion, I had to revisit the problem and ended up writing a function to compute vectorized cross-correlation, covariance, and regression calculations (along with p-value and standard error) for xr.DataArrays. Essentially, I tried to mimic scipy.stats.linregress(), but for multi-dimensional data, and included the ability to compute lagged relationships. Here's the function and its demonstration; please feel free to incorporate it in xarray if deemed useful: https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html
@hrishikeshac I was just looking for a function doing a regression between two datasets (x, y, time), so thanks for your function! However, I'm still wondering whether there is a much faster C (or Cython) implementation for doing these kinds of things?
I'm up for adding […]. What do you want this to look like? It's a bit different from most xarray functions, which either return the same shape or reduce one dimension.
Should I start with the first case and then we can expand as needed?
I tend to view the second case as a generalization of the first case. I would also hesitate to implement the […]. I think the basic implementation of this looks quite similar to what I wrote here for calculating the Pearson correlation as a NumPy gufunc. The main difference is that we might naturally want to support summing over multiple dimensions at once via the […] argument:

```python
# untested!
def covariance(x, y, dim=None):
    return xarray.dot(x - x.mean(dim), y - y.mean(dim), dims=dim)

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
```

If you want to achieve the equivalent of […]
Sometime back I wrote a package based on xarray regarding this. I would be happy to be involved in implementing it in xarray as well, but I am new to contributing to such a large-scale project and it looks a bit intimidating!
@hrishikeshac if you'd like to contribute, we can help you along - xarray is a very welcoming project! And from […]. Let us know!
@max-sixty thanks! Then I will start by testing @shoyer 's suggestion and […]
Great! Ping me / the issues with any questions at all!
For posterity, I made a small adjustment to @shoyer 's draft:

```python
# untested!
def covariance(x, y, dim=None):
    # need to ensure the dim lengths are the same - i.e. no auto-aligning
    # could use count - 1 for a sample covariance
    return xr.dot(x - x.mean(dim), y - y.mean(dim), dims=dim) / x.count(dim)

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
```
And one that handles NaNs:

```python
# untested!
def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)
    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
```
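For readers who want to sanity-check this NaN-handling idea outside xarray, here is my own plain-NumPy re-derivation of it (not code from this thread): zero-filling the demeaned arrays makes the dot product sum only over pairwise-complete positions.

```python
import numpy as np

def nan_covariance(x, y):
    """Pairwise-complete covariance of two 1-D arrays: positions where
    either input is NaN contribute zero to the dot product, and only
    jointly valid positions are counted."""
    valid = ~np.isnan(x) & ~np.isnan(y)
    n = valid.sum()
    # NaN - mean is NaN, and nan_to_num zeros it out, mirroring fillna(0)
    dx = np.nan_to_num(x - np.nanmean(x))
    dy = np.nan_to_num(y - np.nanmean(y))
    return np.dot(dx, dy) / n

def nan_correlation(x, y):
    return nan_covariance(x, y) / (np.nanstd(x) * np.nanstd(y))
```

With no NaNs present this reduces exactly to the ordinary Pearson correlation, which is an easy property to test.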
Hey @hrishikeshac -- any progress on this? Need any help / advice from xarray devs?
Sorry for the radio silence - I will work on this next week. Thanks @max-sixty for the updates and @rabernat for reaching out; I will let you know if I need help. Should we keep it simple following @max-sixty, or should I also add the functionality to handle lagged correlations?
I think lagged correlations would be a useful feature.
Yes for useful, but not sure whether they should be on the same method. They're also fairly easy for a user to construct (call correlation on a […]). And increments are easy to build on! I'm the worst offender, but don't let completeness get in the way of incremental improvement (OK, I'll go and finish the […])
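The shift-then-correlate construction mentioned here can be sketched in plain NumPy (`lagged_correlation` is a hypothetical name; an xarray version would use `.shift()` on a named dimension instead of slicing):

```python
import numpy as np

def lagged_correlation(x, y, lag=0):
    """Pearson correlation of x against a lagged copy of y:
    a positive lag correlates x[t] with y[t + lag]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # trim both ends so the two slices stay aligned and equal-length
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]
```

Sweeping `lag` over a small range then gives a lagged cross-correlation function, which is the feature being discussed.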
Okay. I am writing the simultaneous correlation and covariance functions in dataarray.py instead of dataset.py, following the pd.Series.corr(self, other, dim) style.
Okay. Here's what I have come up with. I have tested it against two 1-D DataArrays, two N-D DataArrays, and one 1-D DataArray with one N-D DataArray, in all cases with misaligned coordinates and missing values. Before going forward, […]

For testing: […]
@hrishikeshac that looks great! Well done for getting an MVP running. Do you want to do a PR from this? It should be very close from here, and others can comment from there. I'd suggest we get something close to this in and iterate from there. How abstract do we want the dimensions to be? (Currently we can only pass one dimension in, which is fine, but potentially we could enable multiple.) One nit - no need to use […]
PR done!
I see that this PR never made it through, and there is a somewhat similar PR finished here: #2350 - though it doesn't do exactly what was proposed in this PR. Is there a suggested approach for performing cross-correlation on multiple DataArrays?
Would be great to get this in, if anyone wants to have a go. A small, focused PR would be a good start. In the meantime, you can use one of the solutions above...
Guys, sorry for dropping the ball on this one. I made some changes to the PR based on the feedback I got, but I couldn't figure out the tests. Would anyone like to take this over?
@r-beer it would be great to finish this off! I think this would be a popular feature. You could take @hrishikeshac 's code (which is close!) and make the final changes.
OK, that means to make #2652 pass, right? I downloaded the respective branch from @hrishikeshac and ran the tests locally. See the respective discussion in #2652.
As an earth scientist regularly dealing with 3-D data (time, latitude, longitude), I believe it would be great to be able to perform cross-correlation on DataArrays by specifying the axis. Its usage could look like: a.corr(b, axis=0). It would be even more useful if the two arrays need not have the same dimensions (e.g. 'b' could be a time series).
Currently, the only way to compute this that I am aware of is by looping through each grid cell, converting the time series to a pd.Series, and then computing the correlation. This takes a long time.
Would also appreciate suggestions for a faster algorithm.