Add updated `xr_regression` function for multi-dimensional linear regression #1226

robbibt · 2024-05-09T06:49:31Z

Proposed changes

This PR updates the older lag_linregress_3d function into a new and improved xr_regression function for calculating useful regression statistics between two multi-dimensional xarray datasets, including:

Regression slope, standard error, intercept
P-value
Correlation and covariance
R-squared

For example, the function can be used to calculate regressions between two 3D datasets (e.g. time, x, y), or between a 3D (time, x, y) dataset and a 1D dataset (time).

This PR includes tests that verify that the results produced by this function are identical to those produced by the scipy.stats.linregress function (including for "two-sided", "less" and "greater" alternative hypotheses).
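As a rough illustration of the kind of array maths involved (a minimal NumPy sketch, not the code added in this PR), per-pixel regression statistics can be computed along the time axis without any loops. The function name `regression_along_axis0` and the returned dictionary keys are made up for this example; the covariance line mirrors the expression `cov = ((x - xmean) * (y - ymean)).sum(dim=dim) / (n)` quoted later in this thread.

```python
import numpy as np


def regression_along_axis0(x, y):
    """Per-pixel OLS of y on x along axis 0 (e.g. time), using array maths only.

    Hypothetical helper for illustration; not the xr_regression implementation.
    """
    x = np.asarray(x, dtype="float64")
    y = np.asarray(y, dtype="float64")

    # Let a 1D x (time,) broadcast against a multi-dimensional y (time, y, x)
    if x.ndim == 1 and y.ndim > 1:
        x = x.reshape((-1,) + (1,) * (y.ndim - 1))
    x, y = np.broadcast_arrays(x, y)

    n = x.shape[0]
    xmean, ymean = x.mean(axis=0), y.mean(axis=0)

    # Population covariance and standard deviations (both divide by n)
    cov = ((x - xmean) * (y - ymean)).sum(axis=0) / n
    xstd, ystd = x.std(axis=0), y.std(axis=0)

    cor = cov / (xstd * ystd)
    slope = cov / xstd**2
    intercept = ymean - xmean * slope

    return {"cov": cov, "cor": cor, "r2": cor**2, "slope": slope, "intercept": intercept}


# 1D predictor (time) against a small 3D response (time, y, x)
t = np.arange(10.0)
rng = np.random.default_rng(0)
y3d = 2.0 * t[:, None, None] + 1.0 + rng.normal(size=(10, 2, 3))
stats = regression_along_axis0(t, y3d)
print(stats["slope"].shape)
```

Every output has the shape of a single time slice, so one call yields a slope, intercept and correlation for each pixel.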

cbur24 · 2024-05-17T04:02:55Z

@robbibt Awesome! I was just looking for something like this function. A possible enhancement: any interest in including options for robust regression instead of OLS? This can be especially important for satellite time-series regression, where slopes can be influenced by outlier values. Theil-Sen slopes with a Mann-Kendall test are a good example of robust regression.

scipy theil-slopes: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.theilslopes.html

An example wrapper implemented for xarray here: https://github.com/josuemtzmo/xarrayMannKendall/blob/master/xarrayMannKendall/xarrayMannKendall.py

An example from my work implementing the above wrapper, go down to In [11]
https://github.com/cbur24/AusENDVI/blob/main/notebooks/analysis/Trends_in_Seasonality.ipynb
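To make the outlier concern concrete, here is a minimal single-series sketch of a Theil-Sen slope (the median of all pairwise slopes) next to an ordinary least-squares fit. It is written in plain NumPy for illustration only; `theil_sen_slope` is a hypothetical helper, not the scipy or xarrayMannKendall implementation linked above.

```python
import numpy as np


def theil_sen_slope(x, y):
    """Median of all pairwise slopes (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype="float64")
    y = np.asarray(y, dtype="float64")

    # Indices of every unique pair (i < j)
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]

    # Skip pairs with identical x values to avoid dividing by zero
    keep = dx != 0
    return np.median((y[j] - y[i])[keep] / dx[keep])


# A clean linear trend with a single large outlier at the end
t = np.arange(20.0)
y = 0.5 * t + 3.0
y[-1] += 50.0

ols_slope = np.polyfit(t, y, 1)[0]
robust_slope = theil_sen_slope(t, y)
print(ols_slope, robust_slope)
```

The pairwise approach needs on the order of n² slopes per pixel, which hints at why robust methods tend to cost more time and memory than the closed-form OLS statistics.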

robbibt · 2024-05-19T23:59:47Z

Thanks @cbur24! Expanding this to include multiple regression methods would be super neat - we could have a simple param like method= to change the internal func and still return a consistent set of outputs.

The current implementation is designed to be almost completely vectorised array maths, which makes it really quick and non-memory hungry. It would be neat to be able to also do that across different regression methods - at first glance the xarrayMannKendall implementation above seems pretty complex and maybe not something that could be easily switched to array maths. I wonder if a simple addition at this stage could be adding something like robust outlier detection, e.g. with MAD or RANSAC or similar? Not quite as good as a dedicated robust regression, but could make it a bit more useful for really noisy data...

cbur24 · 2024-05-20T00:27:03Z

Totally agree that it would be a complex addition, and from reading how seaborn/pandas does robust regression it seems as though most robust regression techniques are slow and/or memory intensive. In light of that, having the option for outlier detection and removal sounds like a great (lightweight) addition that would do 95% of what robust regression does but without the overhead.

robbibt · 2024-05-21T00:17:24Z

@cbur24 Have added a simple MAD outlier detection function and params in xr_regression to apply it (753b368) - I'm not sure it's the best approach as the outlier detection doesn't take account of the relationship between the two variables (it's univariate only). But it's there anyway as an experimental feature, so we can see if it's useful! 🙂 Definitely keen to expand this func to robust regression in the future though, so if you ever want to raise a PR, feel free!
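For readers unfamiliar with MAD-based screening, the following is a minimal sketch of a univariate median absolute deviation test applied along one axis. It assumes the common "modified z-score" formulation; the function name `mad_outliers`, the `threshold=3.5` default and the 0.6745 scaling constant are assumptions for illustration and may differ from what commit 753b368 actually does.

```python
import numpy as np


def mad_outliers(values, threshold=3.5, axis=0):
    """Flag values whose modified z-score exceeds `threshold` along `axis`.

    Hypothetical helper; univariate only, so it ignores any relationship
    between the two regression variables.
    """
    values = np.asarray(values, dtype="float64")

    # Median and median absolute deviation, ignoring NaNs
    median = np.nanmedian(values, axis=axis, keepdims=True)
    abs_dev = np.abs(values - median)
    mad = np.nanmedian(abs_dev, axis=axis, keepdims=True)

    # 0.6745 makes the score comparable to a z-score for normally distributed data.
    # Where MAD is zero, any non-zero deviation becomes inf and is flagged.
    with np.errstate(divide="ignore", invalid="ignore"):
        modified_z = 0.6745 * abs_dev / mad

    return modified_z > threshold


series = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # (time, pixel)
mask = mad_outliers(series)
cleaned = np.where(mask, np.nan, series)  # set flagged values to NaN
print(mask[:, 0])
```

Because each variable is screened on its own, a point that is extreme in value but sits exactly on the regression line would still be flagged, which is the limitation noted above.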

cbur24 · 2024-05-30T01:05:29Z

@robbibt I did some basic testing of this function today using a couple of janky non-datacube netcdfs. The function works well (and fast!), with the exception that the lag parameter was failing. The following code:

reg = xr_regression(ndvi, vpd, dim='time', lag_y=1)

results in this error:

ValueError: conflicting sizes for dimension 'latitude': length 1 on <this-array> and length 680 on {'longitude': 'longitude', 'latitude': 'latitude', 'time': 'time'} at line 946: cov = ((x - xmean) * (y - ymean)).sum(dim=dim) / (n)

Weirdly, when I run the same lagged function call but with dask, I don't receive an error but the result is an all-NaN array. I'm guessing it maybe has something to do with dimension alignment.

The input xarray datasets ndvi and vpd look like this:

robbibt · 2024-05-30T01:29:21Z

Thanks @cbur24! I'll admit that the lag functionality was the one bit I didn't test in this re-write. 🙃 Good catch! I'll see if I can reproduce this on my end, otherwise I might grab the files you're using if they're sharable (will let you know). If the lag stuff proves too complicated, I'm tempted to just leave it out of the function and let users handle that themselves outside of the func.

robbibt · 2024-05-30T07:49:57Z

Thanks for the great feedback @cbur24! There are a few issues with the current code (most importantly, dropna in the lag code would drop every timestep with any nodata... even a single pixel!), but the more I think about this, the more I think this functionality doesn't belong in xr_regression... it's not super clear exactly how the dimensions are aligned after lagging, and I think there's a very high risk of accidentally producing results that are invalid and yet really opaque and difficult to troubleshoot.

I think it's probably best that users apply lags externally to the function, and make sure their datasets are perfectly compatible and ready for comparison before passing them in. 🙂
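One way to apply a lag outside the function is sketched below in plain NumPy: pair each step of `x` with the step of `y` that falls `lag` steps later, and trim the unmatched ends so both arrays keep identical shapes. The helper name `lag_pairs` and this particular lag convention are assumptions for illustration; they are not taken from the removed `lag_y` code.

```python
import numpy as np


def lag_pairs(x, y, lag=1):
    """Align x[t] with y[t + lag] along axis 0, trimming unmatched steps.

    Hypothetical helper: the returned arrays always share the same length
    along axis 0, so they can be compared step by step.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same length along axis 0")
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    if lag == 0:
        return x, y
    return x[:-lag], y[lag:]


# Works the same for 1D series or (time, y, x) cubes
x = np.arange(5)
y = np.arange(5) * 10
x_lagged, y_lagged = lag_pairs(x, y, lag=1)
print(x_lagged, y_lagged)
```

Doing this explicitly keeps the alignment visible to the user, and the trimmed arrays can then be passed to a regression routine as an ordinary pair of matching inputs.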

cbur24

Your rationale for dropping the lag functionality makes total sense to me @robbibt; I agree it's preferable to have the user decide the nature of the inputs to xr_regression.

Just on dropna(), I believe you can pass how='any' or how='all' to control whether labels along the dimension are dropped when they contain any NaNs, or only when they are all NaN.
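The difference between the two settings can be shown with a small NumPy sketch that builds the equivalent keep-masks by hand for a (time, y, x) array. This only imitates the semantics of `dropna(dim="time", how=...)`; the array contents are made up for the example.

```python
import numpy as np

# Four timesteps of a 2 x 2 pixel grid
data = np.ones((4, 2, 2))
data[1, 0, 0] = np.nan  # timestep 1: a single missing pixel
data[3] = np.nan        # timestep 3: every pixel missing

nan_mask = np.isnan(data)

# how="any": drop a timestep if at least one pixel is NaN
keep_any = ~nan_mask.any(axis=(1, 2))

# how="all": drop a timestep only if every pixel is NaN
keep_all = ~nan_mask.all(axis=(1, 2))

print(keep_any, data[keep_any].shape)
print(keep_all, data[keep_all].shape)
```

With the "any" rule the single missing pixel costs the whole of timestep 1, which is the behaviour described above as dropping every timestep with any nodata.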

* Add updated `xr_regression` function for multi-dimensional linear regression (#1226)
  * Add updated xr_regression function
  * Add dask support for lazy computation
  * Set dtypes
  * Update docstring
  * Update docstring
  * Add MAD outliers
  * Update docstring
  * Remove lag functionality
  * Update docstrings
  * Add better error handling
  * Update stream gauge corr notebook to use new func
* Adding DEAfrica Wetland Turbidity notebook for Australian study site (#1175)
  * Adding DEAfrica Wetland Turbidity notebook for Australian study site
  * change all instances of NDTI to NDTI2 to reflect usage at top of notebook
  * update notebook to use Collection 3 WO Statistics
  * rerun notebook

  ---------

  Co-authored-by: BexDunn <bex.dunn@ga.gov.au>
* Add spatial interpolation with `xr_interpolate` notebook (#1233)
  * Add ensemble tide modelling functionality to model_tides
  * Update test_coastal.py
  * Remove test
  * Updates to IDW, xr_interpolate and ensemble tide modelling code"
  * Doco updates
  * Switch ensemble rankings from high to low = good
  * Update docstring
  * Fix doco
  * Add interpolation notebook
  * Remove coastal files from branch
  * Add points data
  * Review feedback;
  * Add p param to IDW
  * Fix test
* Updates to product notebook Knowledge Hub links and DEA notebook content (#1221)
  * Move KH links into consistent alert box format
  * Update DEA notebook
  * Minor wording updates
  * Minor wording
  * Temporarily remove STAC notebook from tests
* Add ensemble tide modelling functionality to `model_tides` (#1231)
  * Add ensemble tide modelling functionality to model_tides
  * Update test_coastal.py
  * Remove test
  * Updates to IDW, xr_interpolate and ensemble tide modelling code"
  * Doco updates
  * Switch ensemble rankings from high to low = good
  * Update docstring
  * Fix doco
  * Add interpolation notebook

---------

Co-authored-by: Matt-dea <129345253+Matt-dea@users.noreply.github.com>
Co-authored-by: BexDunn <bex.dunn@ga.gov.au>

robbibt added 2 commits May 9, 2024 06:46

Add updated xr_regression function

0823057

Merge develop

d28bb0d

robbibt marked this pull request as ready for review May 9, 2024 07:33

robbibt requested review from BexDunn, erialC-P, uchchwhash, Kooie-cate, geoscience-aman, JM-GA, margaretharrison, vnewey, Ariana-B, amanda2099, supermarkion and erin-telfer as code owners May 9, 2024 07:33

robbibt added 4 commits May 10, 2024 01:22

Add dask support for lazy computation

e4bd2cb

Set dtypes

fe095c3

Update docstring

82f059e

Update docstring

458c9dc

robbibt added 2 commits May 21, 2024 00:13

Add MAD outliers

753b368

Merge

b23b6e6

robbibt requested a review from cbur24 May 21, 2024 00:15

Update docstring

ca13bda

Remove lag functionality

c950aca

robbibt added 3 commits May 30, 2024 06:07

Update docstrings

4efed51

Add better error handling

fcac765

Update stream gauge corr notebook to use new func

3065ec0

cbur24 approved these changes May 31, 2024

robbibt merged commit 40b532b into develop Jun 4, 2024
1 check passed

robbibt deleted the xr_regression branch June 4, 2024 04:54
