Add updated `xr_regression` function for multi-dimensional linear regression #1226
@robbibt Awesome! I was just looking for something like this function. A possible enhancement: any interest in including options for robust regression instead of OLS? This can be especially important for satellite time-series regression, where slopes can be influenced by outlier values. Theil-Sen slopes with a Mann-Kendall test are a good example of robust regression. SciPy `theilslopes`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.theilslopes.html An example wrapper implemented for xarray here: https://github.com/josuemtzmo/xarrayMannKendall/blob/master/xarrayMannKendall/xarrayMannKendall.py An example from my work implementing the above wrapper, go down to
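For illustration only, a per-pixel Theil-Sen slope could be sketched with `scipy.stats.theilslopes` and `xr.apply_ufunc`. This is not the linked wrapper or anything in this PR; the helper name `theil_sen_slope` and the toy data are assumptions made for the example.

```python
import numpy as np
import xarray as xr
from scipy import stats


def theil_sen_slope(y, x):
    """Theil-Sen slope for a single 1D series, ignoring non-finite values.

    Hypothetical helper for illustration; not part of this PR.
    """
    valid = np.isfinite(y) & np.isfinite(x)
    if valid.sum() < 2:
        return np.nan
    # theilslopes returns (slope, intercept, low_slope, high_slope)
    return stats.theilslopes(y[valid], x[valid])[0]


# Toy data: a trend of 0.5 per time step, with one large outlier in one pixel
time = np.arange(10)
values = 0.5 * time[:, None, None] + np.zeros((10, 2, 2))
values[3, 0, 0] = 100.0
da = xr.DataArray(values, coords={"time": time}, dims=("time", "y", "x"))

# vectorize=True loops over pixels in Python, so this will be much slower
# than fully vectorised array maths on large arrays
slope = xr.apply_ufunc(
    theil_sen_slope,
    da,
    da["time"],
    input_core_dims=[["time"], ["time"]],
    vectorize=True,
)
```

Because the slope is the median of all pairwise slopes, the single outlier does not shift the estimate for that pixel, but the per-pixel loop is the cost.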
Thanks @cbur24! Expanding this to include multiple regression methods would be super neat - we could have a simple param like The current implementation is designed to be almost completely vectorised array maths, which makes it really quick and non-memory hungry. It would be neat to be able to also do that across different regression methods - at first glance the
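As a rough sketch of what "vectorised array maths" can look like for OLS (an illustration under stated assumptions, not the code in this PR), the slope, intercept and correlation can be computed for every pixel at once from means, variances and covariances along `time`. The variable names and random inputs are made up, and the sketch assumes there are no NaNs.

```python
import numpy as np
import xarray as xr
from scipy import stats

rng = np.random.default_rng(0)
time = np.arange(20)
dims = ("time", "y", "x")

# Hypothetical inputs: a 3D predictor and a noisy linear response
predictor = xr.DataArray(
    rng.normal(size=(20, 3, 4)), coords={"time": time}, dims=dims
)
noise = xr.DataArray(
    rng.normal(scale=0.1, size=(20, 3, 4)), coords={"time": time}, dims=dims
)
response = 2.0 * predictor + 1.0 + noise

# Per-pixel statistics along time, with no Python loop over pixels
n = predictor.sizes["time"]
x_mean = predictor.mean("time")
y_mean = response.mean("time")
cov = ((predictor - x_mean) * (response - y_mean)).sum("time") / n
x_var = ((predictor - x_mean) ** 2).sum("time") / n
y_var = ((response - y_mean) ** 2).sum("time") / n

slope = cov / x_var
intercept = y_mean - slope * x_mean
r = cov / np.sqrt(x_var * y_var)

# Spot check a single pixel against scipy.stats.linregress
ref = stats.linregress(predictor[:, 1, 2].values, response[:, 1, 2].values)
```

The same expressions also work when one input is 1D along `time`, because xarray broadcasts it against the 3D input.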
Totally agree that it would be a complex addition, and from reading how seaborn/pandas does robust regression, it seems as though most robust regression techniques are slow and/or memory intensive. In light of that, having the option for outlier detection and removal sounds like a great (lightweight) addition that would do 95% of what robust regression does, but without the overhead.
@cbur24 Have added a simple MAD outlier detection function and params in
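For readers unfamiliar with MAD (median absolute deviation) outlier detection, here is a minimal NumPy sketch. The function name `mad_outliers`, the 0.6745 scaling constant and the 3.5 threshold follow a commonly used modified z-score convention; they are assumptions for this example and may differ from what the PR implements.

```python
import numpy as np


def mad_outliers(values, threshold=3.5, axis=0):
    """Return a boolean mask that is True where a value is a MAD outlier.

    Hypothetical helper: uses the modified z-score
    0.6745 * |x - median| / MAD, computed along `axis`.
    """
    median = np.nanmedian(values, axis=axis, keepdims=True)
    abs_dev = np.abs(values - median)
    mad = np.nanmedian(abs_dev, axis=axis, keepdims=True)
    # A MAD of zero would divide by zero; silence the warning and let
    # those positions compare as inf or NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        modified_z = 0.6745 * abs_dev / mad
    return modified_z > threshold


# Two toy time series (columns); the first contains one extreme value
values = np.array(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [100.0, 5.0]]
)
mask = mad_outliers(values, axis=0)

# Replace outliers with NaN before running a regression
cleaned = np.where(mask, np.nan, values)
```

Masking outliers to NaN keeps the array shape unchanged, so the cleaned data can still be fed to vectorised array maths.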
@robbibt I did some basic testing of this function today using a couple of janky non-datacube netcdfs. The function works well (and fast!), with the exception that the lag parameter was failing. The following code:
results in this error:
Weirdly, when I run the same lagged function call but with dask, I don't receive an error but the result is an all-NaN array. I'm guessing it maybe has something to do with dimension alignment. The input xarray datasets
Thanks @cbur24! I'll admit that the lag functionality was the one bit I didn't test in this re-write. 🙃 Good catch! I'll see if I can reproduce this on my end, otherwise I might grab the files you're using if they're sharable (will let you know). If the lag stuff proves too complicated, I'm tempted to just leave it out of the function and let users handle that themselves outside of the func.
Thanks for the great feedback @cbur24! There are a few issues with the current code (most importantly, I think it's probably best that users apply lags externally to the function, and make sure their datasets are perfectly compatible and ready for comparison before passing them in. 🙂
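One way a user might apply a lag outside the function is sketched below. It assumes both inputs share the same `time` coordinate; the variable names and values are made up. xarray arithmetic aligns inputs on coordinate labels, so shifting the values with `.shift()` while keeping the coordinate unchanged avoids mismatched labels, though that may or may not be the cause of the error reported above.

```python
import numpy as np
import xarray as xr

time = np.arange(6)

# Hypothetical series: `flow` responds to `rain` one time step later
rain = xr.DataArray(
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], coords={"time": time}, dims="time"
)
flow = xr.DataArray(
    [9.0, 0.0, 1.0, 2.0, 3.0, 4.0], coords={"time": time}, dims="time"
)

# Shift values forward by one step; the time coordinate stays the same
# and the first value becomes NaN
lag = 1
rain_lagged = rain.shift(time=lag)

# Keep only the time steps where both inputs are valid
valid = rain_lagged.notnull() & flow.notnull()
rain_clean = rain_lagged.where(valid, drop=True)
flow_clean = flow.where(valid, drop=True)

# The two arrays now share identical time labels and can be compared
corr = xr.corr(rain_clean, flow_clean, dim="time")
```

After this step the inputs have the same length and labels, which is the "perfectly compatible" state described above.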
Your rationale for dropping the lag functionality makes total sense to me @robbibt; I agree it's preferable to have the user decide the nature of the inputs to `xr_regression`.
Just on `dropna()`, I believe you can specify a `how='all'` or `how='any'` parameter to decide whether to drop dimensions where there are 'any' or 'all' NaNs.
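A small example of the difference, using xarray's `DataArray.dropna` on made-up data:

```python
import numpy as np
import xarray as xr

# Column 0 has no NaNs, column 1 has one NaN, column 2 is all NaN
da = xr.DataArray(
    [
        [1.0, np.nan, np.nan],
        [2.0, 5.0, np.nan],
        [3.0, 6.0, np.nan],
    ],
    dims=("time", "x"),
)

# how="any": drop an x position if ANY value along the other dims is NaN
any_dropped = da.dropna(dim="x", how="any")

# how="all": drop an x position only if ALL of its values are NaN
all_dropped = da.dropna(dim="x", how="all")
```

Here `how="any"` keeps only the first column, while `how="all"` keeps the first two.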
* Add updated `xr_regression` function for multi-dimensional linear regression (#1226)
  * Add updated xr_regression function
  * Add dask support for lazy computation
  * Set dtypes
  * Update docstring
  * Update docstring
  * Add MAD outliers
  * Update docstring
  * Remove lag functionality
  * Update docstrings
  * Add better error handling
  * Update stream gauge corr notebook to use new func
* Adding DEAfrica Wetland Turbidity notebook for Australian study site (#1175)
  * Adding DEAfrica Wetland Turbidity notebook for Australian study site
  * change all instances of NDTI to NDTI2 to reflect usage at top of notebook
  * update notebook to use Collection 3 WO Statistics
  * rerun notebook
  ---------
  Co-authored-by: BexDunn <bex.dunn@ga.gov.au>
* Add spatial interpolation with `xr_interpolate` notebook (#1233)
  * Add ensemble tide modelling functionality to model_tides
  * Update test_coastal.py
  * Remove test
  * Updates to IDW, xr_interpolate and ensemble tide modelling code"
  * Doco updates
  * Switch ensemble rankings from high to low = good
  * Update docstring
  * Fix doco
  * Add interpolation notebook
  * Remove coastal files from branch
  * Add points data
  * Review feedback;
  * Add p param to IDW
  * Fix test
* Updates to product notebook Knowledge Hub links and DEA notebook content (#1221)
  * Move KH links into consistent alert box format
  * Update DEA notebook
  * Minor wording updates
  * Minor wording
  * Temporarily remove STAC notebook from tests
* Add ensemble tide modelling functionality to `model_tides` (#1231)
  * Add ensemble tide modelling functionality to model_tides
  * Update test_coastal.py
  * Remove test
  * Updates to IDW, xr_interpolate and ensemble tide modelling code"
  * Doco updates
  * Switch ensemble rankings from high to low = good
  * Update docstring
  * Fix doco
  * Add interpolation notebook
---------
Co-authored-by: Matt-dea <129345253+Matt-dea@users.noreply.github.com>
Co-authored-by: BexDunn <bex.dunn@ga.gov.au>
Proposed changes
This PR updates the older `lag_linregress_3d` function into a new and improved `xr_regression` function for calculating useful regression statistics between two multi-dimensional xarray datasets, including:
For example, the function can be used to calculate regressions between two 3D datasets (e.g. time, x, y), or between a 3D (time, x, y) dataset and a 1D dataset (time):
This PR includes tests that verify that the results produced by this function are identical to those produced by the `scipy.stats.linregress` function (including for "two-sided", "less" and "greater" alternative hypotheses).
Checklist
(Replace `[ ]` with `[x]` to check off)
`Load packages`
`General advice`
The `jupyterlab_code_formatter` tool can be used to format code cells to a consistent style: select each code cell, then click `Edit` and then one of the `Apply X Formatter` options (`YAPF` or `Black` are recommended).
`NCI` and `DEA Sandbox` (flag if not working as part of PR and ask for help to solve if needed)
`Notebook currently compatible with the NCI|DEA Sandbox environment only` line below the notebook title to reflect the environments the notebook is compatible with