diff --git a/doc/source/intro_robuststats.md b/doc/source/intro_robuststats.md index 98e6ef5a..ea880a47 100644 --- a/doc/source/intro_robuststats.md +++ b/doc/source/intro_robuststats.md @@ -2,21 +2,17 @@ # The need for robust statistics -Digital Elevation Models often contain outliers that hamper further analysis. -In order to mitigate their effect on DEM analysis, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics) -methods at different levels. -These methods can be used to robustly fit functions necessary to perform DEM alignment (see {ref}`coregistration`, {ref}`biascorr`), or to provide -robust statistical measures equivalent to the mean, the standard deviation or the covariance of a sample when analyzing DEM precision with -{ref}`spatialstats`. +Digital elevation models often contain outliers that can be traced back to instrument acquisition or processing artefacts, and which hamper further analysis. + +In order to mitigate their effect, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics) at different levels: +- Robust optimizers for the fitting of parametric models during {ref}`coregistration` and {ref}`biascorr`, +- Robust measures for the central tendency (e.g., mean) and dispersion (e.g., standard deviation), to evaluate DEM quality and converge during {ref}`coregistration`, +- Robust measures for estimating spatial autocorrelation for uncertainty analysis in {ref}`spatialstats`. Yet, there is a downside to robust statistical measures. Those can yield less precise estimates for small samples sizes and, -in some cases, hide patterns inherent to the data. This is why, when outliers exhibit idenfiable patterns, it is better +in some cases, hide patterns inherent to the data. This is why, when outliers show identifiable patterns, it is better to first resort to outlier filtering (see {ref}`filters`) and perform analysis using traditional statistical measures. -```{contents} Contents -:local: true -``` - (robuststats-meanstd)= ## Measures of central tendency and dispersion @@ -38,8 +34,11 @@ The median is used by default in the alignment routines of {ref}`coregistration` The [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion) represents the spread of a sample, and is core to the analysis of sample precision (see {ref}`intro`). It is typically measured by the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). -However, very much like the mean, the standard deviation is a measure sensitive to outliers. The median equivalent of a -standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a +However, very much like the mean, the standard deviation is a measure sensitive to outliers. + +(robuststats-nmad)= + +The median equivalent of a standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a normal distribution. It has been shown to provide more robust measures of dispersion with outliers when working with DEMs (e.g., [Höhle and Höhle (2009)](https://doi.org/10.1016/j.isprsjprs.2009.02.003)). It is defined as: @@ -50,35 +49,29 @@ $$ where $x$ is the sample. -The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile -can also be used as a robust dispersion measure equivalent to the standard deviation. - ```python nmad = xdem.spatialstats.nmad ``` +```{note} +The NMAD estimator has a good synergy with {ref}`Dowd's variogram` for spatial autocorrelation, as their median-based measure of dispersion is the same. +``` + +The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile +can also be used as a robust dispersion measure equivalent to the standard deviation. When working with weighted data, the difference between the 84{sup}`th` and 16{sup}`th` [weighted percentile](https://en.wikipedia.org/wiki/Percentile#Weighted_percentile), or the absolute 68{sup}`th` weighted percentile can be used as a robust measure of dispersion. +```{important} The NMAD is used by default for estimating elevation measurement errors in {ref}`spatialstats`. +``` (robuststats-corr)= -## Measures of correlation - -### Correlation between samples - -The [covariance](https://en.wikipedia.org/wiki/Covariance) is the measure generally used to estimate the joint variability -of samples, often normalized to a [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). -Again, the variance and covariance are sensitive measures to outliers. It is therefore preferable to compute such measures -by filtering the data, or using robust estimators. - -TODO - -### Spatial auto-correlation of a sample +## Measures of spatial autocorrelation [Variogram](https://en.wikipedia.org/wiki/Variogram) analysis exploits statistical measures equivalent to the covariance, and is therefore also subject to outliers. -Based on [scikit-gstat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows to specify robust variogram +Based on [SciKit-GStat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows to specify robust variogram estimators such as Dowd's variogram based on medians ([Dowd (1984)](https://en.wikipedia.org/wiki/Variogram)) defined as: $$ @@ -87,7 +80,15 @@ $$ where $h$ is the spatial lag and $Z_{x_{i}}$ is the value of the sample at the location $x_{i}$. +```{note} +Dowd's estimator has a good synergy with the {ref}`NMAD` for estimating the dispersion of the full sample, as their median-based measure of dispersion is the same (2.198 is the square of 1.4826). +``` + +Other estimators can be chosen from [SciKit-GStat's list of estimators](https://scikit-gstat.readthedocs.io/en/latest/reference/estimator.html). + +```{important} Dowd's variogram is used by default to estimate spatial auto-correlation of elevation measurement errors in {ref}`spatialstats`. +``` (robuststats-regression)= diff --git a/examples/advanced/plot_standardization.py b/examples/advanced/plot_standardization.py index c820dfdc..5a021f95 100644 --- a/examples/advanced/plot_standardization.py +++ b/examples/advanced/plot_standardization.py @@ -98,11 +98,12 @@ z_dh.data[np.abs(z_dh.data) > 4] = np.nan # %% -# We perform a scale-correction for the standardization, to ensure that the standard deviation of the data is exactly 1. -print(f"Standard deviation before scale-correction: {nmad(z_dh.data):.1f}") +# We perform a scale-correction for the standardization, to ensure that the spread of the data is exactly 1. +# The NMAD is used as a robust measure for the spread (see :ref:`robuststats-nmad`). +print(f"NMAD before scale-correction: {nmad(z_dh.data):.1f}") scale_fac_std = nmad(z_dh.data) z_dh = z_dh / scale_fac_std -print(f"Standard deviation after scale-correction: {nmad(z_dh.data):.1f}") +print(f"NMAD after scale-correction: {nmad(z_dh.data):.1f}") plt.figure(figsize=(8, 5)) plt_extent = [ @@ -123,8 +124,9 @@ # %% # Now, we can perform an analysis of spatial correlation as shown in the :ref:`sphx_glr_advanced_examples_plot_variogram_estimation_modelling.py` # example, by estimating a variogram and fitting a sum of two models. +# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`). df_vgm = xdem.spatialstats.sample_empirical_variogram( - values=z_dh.data.squeeze(), gsd=dh.res[0], subsample=300, n_variograms=10, random_state=42 + values=z_dh.data.squeeze(), gsd=dh.res[0], subsample=300, n_variograms=10, estimator="dowd", random_state=42, ) func_sum_vgm, params_vgm = xdem.spatialstats.fit_sum_model_variogram( diff --git a/examples/advanced/plot_variogram_estimation_modelling.py b/examples/advanced/plot_variogram_estimation_modelling.py index 6912c687..230471ec 100644 --- a/examples/advanced/plot_variogram_estimation_modelling.py +++ b/examples/advanced/plot_variogram_estimation_modelling.py @@ -25,6 +25,7 @@ import numpy as np import xdem +from xdem.spatialstats import nmad # %% # We load example files. @@ -75,9 +76,10 @@ # To perform this procedure effectively, we use improved methods that provide efficient pairwise sampling methods for # large grid data in `scikit-gstat `_, which are encapsulated # conveniently by :func:`xdem.spatialstats.sample_empirical_variogram`: +# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`). df = xdem.spatialstats.sample_empirical_variogram( - values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, random_state=42 + values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, estimator="dowd", random_state=42 ) # %% @@ -151,8 +153,8 @@ neff_doublerange = xdem.spatialstats.number_effective_samples(area, params_vgm2) # Convert into a standard error - stderr_singlerange = np.nanstd(dh.data) / np.sqrt(neff_singlerange) - stderr_doublerange = np.nanstd(dh.data) / np.sqrt(neff_doublerange) + stderr_singlerange = nmad(dh.data) / np.sqrt(neff_singlerange) + stderr_doublerange = nmad(dh.data) / np.sqrt(neff_doublerange) list_stderr_singlerange.append(stderr_singlerange) list_stderr_doublerange.append(stderr_doublerange)