Use NMAD consistently in examples and clarify link to Dowd's variogram in doc (#390)

* Clarify Dowd's variogram synergy with NMAD and put a link into the advanced examples

* Add exception for Dowd's variogram for scikit-gstat versions lower than or equal to 1.0.0

* Linting

* Fix hyperlink in doc

* Linting
rhugonnet authored Aug 2, 2023
1 parent db5e57b commit df2ffe2
Showing 4 changed files with 57 additions and 36 deletions.
59 changes: 30 additions & 29 deletions doc/source/intro_robuststats.md
@@ -2,21 +2,17 @@

# The need for robust statistics

Digital Elevation Models often contain outliers that hamper further analysis.
In order to mitigate their effect on DEM analysis, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics)
methods at different levels.
These methods can be used to robustly fit functions necessary to perform DEM alignment (see {ref}`coregistration`, {ref}`biascorr`), or to provide
robust statistical measures equivalent to the mean, the standard deviation or the covariance of a sample when analyzing DEM precision with
{ref}`spatialstats`.
Digital elevation models often contain outliers that can be traced back to instrument acquisition or processing artefacts, and which hamper further analysis.

In order to mitigate their effect, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics) at different levels:
- Robust optimizers for the fitting of parametric models during {ref}`coregistration` and {ref}`biascorr`,
- Robust measures for the central tendency (e.g., mean) and dispersion (e.g., standard deviation), to evaluate DEM quality and assess convergence during {ref}`coregistration`,
- Robust measures for estimating spatial autocorrelation for uncertainty analysis in {ref}`spatialstats`.

Yet, there is a downside to robust statistical measures. These can yield less precise estimates for small sample sizes and,
in some cases, hide patterns inherent to the data. This is why, when outliers exhibit idenfiable patterns, it is better
in some cases, hide patterns inherent to the data. This is why, when outliers show identifiable patterns, it is better
to first resort to outlier filtering (see {ref}`filters`) and perform analysis using traditional statistical measures.

```{contents} Contents
:local: true
```

(robuststats-meanstd)=

## Measures of central tendency and dispersion
@@ -34,12 +30,15 @@
used as a robust measure of central tendency.

The median is used by default in the alignment routines of {ref}`coregistration` and {ref}`biascorr`.

(robuststats-nmad)=

### Dispersion

The [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion) represents the spread of a sample,
and is core to the analysis of sample precision (see {ref}`intro`). It is typically measured by the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation).
However, very much like the mean, the standard deviation is a measure sensitive to outliers. The median equivalent of a
standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a
However, very much like the mean, the standard deviation is a measure sensitive to outliers.

The median equivalent of a standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a
normal distribution. It has been shown to provide a more robust measure of dispersion in the presence of outliers when working
with DEMs (e.g., [Höhle and Höhle (2009)](https://doi.org/10.1016/j.isprsjprs.2009.02.003)).
It is defined as:
@@ -50,35 +49,29 @@
$$

where $x$ is the sample.

The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile
can also be used as a robust dispersion measure equivalent to the standard deviation.

```python
nmad = xdem.spatialstats.nmad
```

```{note}
The NMAD estimator has a good synergy with {ref}`Dowd's variogram<robuststats-corr>` for spatial autocorrelation, as their median-based measure of dispersion is the same.
```

The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile
can also be used as a robust dispersion measure equivalent to the standard deviation.
When working with weighted data, the difference between the 84{sup}`th` and 16{sup}`th` [weighted percentile](https://en.wikipedia.org/wiki/Percentile#Weighted_percentile), or the absolute 68{sup}`th` weighted percentile can be used as a robust measure of dispersion.
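
As a purely illustrative sketch (synthetic data, not part of the original documentation), the robust measures above can be compared against the standard deviation on a sample contaminated by outliers:

```python
# Illustrative comparison of dispersion measures on a synthetic sample
# contaminated by outliers (values are made up for demonstration).
import numpy as np
import xdem

rng = np.random.default_rng(42)
sample = rng.normal(0, 1, 10_000)
sample[:100] += 50  # inject 1% of gross outliers

print(f"STD:  {np.std(sample):.2f}")  # inflated by the outliers
print(f"NMAD: {xdem.spatialstats.nmad(sample):.2f}")  # stays close to 1
print(f"Half inter-percentile spread: {(np.percentile(sample, 84) - np.percentile(sample, 16)) / 2:.2f}")
```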

```{important}
The NMAD is used by default for estimating elevation measurement errors in {ref}`spatialstats`.
```

(robuststats-corr)=

## Measures of correlation

### Correlation between samples

The [covariance](https://en.wikipedia.org/wiki/Covariance) is the measure generally used to estimate the joint variability
of samples, often normalized to a [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
Again, the variance and covariance are measures sensitive to outliers. It is therefore preferable to compute such measures
after filtering the data, or by using robust estimators.

TODO
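
As a purely hypothetical illustration (not necessarily the estimator xDEM will adopt), a rank-based coefficient such as Spearman's correlation is far less affected by outliers than the covariance-based Pearson coefficient:

```python
# Hypothetical illustration only: rank-based (Spearman) vs covariance-based
# (Pearson) correlation on data with a few gross outliers.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 0.1, 1000)
y[:10] += 100  # a handful of gross outliers

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.2f}")   # strongly degraded by the outliers
print(f"Spearman: {r_spearman:.2f}")  # remains close to 1
```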

### Spatial auto-correlation of a sample
## Measures of spatial autocorrelation

[Variogram](https://en.wikipedia.org/wiki/Variogram) analysis exploits statistical measures equivalent to the covariance,
and is therefore also subject to outliers.
Based on [scikit-gstat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows to specify robust variogram
Based on [SciKit-GStat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows specifying robust variogram
estimators such as Dowd's variogram based on medians ([Dowd (1984)](https://en.wikipedia.org/wiki/Variogram)) defined as:

$$
@@ -87,7 +80,15 @@
$$

where $h$ is the spatial lag and $Z_{x_{i}}$ is the value of the sample at the location $x_{i}$.

```{note}
Dowd's estimator has a good synergy with the {ref}`NMAD<robuststats-nmad>` for estimating the dispersion of the full sample, as their median-based measure of dispersion is the same (2.198 is the square of 1.4826).
```
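
As a quick numeric check of this note (illustrative only, on synthetic 1-D data, and assuming Dowd's estimator in the form $2.198 \cdot \textrm{median}(|dz|)^{2}$ for the lag differences $dz$):

```python
# Illustrative check (synthetic data): with 2.198 = 1.4826**2, Dowd's
# estimate of 2*gamma(h) matches the squared NMAD of the lag differences
# dz when their median is close to zero.
import numpy as np
import xdem

rng = np.random.default_rng(0)
z = rng.normal(0, 1, 10_000)
lag = 5
dz = z[lag:] - z[:-lag]  # differences at a fixed (1-D) lag

dowd_2gamma = 2.198 * np.median(np.abs(dz)) ** 2
print(f"{dowd_2gamma:.3f} vs {xdem.spatialstats.nmad(dz) ** 2:.3f}")  # nearly identical
```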

Other estimators can be chosen from [SciKit-GStat's list of estimators](https://scikit-gstat.readthedocs.io/en/latest/reference/estimator.html).

```{important}
Dowd's variogram is used by default to estimate the spatial autocorrelation of elevation measurement errors in {ref}`spatialstats`.
```
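
For reference, a minimal sketch of passing Dowd's estimator explicitly to the empirical variogram sampling (on a synthetic stand-in grid; in practice, an elevation-difference raster is used, as in the examples updated by this commit):

```python
# Minimal sketch: Dowd's estimator requested explicitly when sampling an
# empirical variogram (synthetic stand-in grid and ground sampling distance).
import numpy as np
import xdem

rng = np.random.default_rng(42)
values = rng.normal(0, 1, (200, 200))  # stand-in for an elevation-difference grid
df = xdem.spatialstats.sample_empirical_variogram(
    values=values, gsd=30.0, subsample=100, n_variograms=10, estimator="dowd", random_state=42
)
```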

(robuststats-regression)=

15 changes: 11 additions & 4 deletions examples/advanced/plot_standardization.py
@@ -98,11 +98,12 @@
z_dh.data[np.abs(z_dh.data) > 4] = np.nan

# %%
# We perform a scale-correction for the standardization, to ensure that the standard deviation of the data is exactly 1.
print(f"Standard deviation before scale-correction: {nmad(z_dh.data):.1f}")
# We perform a scale-correction for the standardization, to ensure that the spread of the data is exactly 1.
# The NMAD is used as a robust measure for the spread (see :ref:`robuststats-nmad`).
print(f"NMAD before scale-correction: {nmad(z_dh.data):.1f}")
scale_fac_std = nmad(z_dh.data)
z_dh = z_dh / scale_fac_std
print(f"Standard deviation after scale-correction: {nmad(z_dh.data):.1f}")
print(f"NMAD after scale-correction: {nmad(z_dh.data):.1f}")

plt.figure(figsize=(8, 5))
plt_extent = [
@@ -123,8 +124,14 @@
# %%
# Now, we can perform an analysis of spatial correlation as shown in the :ref:`sphx_glr_advanced_examples_plot_variogram_estimation_modelling.py`
# example, by estimating a variogram and fitting a sum of two models.
# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`).
df_vgm = xdem.spatialstats.sample_empirical_variogram(
values=z_dh.data.squeeze(), gsd=dh.res[0], subsample=300, n_variograms=10, random_state=42
values=z_dh.data.squeeze(),
gsd=dh.res[0],
subsample=300,
n_variograms=10,
estimator="dowd",
random_state=42,
)

func_sum_vgm, params_vgm = xdem.spatialstats.fit_sum_model_variogram(
8 changes: 5 additions & 3 deletions examples/advanced/plot_variogram_estimation_modelling.py
@@ -25,6 +25,7 @@
import numpy as np

import xdem
from xdem.spatialstats import nmad

# %%
# We load example files.
@@ -75,9 +76,10 @@
# To perform this procedure effectively, we use improved methods that provide efficient pairwise sampling for
# large grid data in `scikit-gstat <https://mmaelicke.github.io/scikit-gstat/index.html>`_, which are encapsulated
# conveniently by :func:`xdem.spatialstats.sample_empirical_variogram`:
# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`).

df = xdem.spatialstats.sample_empirical_variogram(
values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, random_state=42
values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, estimator="dowd", random_state=42
)

# %%
@@ -151,8 +153,8 @@
neff_doublerange = xdem.spatialstats.number_effective_samples(area, params_vgm2)

# Convert into a standard error
stderr_singlerange = np.nanstd(dh.data) / np.sqrt(neff_singlerange)
stderr_doublerange = np.nanstd(dh.data) / np.sqrt(neff_doublerange)
stderr_singlerange = nmad(dh.data) / np.sqrt(neff_singlerange)
stderr_doublerange = nmad(dh.data) / np.sqrt(neff_doublerange)
list_stderr_singlerange.append(stderr_singlerange)
list_stderr_doublerange.append(stderr_doublerange)

11 changes: 11 additions & 0 deletions xdem/spatialstats.py
@@ -1520,6 +1520,17 @@ def sample_empirical_variogram(
df_mean["count"] = df_count["count"]
df = df_mean

# Fix variance error for Dowd's variogram in SciKit-GStat

# If skgstat > 1.0, we can use Dowd's without correcting, otherwise we correct
from packaging.version import Version

if Version(skg.__version__) <= Version("1.0.0"):
if "estimator" in kwargs.keys() and kwargs["estimator"] == "dowd":
# Correction: we divide all experimental variance values by 2
df.exp.values /= 2
df.err_exp.values /= 2

# Remove the last spatial lag bin which is always undersampled
df.drop(df.tail(1).index, inplace=True)

