Use NMAD consistently in examples and clarify link to Dowd's variogram in doc (#390)

* Clarify Dowd's variogram synergy with NMAD and put a link into the advanced examples

* Add exception for Dowd's variogram for scikit-gstat versions lower than or equal to 1.0.0

* Linting

* Fix hyperlink in doc

* Linting
rhugonnet authored Aug 2, 2023
1 parent db5e57b commit df2ffe2
Showing 4 changed files with 57 additions and 36 deletions.
59 changes: 30 additions & 29 deletions doc/source/intro_robuststats.md
@@ -2,21 +2,17 @@

# The need for robust statistics

Digital Elevation Models often contain outliers that hamper further analysis.
In order to mitigate their effect on DEM analysis, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics)
methods at different levels.
These methods can be used to robustly fit functions necessary to perform DEM alignment (see {ref}`coregistration`, {ref}`biascorr`), or to provide
robust statistical measures equivalent to the mean, the standard deviation or the covariance of a sample when analyzing DEM precision with
{ref}`spatialstats`.
Digital elevation models often contain outliers that can be traced back to instrument acquisition or processing artefacts, and which hamper further analysis.

In order to mitigate their effect, xDEM integrates [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics) at different levels:
- Robust optimizers for the fitting of parametric models during {ref}`coregistration` and {ref}`biascorr`,
- Robust measures for the central tendency (e.g., mean) and dispersion (e.g., standard deviation), to evaluate DEM quality and assess convergence during {ref}`coregistration`,
- Robust measures for estimating spatial autocorrelation for uncertainty analysis in {ref}`spatialstats`.

Yet, there is a downside to robust statistical measures. These can yield less precise estimates for small sample sizes and,
in some cases, hide patterns inherent to the data. This is why, when outliers exhibit idenfiable patterns, it is better
in some cases, hide patterns inherent to the data. This is why, when outliers show identifiable patterns, it is better
to first resort to outlier filtering (see {ref}`filters`) and perform analysis using traditional statistical measures.

```{contents} Contents
:local: true
```

(robuststats-meanstd)=

## Measures of central tendency and dispersion
@@ -34,12 +30,15 @@
used as a robust measure of central tendency.

The median is used by default in the alignment routines of {ref}`coregistration` and {ref}`biascorr`.

(robuststats-nmad)=

### Dispersion

The [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion) represents the spread of a sample,
and is core to the analysis of sample precision (see {ref}`intro`). It is typically measured by the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation).
However, very much like the mean, the standard deviation is a measure sensitive to outliers. The median equivalent of a
standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a
However, very much like the mean, the standard deviation is a measure sensitive to outliers.

The median equivalent of a standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) scaled by a factor of ~1.4826 to match the dispersion of a
normal distribution. It has been shown to provide a more robust measure of dispersion in the presence of outliers when working
with DEMs (e.g., [Höhle and Höhle (2009)](https://doi.org/10.1016/j.isprsjprs.2009.02.003)).
It is defined as:
@@ -50,35 +49,29 @@
$$

where $x$ is the sample.

The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile
can also be used as a robust dispersion measure equivalent to the standard deviation.

```python
nmad = xdem.spatialstats.nmad
```

```{note}
The NMAD estimator has a good synergy with {ref}`Dowd's variogram<robuststats-corr>` for spatial autocorrelation, as their median-based measure of dispersion is the same.
```

The half difference between 84{sup}`th` and 16{sup}`th` percentiles, or the absolute 68{sup}`th` percentile
can also be used as a robust dispersion measure equivalent to the standard deviation.
When working with weighted data, the difference between the 84{sup}`th` and 16{sup}`th` [weighted percentile](https://en.wikipedia.org/wiki/Percentile#Weighted_percentile), or the absolute 68{sup}`th` weighted percentile can be used as a robust measure of dispersion.
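
As a purely illustrative sketch (synthetic data, not part of the original documentation), the robust measures above can be compared against the standard deviation on a sample contaminated by outliers:

```python
# Illustrative comparison of dispersion measures on a synthetic sample
# contaminated by outliers (values are made up for demonstration).
import numpy as np
import xdem

rng = np.random.default_rng(42)
sample = rng.normal(0, 1, 10_000)
sample[:100] += 50  # inject 1% of gross outliers

print(f"STD:  {np.std(sample):.2f}")  # inflated by the outliers
print(f"NMAD: {xdem.spatialstats.nmad(sample):.2f}")  # stays close to 1
print(f"Half inter-percentile spread: {(np.percentile(sample, 84) - np.percentile(sample, 16)) / 2:.2f}")
```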

```{important}
The NMAD is used by default for estimating elevation measurement errors in {ref}`spatialstats`.
```

(robuststats-corr)=

## Measures of correlation

### Correlation between samples

The [covariance](https://en.wikipedia.org/wiki/Covariance) is the measure generally used to estimate the joint variability
of samples, often normalized to a [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
Again, the variance and covariance are measures sensitive to outliers. It is therefore preferable to compute such measures
after filtering the data, or by using robust estimators.

TODO
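
As a purely hypothetical illustration (not necessarily the estimator xDEM will adopt), a rank-based coefficient such as Spearman's correlation is far less affected by outliers than the covariance-based Pearson coefficient:

```python
# Hypothetical illustration only: rank-based (Spearman) vs covariance-based
# (Pearson) correlation on data with a few gross outliers.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 0.1, 1000)
y[:10] += 100  # a handful of gross outliers

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.2f}")   # strongly degraded by the outliers
print(f"Spearman: {r_spearman:.2f}")  # remains close to 1
```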

### Spatial auto-correlation of a sample
## Measures of spatial autocorrelation

[Variogram](https://en.wikipedia.org/wiki/Variogram) analysis exploits statistical measures equivalent to the covariance,
and is therefore also subject to outliers.
Based on [scikit-gstat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows to specify robust variogram
Based on [SciKit-GStat](https://mmaelicke.github.io/scikit-gstat/index.html), xDEM allows specifying robust variogram
estimators such as Dowd's variogram based on medians ([Dowd (1984)](https://en.wikipedia.org/wiki/Variogram)) defined as:

$$
@@ -87,7 +80,15 @@
$$

where $h$ is the spatial lag and $Z_{x_{i}}$ is the value of the sample at the location $x_{i}$.

```{note}
Dowd's estimator has a good synergy with the {ref}`NMAD<robuststats-nmad>` for estimating the dispersion of the full sample, as their median-based measure of dispersion is the same (2.198 is the square of 1.4826).
```
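
As a quick numeric check of this note (illustrative only, on synthetic 1-D data, and assuming Dowd's estimator in the form $2.198 \cdot \textrm{median}(|dz|)^{2}$ for the lag differences $dz$):

```python
# Illustrative check (synthetic data): with 2.198 = 1.4826**2, Dowd's
# estimate of 2*gamma(h) matches the squared NMAD of the lag differences
# dz when their median is close to zero.
import numpy as np
import xdem

rng = np.random.default_rng(0)
z = rng.normal(0, 1, 10_000)
lag = 5
dz = z[lag:] - z[:-lag]  # differences at a fixed (1-D) lag

dowd_2gamma = 2.198 * np.median(np.abs(dz)) ** 2
print(f"{dowd_2gamma:.3f} vs {xdem.spatialstats.nmad(dz) ** 2:.3f}")  # nearly identical
```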

Other estimators can be chosen from [SciKit-GStat's list of estimators](https://scikit-gstat.readthedocs.io/en/latest/reference/estimator.html).

```{important}
Dowd's variogram is used by default to estimate the spatial autocorrelation of elevation measurement errors in {ref}`spatialstats`.
```
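
For reference, a minimal sketch of passing Dowd's estimator explicitly to the empirical variogram sampling (on a synthetic stand-in grid; in practice, an elevation-difference raster is used, as in the examples updated by this commit):

```python
# Minimal sketch: Dowd's estimator requested explicitly when sampling an
# empirical variogram (synthetic stand-in grid and ground sampling distance).
import numpy as np
import xdem

rng = np.random.default_rng(42)
values = rng.normal(0, 1, (200, 200))  # stand-in for an elevation-difference grid
df = xdem.spatialstats.sample_empirical_variogram(
    values=values, gsd=30.0, subsample=100, n_variograms=10, estimator="dowd", random_state=42
)
```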

(robuststats-regression)=

15 changes: 11 additions & 4 deletions examples/advanced/plot_standardization.py
@@ -98,11 +98,12 @@
z_dh.data[np.abs(z_dh.data) > 4] = np.nan

# %%
# We perform a scale-correction for the standardization, to ensure that the standard deviation of the data is exactly 1.
print(f"Standard deviation before scale-correction: {nmad(z_dh.data):.1f}")
# We perform a scale-correction for the standardization, to ensure that the spread of the data is exactly 1.
# The NMAD is used as a robust measure for the spread (see :ref:`robuststats-nmad`).
print(f"NMAD before scale-correction: {nmad(z_dh.data):.1f}")
scale_fac_std = nmad(z_dh.data)
z_dh = z_dh / scale_fac_std
print(f"Standard deviation after scale-correction: {nmad(z_dh.data):.1f}")
print(f"NMAD after scale-correction: {nmad(z_dh.data):.1f}")

plt.figure(figsize=(8, 5))
plt_extent = [
@@ -123,8 +124,14 @@
# %%
# Now, we can perform an analysis of spatial correlation as shown in the :ref:`sphx_glr_advanced_examples_plot_variogram_estimation_modelling.py`
# example, by estimating a variogram and fitting a sum of two models.
# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`).
df_vgm = xdem.spatialstats.sample_empirical_variogram(
values=z_dh.data.squeeze(), gsd=dh.res[0], subsample=300, n_variograms=10, random_state=42
values=z_dh.data.squeeze(),
gsd=dh.res[0],
subsample=300,
n_variograms=10,
estimator="dowd",
random_state=42,
)

func_sum_vgm, params_vgm = xdem.spatialstats.fit_sum_model_variogram(
8 changes: 5 additions & 3 deletions examples/advanced/plot_variogram_estimation_modelling.py
@@ -25,6 +25,7 @@
import numpy as np

import xdem
from xdem.spatialstats import nmad

# %%
# We load example files.
@@ -75,9 +76,10 @@
# To perform this procedure effectively, we use improved methods that provide efficient pairwise sampling for
# large grid data in `scikit-gstat <https://mmaelicke.github.io/scikit-gstat/index.html>`_, which are encapsulated
# conveniently by :func:`xdem.spatialstats.sample_empirical_variogram`:
# Dowd's variogram is used for robustness in conjunction with the NMAD (see :ref:`robuststats-corr`).

df = xdem.spatialstats.sample_empirical_variogram(
values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, random_state=42
values=dh.data, gsd=dh.res[0], subsample=100, n_variograms=10, estimator="dowd", random_state=42
)

# %%
@@ -151,8 +153,8 @@
neff_doublerange = xdem.spatialstats.number_effective_samples(area, params_vgm2)

# Convert into a standard error
stderr_singlerange = np.nanstd(dh.data) / np.sqrt(neff_singlerange)
stderr_doublerange = np.nanstd(dh.data) / np.sqrt(neff_doublerange)
stderr_singlerange = nmad(dh.data) / np.sqrt(neff_singlerange)
stderr_doublerange = nmad(dh.data) / np.sqrt(neff_doublerange)
list_stderr_singlerange.append(stderr_singlerange)
list_stderr_doublerange.append(stderr_doublerange)

11 changes: 11 additions & 0 deletions xdem/spatialstats.py
@@ -1520,6 +1520,17 @@ def sample_empirical_variogram(
df_mean["count"] = df_count["count"]
df = df_mean

# Fix variance error for Dowd's variogram in SciKit-GStat

# If skgstat > 1.0, we can use Dowd's without correcting, otherwise we correct
from packaging.version import Version

if Version(skg.__version__) <= Version("1.0.0"):
if "estimator" in kwargs.keys() and kwargs["estimator"] == "dowd":
# Correction: we divide all experimental variance values by 2
df.exp.values /= 2
df.err_exp.values /= 2

# Remove the last spatial lag bin which is always undersampled
df.drop(df.tail(1).index, inplace=True)

