From 27bf178d473bafa83d0a2a8f682aac72ca32bc1c Mon Sep 17 00:00:00 2001 From: Robinlovelace Date: Sat, 5 Oct 2024 18:37:42 +0000 Subject: [PATCH] Deploy commit: Update .bib, fix #1140 b24e7f4c90154fd4a0ee2d65b89727c4ed0b3020 --- 15-eco.md | 62 +++++++++++++-------------- 404.html | 2 +- adv-map.html | 2 +- algorithms.html | 2 +- attr.html | 2 +- conclusion.html | 2 +- eco.html | 68 +++++++++++++++--------------- figures/circle-intersection-1.png | Bin 13642 -> 13642 bytes figures/points-1.png | Bin 15698 -> 15698 bytes foreword-1st-edition.html | 2 +- foreword-2nd-edition.html | 2 +- geometry-operations.html | 2 +- gis.html | 2 +- index.html | 4 +- index.md | 4 +- intro.html | 2 +- location.html | 2 +- preface.html | 2 +- raster-vector.html | 2 +- read-write.html | 2 +- references.html | 4 +- reproj-geo-data.html | 2 +- search.json | 2 +- spatial-class.html | 2 +- spatial-cv.html | 2 +- spatial-operations.html | 2 +- transport.html | 2 +- 27 files changed, 91 insertions(+), 91 deletions(-) diff --git a/15-eco.md b/15-eco.md index a652ab872..8c81b5351 100644 --- a/15-eco.md +++ b/15-eco.md @@ -5,7 +5,7 @@ ## Prerequisites {-} This chapter assumes you have a strong grasp of geographic data analysis and processing, covered in Chapters \@ref(spatial-class) to \@ref(geometry-operations). -The chapter makes use of bridges to GIS software, and spatial cross-validation, covered in Chapters \@ref(gis) and \@ref(spatial-cv) respectively. +The chapter makes use of bridges to GIS software, and spatial cross-validation, covered in Chapters \@ref(gis) and \@ref(spatial-cv), respectively. The chapter uses the following packages: @@ -41,13 +41,13 @@ Unfortunately, fog oases are heavily endangered, primarily due to agriculture an Evidence on the composition and spatial distribution of the native flora can support efforts to protect remaining fragments of fog oases [@muenchow_predictive_2013; @muenchow_soil_2013]. In this chapter you will analyze the composition and the spatial distribution of vascular plants (here referring mostly to flowering plants) on the southern slope of Mt. Mongón, a *lomas* mountain near Casma on the central northern coast of Peru (Figure \@ref(fig:study-area-mongon)). -During a field study to Mt. Mongón, all vascular plants living in 100 randomly sampled 4x4 m^2^ plots in the austral winter of 2011 were recorded [@muenchow_predictive_2013]. +During a field study to Mt. Mongón, all vascular plants living in 100 randomly sampled 4*4 m^2^ plots in the austral winter of 2011 were recorded [@muenchow_predictive_2013]. The sampling coincided with a strong La Niña event that year, as shown in data published by the National Oceanic and Atmospheric Administration ([NOAA](https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php)). This led to even higher levels of aridity than usual in the coastal desert and increased fog activity on the southern slopes of Peruvian *lomas* mountains.
-(\#fig:study-area-mongon)The Mt. Mongón study area, from Muenchow, Schratz, and Brenning (2017).
+(\#fig:study-area-mongon)The Mt. Mongón study area. Figure taken from Muenchow, Schratz, and Brenning (2017).

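For readers coding along, the data described above can be loaded before the analysis starts. The following is a sketch rather than part of this commit: it assumes the plot, community, and raster data ship with the **spDataLarge** package, and the object names (`study_area`, `random_points`, `comm`) and file names (`dem.tif`, `ndvi.tif`) are assumptions based on that package.

``` r
# Sketch of loading the chapter's datasets; names assumed from spDataLarge
library(sf)
library(terra)
data("study_area", "random_points", "comm", package = "spDataLarge")
dem = terra::rast(system.file("raster/dem.tif", package = "spDataLarge"))
ndvi = terra::rast(system.file("raster/ndvi.tif", package = "spDataLarge"))
```
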
This chapter also demonstrates how to apply techniques covered in previous chapters to an important applied field: ecology. @@ -165,7 +165,7 @@ ep = qgisprocess::qgis_run_algorithm( ``` This returns a list named `ep` containing the paths to the computed output rasters. -Let's read in catchment area as well as catchment slope into a multi-layer `SpatRaster` object (see Section \@ref(raster-classes)). +Let's read-in catchment area as well as catchment slope into a multi-layer `SpatRaster` object (see Section \@ref(raster-classes)). Additionally, we will add two more raster objects to it, namely `dem` and `ndvi`. @@ -198,7 +198,6 @@ Finally, we can extract the terrain attributes to our field observations (see al ``` r -# terra::extract adds automatically a for our purposes unnecessary ID column ep_rp = terra::extract(ep, random_points, ID = FALSE) random_points = cbind(random_points, ep_rp) ``` @@ -207,21 +206,21 @@ random_points = cbind(random_points, ep_rp) Ordinations\index{ordination} are a popular tool in vegetation science to extract the main information, frequently corresponding to ecological gradients, from large species-plot matrices mostly filled with 0s. However, they are also used in remote sensing\index{remote sensing}, the soil sciences, geomarketing\index{geomarketing} and many other fields. -If you are unfamiliar with ordination\index{ordination} techniques or in need of a refresher, have a look at Michael W. Palmer's [web page](https://ordination.okstate.edu/overview.htm) for a short introduction to popular ordination techniques in ecology and at @borcard_numerical_2011 for a deeper look on how to apply these techniques in R. +If you are unfamiliar with ordination\index{ordination} techniques or in need of a refresher, have a look at [Michael W. Palmer's webpage](https://ordination.okstate.edu/overview.htm) for a short introduction to popular ordination techniques in ecology and at @borcard_numerical_2011 for a deeper look on how to apply these techniques in R. **vegan**'s\index{vegan (package)} package documentation is also a very helpful resource (`vignette(package = "vegan")`). Principal component analysis (PCA\index{PCA}) is probably the most famous ordination\index{ordination} technique. It is a great tool to reduce dimensionality if one can expect linear relationships between variables, and if the joint absence of a variable in two plots (observations) can be considered a similarity. This is barely the case with vegetation data. -For one, the presence of a plant often follows a unimodal, i.e. a non-linear, relationship along a gradient (e.g., humidity, temperature or salinity) with a peak at the most favorable conditions and declining ends towards the unfavorable conditions. +For one, the presence of a plant often follows a unimodal, i.e., a non-linear, relationship along a gradient (e.g., humidity, temperature or salinity) with a peak at the most favorable conditions and declining ends towards the unfavorable conditions. Secondly, the joint absence of a species in two plots is hardly an indication for similarity. -Suppose a plant species is absent from the driest (e.g., an extreme desert) and the most moistest locations (e.g., a tree savanna) of our sampling. +Suppose a plant species is absent from the driest (e.g., an extreme desert) and the moistest locations (e.g., a tree savanna) of our sampling. 
Then we really should refrain from counting this as a similarity because it is very likely that the only thing these two completely different environmental settings have in common in terms of floristic composition is the shared absence of species (except for rare ubiquitous species). -Non-metric multidimensional scaling (NMDS\index{NMDS}) is one popular dimension-reducing technique used in ecology [@vonwehrden_pluralism_2009]. -NMDS\index{NMDS} reduces the rank-based differences between the distances between objects in the original matrix and distances between the ordinated objects. +NMDS\index{NMDS} is one popular dimension-reducing technique used in ecology [@vonwehrden_pluralism_2009]. +NMDS\index{NMDS} reduces the rank-based differences between distances between objects in the original matrix and distances between the ordinated objects. The difference is expressed as stress. The lower the stress value, the better the ordination, i.e., the low-dimensional representation of the original matrix. Stress values lower than 10 represent an excellent fit, stress values of around 15 are still good, and values greater than 20 represent a poor fit [@mccune_analysis_2002]. @@ -295,7 +294,7 @@ plot(y = sc[, 1], x = elev, xlab = "elevation in m", The scores of the first NMDS\index{NMDS} axis represent the different vegetation formations, i.e., the floristic gradient, appearing along the slope of Mt. Mongón. -To spatially visualize them, we can model the NMDS\index{NMDS} scores with the previously created predictors (Section \@ref(data-and-data-preparation)), and use the resulting model for predictive mapping (see next section). +To spatially visualize them, we can model the NMDS\index{NMDS} scores with the previously created predictors (Section \@ref(data-and-data-preparation)), and use the resulting model for predictive mapping (see Section \@ref(modeling-the-floristic-gradient)). ## Modeling the floristic gradient @@ -336,8 +335,8 @@ The first internal node at the top of the tree assigns all observations which ar The observations falling into the left branch have a mean NMDS\index{NMDS} score of -1.198. Overall, we can interpret the tree as follows: the higher the elevation, the higher the NMDS\index{NMDS} score becomes. This means that the simple decision tree has already revealed four distinct floristic assemblages. -For a more in-depth interpretation please refer to the \@ref(predictive-mapping) section. -Decision trees have a tendency to overfit\index{overfitting}, that is they mirror too closely the input data including its noise which in turn leads to bad predictive performances (Section \@ref(intro-cv); @james_introduction_2013). +For a more in-depth interpretation, please refer to Section \@ref(predictive-mapping). +Decision trees have a tendency to overfit\index{overfitting}, that is, they mirror too closely the input data including its noise which in turn leads to bad predictive performances [Section \@ref(intro-cv); @james_introduction_2013]. Bootstrap aggregation (bagging) is an ensemble technique that can help to overcome this problem. Ensemble techniques simply combine the predictions of multiple models. Thus, bagging takes repeated samples from the same input data and averages the predictions. @@ -358,10 +357,10 @@ The code in this section largely follows the steps we have introduced in Section The only differences are the following: 1. 
The response variable is numeric, hence a regression\index{regression} task will replace the classification\index{classification} task of Section \@ref(svm)
-1. Instead of the AUROC\index{AUROC} which can only be used for categorical response variables, we will use the root mean squared error (RMSE\index{RMSE}) as performance measure
-1. We use a random forest\index{random forest} model instead of a support vector machine\index{SVM} which naturally goes along with different hyperparameters\index{hyperparameter}
+1. Instead of the AUROC\index{AUROC}, which can only be used for categorical response variables, we will use the root mean squared error (RMSE\index{RMSE}) as the performance measure
+1. We use a Random Forest\index{random forest} model instead of a Support Vector Machine\index{SVM}, which naturally goes along with different hyperparameters\index{hyperparameter}
1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see Exercises).
-Instead we show how to tune hyperparameters\index{hyperparameter} for (spatial) predictions
+Instead, we show how to tune hyperparameters\index{hyperparameter} for (spatial) predictions

Remember that 125,500 models were necessary to retrieve bias-reduced performance estimates when using 100-repeated 5-fold spatial cross-validation\index{cross-validation!spatial CV} and a random search of 50 iterations in Section \@ref(svm).
At the hyperparameter\index{hyperparameter} tuning level, we found the best hyperparameter combination, which in turn was used at the outer performance level for predicting the test data of a specific spatial partition (see also Figure \@ref(fig:inner-outer)).

Which one should we use for making spatial distribution maps?
The answer is simple: none at all.
Remember, the tuning was done to retrieve a bias-reduced performance estimate, not to do the best possible spatial prediction.
For the latter, one estimates the best hyperparameter\index{hyperparameter} combination from the complete dataset.
-This means, the inner hyperparameter\index{hyperparameter} tuning level is no longer needed which makes perfect sense since we are applying our model to new data (unvisited field observations) for which the true outcomes are unavailable, hence testing is impossible in any case.
+This means that the inner hyperparameter\index{hyperparameter} tuning level is no longer needed, which makes perfect sense since we are applying our model to new data (unvisited field observations) for which the true outcomes are unavailable, hence testing is impossible in any case.
Therefore, we tune the hyperparameters\index{hyperparameter} for a good spatial prediction on the complete dataset via a 5-fold spatial CV\index{cross-validation!spatial CV} with one repetition.

Having already constructed the input variables (`rp`), we are all set for specifying the **mlr3**\index{mlr3 (package)} building blocks (task, learner, and resampling).
-For specifying a spatial task, we use again the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021 & Section \@ref(spatial-cv-with-mlr3)], and since our response (`sc`) is numeric, we use a regression\index{regression} task.
+For specifying a spatial task, we again use the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021] (see Section \@ref(spatial-cv-with-mlr3)), and since our response (`sc`) is numeric, we use a regression\index{regression} task.
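The RMSE will serve as the tuning criterion below; as a quick reminder of what it measures (the square root of the mean squared difference between observed and predicted values, in the units of the response), here is a minimal base-R illustration with made-up numbers, not data from this chapter:

``` r
# RMSE illustrated with hypothetical observed and predicted NMDS scores
obs = c(-1.2, 0.3, 1.8, 2.5)  # made-up observations
prd = c(-0.9, 0.1, 2.0, 2.2)  # made-up predictions
sqrt(mean((obs - prd)^2))     # lower values indicate better predictions
#> [1] 0.254951
```
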
``` r @@ -388,15 +387,15 @@ task = mlr3spatiotempcv::as_task_regr_st( ``` Using an `sf` object as the backend automatically provides the geometry information needed for the spatial partitioning later on. -Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling. -Next, we go on to construct the a random forest\index{random forest} learner from the **ranger** package [@wright_ranger_2017]. +Additionally, we got rid of the columns `id` and `spri`, since these variables should not be used as predictors in the modeling. +Next, we go on to construct a random forest\index{random forest} learner from the **ranger** package [@wright_ranger_2017]. ``` r lrn_rf = lrn("regr.ranger", predict_type = "response") ``` -As opposed to, for example, support vector machines\index{SVM} (see Section \@ref(svm)), random forests often already show good performances when used with the default values of their hyperparameters (which may be one reason for their popularity). +As opposed to, for example, Support Vector Machines\index{SVM} (see Section \@ref(svm)), random forests often already show good performances when used with the default values of their hyperparameters (which may be one reason for their popularity). Still, tuning often moderately improves model results, and thus is worth the effort [@probst_hyperparameters_2018]. In random forests\index{random forest}, the hyperparameters\index{hyperparameter} `mtry`, `min.node.size` and `sample.fraction` determine the degree of randomness, and should be tuned [@probst_hyperparameters_2018]. `mtry` indicates how many predictor variables should be used in each tree. @@ -421,7 +420,7 @@ search_space = paradox::ps( Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function. Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr3)). -Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`). +Specifically, we will use a 5-fold spatial partitioning with only one repetition (`rsmp()`). In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html#sec-autotuner, @bischl_applied_2024]. The performance measure is the root mean squared error (RMSE\index{RMSE}). @@ -510,13 +509,13 @@ all(values(pred - pred_2) == 0) The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)). Please refer to @muenchow_soil_2013 for a detailed description of vegetation belts on *lomas* mountains. -The blue color tones represent the so-called *Tillandsia*-belt. +The blue color tones represent the so-called *Tillandsia* belt. *Tillandsia* is a highly adapted genus especially found in high quantities at the sandy and quite desertic foot of *lomas* mountains. -The yellow color tones refer to a herbaceous vegetation belt with a much higher plant cover compared to the *Tillandsia*-belt. +The yellow color tones refer to a herbaceous vegetation belt with a much higher plant cover compared to the *Tillandsia* belt. 
The orange colors represent the bromeliad belt, which features the highest species richness and plant cover. It can be found directly beneath the temperature inversion (ca. 750-850 m asl) where humidity due to fog is highest. Water availability naturally decreases above the temperature inversion, and the landscape becomes desertic again with only a few succulent species (succulent belt; red colors). -Interestingly, the spatial prediction clearly reveals that the bromeliad belt is interrupted which is a very interesting finding we would have not detected without the predictive mapping. +Interestingly, the spatial prediction clearly reveals that the bromeliad belt is interrupted, which is a very interesting finding we would have not detected without the predictive mapping. ## Conclusions @@ -530,15 +529,16 @@ Since *lomas* mountains are heavily endangered, the prediction map can serve as In terms of methodology, a few additional points could be addressed: - It would be interesting to also model the second ordination\index{ordination} axis, and to subsequently find an innovative way of visualizing jointly the modeled scores of the two axes in one prediction map -- If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models [@muenchow_predictive_2013;@zuur_mixed_2009;@zuur_beginners_2017]/ +- If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models [@muenchow_predictive_2013;@zuur_mixed_2009;@zuur_beginners_2017]. However, there are at least approaches that help to interpret machine learning models such as random forests\index{random forest} (see, e.g., [https://mlr-org.com/posts/2018-04-30-interpretable-machine-learning-iml-and-mlr/](https://mlr-org.com/posts/2018-04-30-interpretable-machine-learning-iml-and-mlr/)) - A sequential model-based optimization (SMBO) might be preferable to the random search for hyperparameter\index{hyperparameter} optimization used in this chapter [@probst_hyperparameters_2018] -Finally, please note that random forest\index{random forest} and other machine learning\index{machine learning} models are frequently used in a setting with lots of observations and many predictors, much more than used in this chapter, and where it is unclear which variables and variable interactions contribute to explaining the response. +Finally, please note that random forest\index{random forest} and other machine learning\index{machine learning} models are frequently used in a setting with much more observations and predictors than used in this Chapter, and where it is unclear which variables and variable interactions contribute to explaining the response. Additionally, the relationships might be highly non-linear. -In our use case, the relationship between response and predictors are pretty clear, there is only a slight amount of non-linearity and the number of observations and predictors is low. +In our use case, the relationship between response and predictors is pretty clear, there is only a slight amount of non-linearity and the number of observations and predictors is low. Hence, it might be worth trying a linear model\index{regression!linear}. -A linear model is much easier to explain and understand than a random forest\index{random forest} model, and therefore to be preferred (law of parsimony), additionally it is computationally less demanding (see Exercises). 
+A linear model is much easier to explain and understand than a random forest\index{random forest} model, and therefore preferred (law of parsimony).
+Additionally, it is computationally less demanding (see Exercises).
If the linear model cannot cope with the degree of non-linearity present in the data, one could also try a generalized additive model\index{generalized additive model} (GAM).
The point here is that the toolbox of a data scientist consists of more than one tool, and it is your responsibility to select the tool best suited for the task or purpose at hand.
Here, we wanted to introduce the reader to random forest\index{random forest} modeling and how to use the corresponding results for predictive mapping purposes.

E3. Retrieve the bias-reduced RMSE of a random forest\index{random forest} and a linear model using spatial cross-validation\index{cross-validation!spatial CV}.
The random forest modeling should include the estimation of optimal hyperparameter\index{hyperparameter} combinations (random search with 50 iterations) in an inner tuning loop.
Parallelize\index{parallelization} the tuning level.
Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs.
-Please not that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).
+Please note that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).
diff --git a/404.html b/404.html
index 2277f3114..71d123ec4 100644
--- a/404.html
+++ b/404.html
@@ -129,7 +129,7 @@

Second Edition