forked from geocompx/geocompr
-
Notifications
You must be signed in to change notification settings - Fork 0
/
_15-ex.Rmd
243 lines (222 loc) · 9.64 KB
/
_15-ex.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
The solutions assume the following packages are attached (other packages will be attached when needed):
```{r 15-ex-e0, message=FALSE, warning=FALSE}
library(sf)
library(terra)
library(data.table)
library(dplyr)
library(future)
library(ggplot2)
library(lgr)
library(mlr3)
library(mlr3learners)
library(mlr3spatiotempcv)
library(mlr3tuning)
library(mlr3viz)
library(progressr)
library(qgisprocess)
library(tictoc)
library(vegan)
```
E1. Run a NMDS\index{NMDS} using the percentage data of the community matrix.
Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data.
What might explain the observed difference?
```{r 15-ex-e1, message=FALSE}
data("comm", package = "spDataLarge")
pa = vegan::decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]
comm = comm[rowSums(comm) != 0, ]
set.seed(25072018)
nmds_pa = vegan::metaMDS(comm = pa, k = 4, try = 500)
nmds_per = vegan::metaMDS(comm = comm, k = 4, try = 500)
nmds_pa$stress
nmds_per$stress
```
```{asis 15-ex-e1-asis, message=FALSE}
The NMDS using the presence-absence values yields a better result (`nmds_pa$stress`) than the one using percentage data (`nmds_per$stress`).
This might seem surprising at first sight.
On the other hand, the percentage matrix contains both more information and more noise.
Another aspect is how the data was collected.
Imagine a botanist in the field.
It might seem feasible to differentiate between a plant which has a cover of 5% and another species that covers 10%.
However, what about a herbal species that was only detected three times and consequently has a very tiny cover, e.g., 0.0001%.
Maybe another herbal species was detected 6 times, is its cover then 0.0002%?
The point here is that percentage data as specified during a field campaign might reflect a precision that the data does not have.
This again introduces noise which in turn will worsen the ordination result.
Still, it is a valuable information if one species had a higher frequency or coverage in one plot than another compared to just presence-absence data.
One compromise would be to use a categorical scale such as the Londo scale.
```
E2. Compute all the predictor rasters\index{raster} we have used in the chapter (catchment slope, catchment area), and put them into a `SpatRaster`-object.
Add `dem` and `ndvi` to it.
Next, compute profile and tangential curvature and add them as additional predictor rasters (hint: `grass7:r.slope.aspect`).
Finally, construct a response-predictor matrix.
The scores of the first NMDS\index{NMDS} axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join).
To complete the response-predictor matrix, extract the values of the environmental predictor raster object to `random_points`.
```{r 15-ex-e2, eval = FALSE}
# first compute the terrain attributes we have also used in the chapter
library(dplyr)
library(terra)
library(qgisprocess)
library(vegan)
data("comm", "random_points", package = "spDataLarge")
dem = terra::rast(system.file("raster/dem.tif", package = "spDataLarge"))
ndvi = terra::rast(system.file("raster/ndvi.tif", package = "spDataLarge"))
# use presence-absence matrix and get rid of empty
pa = vegan::decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]
# compute environmental predictors (ep) catchment slope and catchment area
ep = qgisprocess::qgis_run_algorithm(
alg = "saga:sagawetnessindex",
DEM = dem,
SLOPE_TYPE = 1,
SLOPE = tempfile(fileext = ".sdat"),
AREA = tempfile(fileext = ".sdat"),
.quiet = TRUE)
# read in catchment area and catchment slope
ep = ep[c("AREA", "SLOPE")] |>
unlist() |>
terra::rast()
# assign proper names
names(ep) = c("carea", "cslope")
# make sure all rasters share the same origin
origin(ep) = origin(dem)
# add dem and ndvi to the multilayer SpatRaster object
ep = c(dem, ndvi, ep)
ep$carea = log10(ep$carea)
# compute the curvatures
qgis_show_help("grass7:r.slope.aspect")
curvs = qgis_run_algorithm(
"grass7:r.slope.aspect",
elevation = dem,
.quiet = TRUE)
# adding curvatures to ep
curv_nms = c("pcurvature", "tcurvature")
curvs = curvs[curv_nms] |>
unlist() |>
terra::rast()
curvs = terra::app(curvs, as.numeric)
names(curvs) = curv_nms
ep = c(ep, curvs)
random_points[, names(ep)] =
# terra::extract adds an ID column, we don't need
terra::extract(ep, random_points) |>
select(-ID)
elev = dplyr::filter(random_points, id %in% rownames(pa)) %>%
dplyr::pull(dem)
# rotating NMDS in accordance with altitude (proxy for humidity)
rotnmds = MDSrotate(nmds_pa, elev)
# extracting the first two axes
sc = vegan::scores(rotnmds, choices = 1:2, display = "sites")
rp = data.frame(id = as.numeric(rownames(sc)),
sc = sc[, 1])
# join the predictors (dem, ndvi and terrain attributes)
rp = dplyr::inner_join(random_points, rp, by = "id")
# saveRDS(rp, "extdata/15-rp_exercises.rds")
```
E3. Retrieve the bias-reduced RMSE of a random forest\index{random forest} and a linear model using spatial cross-validation\index{cross-validation!spatial CV}.
The random forest modeling should include the estimation of optimal hyperparameter\index{hyperparameter} combinations (random search with 50 iterations) in an inner tuning loop (see Section \@ref(svm)).
Parallelize\index{parallelization} the tuning level (see Section \@ref(svm)).
Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs.
Please not that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).
```{r 15-ex-e3, message=FALSE, eval=FALSE}
library(dplyr)
library(future)
library(mlr3)
library(mlr3spatiotempcv)
library(mlr3learners)
library(mlr3viz)
library(paradox)
library(ranger)
# in case you have not run the previous exercises, run:
# rp = readRDS("extdata/15-rp_exercises.rds")
# define the task
task = mlr3spatiotempcv::as_task_regr_st(
dplyr::select(rp, -id, -spri),
target = "sc",
id = "mongon")
# define the learners
mlr3::mlr_learners
# linear model
lrn_lm = mlr3::lrn("regr.lm", predict_type = "response")
# random forest
lrn_rf = mlr3::lrn("regr.ranger", predict_type = "response")
# now define the AutoTuner of the random forest
search_space = paradox::ps(
mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),
sample.fraction = paradox::p_dbl(lower = 0.2, upper = 0.9),
min.node.size = paradox::p_int(lower = 1, upper = 10)
)
at_rf = mlr3tuning::AutoTuner$new(
learner = lrn_rf,
# spatial partitioning
resampling = mlr3::rsmp("spcv_coords", folds = 5),
# performance measure
measure = mlr3::msr("regr.rmse"),
search_space = search_space,
# random search with 50 iterations
terminator = mlr3tuning::trm("evals", n_evals = 50),
tuner = mlr3tuning::tnr("random_search")
)
# define the resampling strategy
rsmp_sp = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100)
# create the benchmark design
design_grid = mlr3::benchmark_grid(
tasks = task,
learners = list(lrn_lm, at_rf),
resamplings = rsmp_sp)
print(design_grid)
# execute the outer loop sequentially and parallelize the inner loop
future::plan(list("sequential", "multisession"),
workers = floor(future::availableCores() / 2))
set.seed(10112022)
# reduce verbosity
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("info")
# BE CAREFUL: Running the benchmark might take quite some time
# therefore, we have stored the result of the benchmarking in
# extdata/15-bmr-exercises.rds (see below)
tictoc::tic()
progressr::with_progress(expr = {
bmr = mlr3::benchmark(
design = design_grid,
# New argument `encapsulate` for `resample()` and `benchmark()` to
# conveniently enable encapsulation and also set the fallback learner to the
# respective featureless learner. This is simply for convenience, configuring
# each learner individually is still possible and allows a more fine-grained
# control
encapsulate = "evaluate",
store_backends = FALSE,
store_models = FALSE)
})
tictoc::toc()
# stop parallelization
future:::ClusterRegistry("stop")
# we have saved the result as follows
# saveRDS(bmr, file = "extdata/15-bmr_exercises.rds")
# READ IT IN in in case you don't want to run the spatial CV yourself as it is computationally
# demanding
# bmr = readRDS("extdata/15-bmr_exercises.rds")
# mean RMSE
bmr$aggregate(measures = msr("regr.rmse"))
# or computed manually
agg = bmr$aggregate(measures = msr("regr.rmse"))
# lm performs slightly better when considering the mean rmse
purrr::map(agg$resample_result, ~ mean(.$score(msr("regr.rmse"))$regr.rmse))
# however, looking at the median, the random forest performs slightly better
purrr::map(agg$resample_result, ~ median(.$score(msr("regr.rmse"))$regr.rmse))
# make a boxplot (when using autoplot, mlr3viz needs to be attached!)
library(mlr3viz)
autoplot(bmr, measure = msr("regr.rmse"))
# or doing it "manually"
# extract the AUROC values and put them into one data.table
d = purrr::map_dfr(agg$resample_result, ~ .$score(msr("regr.rmse")))
# create the boxplots
library(ggplot2)
ggplot(data = d, mapping = aes(x = learner_id, y = regr.rmse)) +
geom_boxplot(fill = c("lightblue2", "mistyrose2")) +
theme_bw() +
labs(y = "RMSE", x = "model")
```
```{asis 15-ex-e3-asis, message=FALSE}
In fact, `lm` performs at least as good the random forest model, and thus should be preferred since it is much easier to understand and computationally much less demanding (no need for fitting hyperparameters).
But keep in mind that the used dataset is small in terms of observations and predictors and that the response-predictor relationships are also relatively linear.
```