\documentclass[12pt]{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{color}
\usepackage{url}
\usepackage{caption}
\usepackage{epstopdf}
\begin{document}
%\bibliographystyle{plain}
\title{User Manual for SAHM package for VisTrails}
\author{Colin B. Talbert and Marian K. Talbert}
\maketitle
\vspace{2in}
\pagebreak
\tableofcontents
\pagebreak
\begin{flushleft}
\LARGE
\textbf{User Manual for SAHM package for VisTrails} \\*
\normalsize
\vspace{5mm}
Colin B. Talbert and Marian K. Talbert
\vspace{1cm}
\end{flushleft}
\setlength{\parskip}{.5cm}
\section{FieldData}
The FieldData module allows a user to add presence/absence points or count data recorded
across a landscape for the phenomenon being modeled (e.g., plant sightings, evidence of animal
presence, etc.). The input data for this module must be in the form of a .csv file that follows
one of two formats:

Format 1:
A .csv file with the following column headings, in order: ``X," ``Y," and ``responseBinary".
In this case, the ``X" field should be populated with the horizontal (longitudinal) positional
data for a sample point. The ``Y" field should be populated with the vertical (latitudinal) data
for a sample point. These values must be in the same coordinate system/units as the template
layer used in the workflow. The column ``responseBinary" should be populated with either a `0'
(indicating absence at the point) or a `1' (indicating presence at the point).

Format 2:
A .csv file with the following column headings, in order: ``X," ``Y," and ``responseCount". In
this case, the ``X" field should be populated with the horizontal (longitudinal) positional data
for a sample point. The ``Y" field should be populated with the vertical (latitudinal) data for a
sample point. These values must be in the same coordinate system/units as the template layer
used in the workflow. The column ``responseCount" should be populated with either a `-9999'
(indicating that the point is a background point) or a numerical value (either `0' or a positive
integer) indicating the number of incidences of the phenomenon recorded at that point.
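
As an illustration of the two layouts, the first few rows of each file might look like the samples below. All coordinate and response values are hypothetical and only show the expected column order; real coordinates must be in the coordinate system of the template layer.

A Format 1 (presence/absence) file:
\begin{verbatim}
X,Y,responseBinary
482350.0,4490120.0,1
483975.5,4488410.0,0
481120.0,4491875.5,1
\end{verbatim}

A Format 2 (count) file, where the last row is a background point:
\begin{verbatim}
X,Y,responseCount
482350.0,4490120.0,3
483975.5,4488410.0,0
481120.0,4491875.5,-9999
\end{verbatim}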
\subsection*{Output Ports}
\begin{itemize}
\item value (mandatory)
This is the actual file object that is being passed to other modules in the workflow.
\textbf{Common connections}
\begin{itemize}
\item The `fieldData\_file' input port of the FieldDataQuery Module if the field data needs subsetting or
aggregation.
\item The `fieldData' input port of the FieldDataAggregateAndWeight Module if the field data needs to be
aggregated or weighted to match the spatial resolution of the template layer.
\item The `fieldData' input port of the MDS builder Module if the field data needs no further pre-
processing prior to modeling.
\end{itemize}
\item value\_as\_string (optional)
This is a VisTrails port that is not used in general SAHM workflows.
\textbf{Common connections}
\begin{itemize}
\item This does not commonly connect to other SAHM modules.
\end{itemize}
\end{itemize}
\section{Predictor}
The Predictor module allows a user to select a single raster layer for consideration in the
modeled analysis. Besides selecting the file the user also specifies the parameters to use for
resampling, aggregation, and whether the data is categorical.
\subsection*{Input Ports}
\begin{itemize}
\item categorical (optional)
This parameter allows a user to indicate the type of data represented. The distinction
between continuous and categorical data will be maintained throughout a workflow by appending the
word `\_categorical' to categorical layer names in the resulting MDS file. It is also important
to select the nearest neighbor resampling option for categorical layers.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked) - The data contained in the raster layer is categorical (e.g., landcover categories).
\item False (Unchecked) - The data contained in the raster is continuous (e.g., a DEM layer).
\end{itemize}
\item ResampleMethod (mandatory)
The resample method employed to interpolate new cell values when transforming the raster
layer to the coordinate space or cell size of the template layer.
\textbf{Options}
\begin{itemize}
\item near: nearest neighbour resampling. Fastest algorithm, worst interpolation quality, but the best choice
for categorical data.
\item bilinear: bilinear resampling, good choice for continuous data.
\item cubic: cubic resampling.
\item cubicspline: cubic spline resampling.
\item lanczos: Lanczos windowed sinc resampling.
\item See \url{http://www.gdal.org/gdalwarp.html} for context.
\end{itemize}
\item AggregationMethod (mandatory)
The aggregation method to be used in the event that the raster layer must be up-scaled to
match the template layer (e.g., generalizing a 10 m input layer to a 100 m output layer).
Care should be taken to ensure that the aggregation method that best preserves the integrity
of the data is used. See the PARC module documentation for more information on how
resampling and aggregation are performed.
\textbf{Options}
\begin{itemize}
\item Mean: Average value of all constituent pixels used.
\item Max: Maximum value of all constituent pixels used.
\item Min: Minimum value of all constituent pixels used.
\item Majority: The value occurring most frequently in constituent pixels used.
\item None: No Aggregation used.
\end{itemize}
\item file (mandatory)
The location of the raster file. A user can navigate to the location on their file system.
When selecting an ESRI grid raster, the user should navigate to the `hdr.adf' file
contained within the grid folder.
\end{itemize}
\subsection*{Output Ports}
\begin{itemize}
\item value (mandatory)
\textbf{Common connections}
\begin{itemize}
\item The output from this port only connects to the PARC input port `predictor'.
\end{itemize}
\item value\_as\_string (optional)
This is a VisTrails port that is not used in general SAHM workflows.
\textbf{Common connections}
\begin{itemize}
\item Does not generally connect to other SAHM modules.
\end{itemize}
\end{itemize}
\section{TemplateLayer}
The second fundamental input in an analysis is the template layer. It is used to define the
extent and resolution that will be used in all subsequent analyses. The TemplateLayer is a
raster data layer with a defined coordinate system, a known cell size, and an extent that
defines the study area. The data type and values in this raster are not important. All
additional raster layers used in the analysis will be resampled and reprojected as needed to
match the template, snapped to the template, and clipped to have an extent that matches the
template. Users should ensure that additional covariates considered in the analysis have
complete coverage of the template layer used.
\subsection*{Output Ports}
\begin{itemize}
\item value (mandatory)
This is the actual file object that is being passed to other modules in the workflow.
\textbf{Common connections}
\begin{itemize}
\item The `TemplateLayer' input port of the FieldDataAggregateAndWeight Module.
\item The `TemplateLayer' input port of the PARC Module.
\end{itemize}
\item value\_as\_string (optional)
This is a VisTrails port that is not used in general SAHM workflows.
\textbf{Common connections}
\begin{itemize}
\item This does not commonly connect to other SAHM modules.
\end{itemize}
\end{itemize}
\section{PredictorListFile}
The PredictorListFile module allows a user to load a .csv file containing a list of rasters
for consideration in the modeled analysis. The .csv file should contain a header row and four
columns containing the following information, in order, for each raster input.

Column 1: The full file path to the input raster layer.

Column 2: A binary value indicating whether the input layer is categorical or not. A value
of ``0" indicates that an input raster is non-categorical data (continuous), while a value of ``1"
indicates that an input raster is categorical data.

Column 3: The resampling method employed to interpolate new cell values when transforming
the raster layer to the coordinate space or cell size of the template layer, if necessary. The
resampling type should be specified using one of the following values: ``nearestneighbor,"
``bilinear," ``cubic," or ``lanczos."

Column 4: The aggregation method to be used in the event that the raster layer must be up-scaled
to match the template layer (e.g., generalizing a 10 m input layer to a 100 m output
layer). Care should be taken to ensure that the aggregation method that best preserves the
integrity of the data is used. The aggregation should be specified using one of the following
values: ``Min," ``Mean," ``Max," ``Majority," or ``None."

In formatting the list of predictor files, the titles assigned to each of the columns are
unimportant because the module retrieves the information based on the order of the values in the .csv
file (the ordering of the information and the permissible values in the file, however, are
strictly enforced). The module also anticipates a header row and will ignore the first row in
the .csv file.
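
As an illustration of this layout, a small predictor list might look like the sample below. The file paths and header titles are hypothetical; only the column order and the permitted values follow the description above.
\begin{verbatim}
file,categorical,resampling,aggregation
/data/rasters/elevation.tif,0,bilinear,Mean
/data/rasters/landcover.tif,1,nearestneighbor,Majority
/data/rasters/precip.tif,0,cubic,None
\end{verbatim}
Here the categorical land cover layer is paired with nearest neighbor resampling and majority aggregation, so that no new category values are created when the layer is matched to the template.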
\subsection*{Input Ports}
\begin{itemize}
\item csvFileList (optional)
This is the CSV file on the file system. While not strictly mandatory, this port will almost
always have an input.
\item predictor (optional)
Allows a user to add individual Predictor modules to a PredictorListFile.
\textbf{Common connections}
\begin{itemize}
\item The output port `value' of a Predictor module.
\end{itemize}
\end{itemize}
\subsection*{Output Ports}
\begin{itemize}
\item RastersWithPARCInfoCSV (mandatory)
This port generally connects to the input port `RastersWithPARCInfoCSV' on the PARC module.
\end{itemize}
\section{BoostedRegressionTree}
BRT uses decision trees to partition the parameter space into the most homogeneous
groups in terms of the response. BRT starts with a single decision tree, then adds a tree that
best explains the error in the first tree, and so on. Like random forest, BRT models automatically
model interactions and nonlinear relationships and are robust to missing observations. Our
implementation makes approximately 1,000 trees. It incorporates advanced algorithms for tuning
the model settings, simplifying the model using a cross-validation technique, and detecting
important interactions between covariates. If more than 500 presence or absence records are
found, a random subset will be used for learning rate estimation and model simplification, but all
data will be used in the final model fitting step. The cross-validation step within BRT should
not be confused with that provided by the Model Selection Cross Validation step. The former is
used to optimize parameter values when defaults are not provided, while the latter is used to
select models based on between-model comparisons of evaluation metrics. All discussion of
cross-validation related to setting parameters in the BRT argument documentation refers to the
algorithm used for parameter optimization and does not affect the cross-validation split
selected by Model Selection Cross Validation.

Several options are available for fitting BRTs. When BRT is run using VisTrails, special attention is
required before moving away from the defaults because selection of certain parameters will
disallow selection of others. Optional parameters are described briefly here, but a more
in-depth description can be found in Elith and Leathwick 2008.
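
The stagewise fitting described above can be summarized in standard boosting notation. This is a sketch for orientation only; the symbols are introduced here and are not taken from the SAHM code.
\begin{equation*}
f_m(x) = f_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M,
\end{equation*}
where $f_0$ is the starting model, $h_m$ is the tree fitted to the error remaining after $f_{m-1}$, $\nu$ with $0 < \nu < 1$ is the learning rate (the LearningRate port), and $M$ is the final number of trees. A smaller $\nu$ shrinks the contribution of each tree, so more trees are generally needed to reach the same fit.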
\subsection*{Input Ports}
\begin{itemize}
\item mdsFile (mandatory)
The input data set, consisting of locational data for each sample point and the values of
each predictor variable at those points. This input file is almost always generated by the
upstream steps.
\textbf{Common connections}
\begin{itemize}
\item The mdsFile can be produced by any of MDSBuilder, ModelEvaluationSplit,
ModelSelectionCrossValidation, ModelSelectionSplit, or CovariateCorrelationAndSelection.
\end{itemize}
\item makeBinMap (optional)
Indicate whether to discretize the continuous probability map into presence/absence. See the
ThresholdOptimizationMethod for how this is done. If time is a concern and many models are
to be fit and assessed, maps can be produced after model selection for only the best models
using the Select and Test the Final Model tool. Options are available for producing
Probability, Binary and MESS maps there as well.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item makeProbabilityMap (optional)
Indicate whether a map of predicted values is to be produced for the model fit.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item makeMESMap (optional)
Indicate whether to produce a multivariate environmental similarity surface (MESS) and a map
of which factor is limiting at each point; see Elith et al. 2010 for more details. If time
is a concern and many models are to be fit and assessed, maps can be produced after model
selection for only the best models using the Select and Test the Final Model tool. Options
are available for producing Probability, Binary and MESS maps there as well.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item ThresholdOptimizationMethod (optional)
Determines how the threshold is set in order to discretize continuous predictions into
binary. The threshold is used for evaluation metrics calculated from the confusion matrix as
well as for the binary map. The options, taken directly from the PresenceAbsence package in R,
are:
\begin{enumerate}
\item Threshold=0.5
\item Sens=Spec: sensitivity = specificity
\item MaxSens+Spec: maximizes (sensitivity+specificity)/2
\item MaxKappa: maximizes Kappa
\item MaxPCC: maximizes PCC (percent correctly classified)
\item PredPrev=Obs: predicted prevalence = observed prevalence
\item ObsPrev: threshold = observed prevalence
\item MeanProb: mean predicted probability
\item MinROCdist: minimizes distance between ROC plot and (0,1)
\end{enumerate}
The threshold calculated for the training portion of the data is applied to the test portion.
If cross-validation was specified, the threshold is calculated separately for each fold
from that fold's training data and applied to the test data for the held-out
fold.
\textbf{Default value} = 2
\textbf{Options}
\begin{itemize}
\item any integer between and including 1 and 9
\end{itemize}
\item Seed (optional)
The random number seed used by BRT. If one desires to reproduce results from a previous BRT
fit, one must enter the random number seed that is reported in the textual output from that
model fit. The seed used is always reported in the textual output.
\textbf{Default value} = Randomly Generated
\textbf{Options}
\begin{itemize}
\item Any integer between -2147483647 and 2147483647
\end{itemize}
\item TreeComplexity (optional)
Sets the level of interactions fitted in the model. A tree complexity of 1 fits no
interactions, 2 will fit up to, but not necessarily all, two-way interactions, and so on.
\textbf{Default value} = If not set, tree complexity will be selected based on the number of observations and what produces the best model.
\textbf{Options}
\begin{itemize}
\item any positive integer (generally no greater than 3)
\end{itemize}
\item BagFraction (optional)
Controls the proportion of the data that is used to fit the model at each step. Using a bag
fraction of 1 will give a fully deterministic model but this is generally not preferable as
stochasticity generally improves model performance (Elith and Leathwick 2008).
\textbf{Default value} = .75
\textbf{Options}
\begin{itemize}
\item Any positive number greater than 0 and less than or equal to 1
\end{itemize}
\item NumberOfFolds (optional)
If cross-validation is used for model simplification, this sets the number of folds used for
cross-validation.
\textbf{Default value} = 3
\textbf{Options}
\begin{itemize}
\item A positive integer (generally between 2 and 10)
\end{itemize}
\item Alpha (optional)
Controls when the algorithm stops in the model simplification step. The change in deviance
is calculated between the previous and current iteration in model simplification and if the
average change in deviance per observation is less than the standard error of the original
deviance multiplied by alpha then the simplification step is accepted as long as we have not
reached the maximum number of drops allowed.
\textbf{Default value} = 1
\textbf{Options}
\begin{itemize}
\item Any positive floating point value is valid
\end{itemize}
\item PrevalenceStratify (optional)
This specifies whether cross validation samples should be stratified to match the overall
prevalence. This is currently only valid for presence absence data and is only used in
model simplification.
\textbf{Default value} = True (Checked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item ToleranceMethod (optional)
Method used in determining when to stop model simplification.
\textbf{Default value} = ``auto"
\textbf{Options}
\begin{itemize}
\item Either ``auto" or ``fixed"
\end{itemize}
\item Tolerance (optional)
Can be set to control the stopping rule in model simplification. If ToleranceMethod is set
to auto this value will be multiplied by the mean total deviance of the null model.
Change in deviance is compared to the tolerance to determine when to stop model
simplification.
\textbf{Default value} = .001
\textbf{Options}
\begin{itemize}
\item Any positive floating point value is valid
\end{itemize}
\item LearningRate (optional)
Controls the amount each tree contributes to the model. A small learning rate restricts
individual tree contributions to the overall model.
\textbf{Default value} = If not specified, learning rate will be determined based on the number of trees and the tree complexity.
\textbf{Options}
\begin{itemize}
\item Any positive number greater than 0 and less than 1
\end{itemize}
\item MaximumTrees (optional)
The absolute upper limit on the total number of trees to fit. Setting this below 5000 will
result in an error.
\textbf{Default value} = 10,000
\textbf{Options}
\begin{itemize}
\item Any positive integer greater than 5,000
\end{itemize}
\end{itemize}
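
The quantities named in the ThresholdOptimizationMethod options can be written out explicitly. The following is a sketch in standard confusion-matrix notation; the symbols $a$, $b$, $c$, $d$ and $n$ are introduced here for illustration and are not taken from the SAHM or PresenceAbsence code. For a given threshold, let $a$ be the number of true positives, $b$ the number of false positives, $c$ the number of false negatives, $d$ the number of true negatives, and $n = a + b + c + d$. Then
\begin{align*}
\text{sensitivity} &= \frac{a}{a+c}, & \text{specificity} &= \frac{d}{b+d}, \\
\text{PCC} &= \frac{a+d}{n}, & \text{Kappa} &= \frac{P_o - P_e}{1 - P_e},
\end{align*}
where $P_o = (a+d)/n$ is the observed agreement and $P_e = \left[(a+b)(a+c) + (c+d)(b+d)\right]/n^2$ is the agreement expected by chance. Because the ROC plot places $1 - \text{specificity}$ on the horizontal axis and sensitivity on the vertical axis, the distance minimized by option 9 is
\begin{equation*}
\sqrt{(1-\text{sensitivity})^2 + (1-\text{specificity})^2}.
\end{equation*}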
\subsection*{Output Ports}
\begin{itemize}
\item modelWorkspace
The R workspace where all internal details regarding the fitted model are stored. This is
used by the Select and Test the Final Model module.
\textbf{Common connections}
\begin{itemize}
\item `modelWorkspace' port of SAHMModelOutputViewerCell for viewing the aspatial model output.
\item `modelWorkspace' port of SAHMSpatialOutputViewerCell for viewing the spatial model output in a mini
GIS.
\end{itemize}
\item BinaryMap
If specified using makeBinMap=True, a surface of binary predictions is produced by
discretizing the probability map based on the selected threshold. This map indicates
whether one could expect each site to be occupied or unoccupied based on the model.
\item ProbabilityMap
If specified using makeProbabilityMap=True, a surface of predicted values is produced
based on the tiffs in the input .mds file and the fitted model. These can, but do not always,
indicate the probability of finding the species at a given site. For example, if model
calibration is poor, these values will not agree well with the true probabilities, though
discrimination between presences and absences might still be good.
\item ResidualsMap
Model residual plots show the spatial relationship between the model deviance residuals.
Most models assume residuals will be independent; thus a spatial pattern in the deviance
residuals can be indicative of a problem with the model fit and with inference based on the fit.
It can, for example, indicate that important predictors were not included in the model, and it can
be compared with the spatial pattern of predictors that were not included in the model.
Whether or not a significant spatial pattern exists in model residuals can at times be
difficult to assess visually. We hope to add correlograms of Moran's I soon. Unfortunately,
statistical tests based on the Moran's I statistic for residuals of binary response models
lack statistical justification and thus cannot be used to test for a significant spatial
pattern (Bivand 2008). See Dormann 2007 for more discussion on evaluation of model
residuals and spatial models that are appropriate for species distribution modeling.
Residual plots can also be used to determine if certain observations contribute
disproportionately to the deviance of the fitted model. For a binary response model,
deviance residuals with absolute values greater than 2 can be indicative of a problem.
\item MessMap
If specified by selecting makeMESMap=True, the MESS and MoD surfaces will be produced.
The MESS surface is the multivariate environmental similarity surface and shows how well each
point fits into the univariate ranges of the points for which the model was fit. Negative
values in this map indicate that the point is out of the range of the training data. The
MESS map takes the minimum value of a statistic calculated for each predictor and thus cannot
diagnose hidden extrapolation as one might do using a hat matrix. This surface is only
calculated for variables that are selected in the model selection step within each model
fitting algorithm, so variables that do not significantly affect the occurrence of the
organism over the range of the training data will not be included in the MESS map even
though these predictors might be influential to the organism outside the range in which the
model was fit. Random Forest never drops predictors, so if one wishes to compare the MESS
and MoD maps before and after insignificant predictors were dropped, one can compare the MESS
map of a Random Forest fit to that produced from the other model fit algorithms as long as
they were fit using the same dataset. See Elith et al. 2010 for details on how the MESS
map calculations are performed.
\item MoDMap
If specified by selecting makeMESMap=True, the MESS and MoD surfaces will be produced.
The MoD map is related to the MESS map and indicates which variable was furthest from the
range over which the model was fit for each spatial location. See Elith et al. 2010 for
details on how the MESS map calculations are performed.
\item modelEvalPlot
For binary data this will be a receiver operating characteristic (ROC) curve, which shows the
relationship between sensitivity and specificity as the threshold for discretizing
continuous predictions into presence/absence is varied. The threshold selected using the
specified ThresholdOptimizationMethod is shown. If a model selection test/training split
was specified, the ROC curve for this split will be shown in red, and if a cross-validation split
was specified, ROC curves for each cross-validation fold will be overlaid with box plots
summarizing cross-validation results. If the model fits well, the ROC curve should lie
well above the diagonal line. If there is a strong disparity between
the curve for the training data and either the testing split or cross-validation standard
deviation curves, this can be indicative of model overfitting. These plots and the
evaluation metrics based on the confusion matrix describe the model's ability to discriminate
between presence and absence points. The AUC value, or area under the ROC curve, is the
probability that the model will rank a randomly chosen presence observation higher than a
randomly chosen absence observation. For count data this display will show several standard
plots for assessment of model residuals.
\item ResponseCurves
Model response curves show the relationship between the fitted values and each predictor
included in the model, while holding all other predictors constant at their means. MARS
response curves are shown on a logit scale; thus the response axis will not necessarily be
bounded on the 0 to 1 interval. BRT response curves will show response surfaces for any
interaction terms included in the final model along with the percent relative influence.
\item Text\_Output
This file contains a summary of the model fit. The information contained here includes the
number of presence observations (counts equal to or greater than 1 for count models), the
number of absence points, and the number of covariates that were considered by the model
selection algorithm. Note that all of these can differ from the numbers in the original .mds file because
incomplete records are deleted and predictors with only one unique value are
removed. The random number seed is recorded if applicable, which allows completely
reproducible results. A summary of the model fit is also included. Evaluation statistics are
reported for the data used to fit the model as well as for the test or cross-validation
split if applicable. References for how to interpret most of these are ubiquitous in the
literature, but it is worth mentioning that interpretation of the calibration statistics is
described by Pearce and Ferrier 2000 as well as Miller and Hui 1991. Most metrics reported
here can also be found in related graphical displays.
\end{itemize}
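
The MESS and MoD calculations referred to above can be sketched as follows. This is our reading of the definition given in Elith et al. 2010, and the notation is introduced here for illustration. For predictor $i$, let $\min_i$ and $\max_i$ be its minimum and maximum over the training points, $p_i$ its value at the location being assessed, and $f_i$ the percentage of training points whose value of predictor $i$ is smaller than $p_i$. The similarity for predictor $i$ is
\begin{equation*}
s_i =
\begin{cases}
100 \, (p_i - \min_i)/(\max_i - \min_i) & \text{if } f_i = 0, \\
2 f_i & \text{if } 0 < f_i \le 50, \\
2 \, (100 - f_i) & \text{if } 50 < f_i < 100, \\
100 \, (\max_i - p_i)/(\max_i - \min_i) & \text{if } f_i = 100,
\end{cases}
\end{equation*}
and the two mapped quantities are
\begin{equation*}
\text{MESS} = \min_i s_i, \qquad \text{MoD} = \arg\min_i s_i .
\end{equation*}
Under this definition $s_i$ is negative only when $p_i$ lies outside the interval from $\min_i$ to $\max_i$, which is why negative MESS values indicate a location outside the range of the training data.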
References:

Bivand, R.S., Pebesma, E.J., and Gomez-Rubio, V. (2008). Applied Spatial Data Analysis with R.
Springer, New York, NY.

Dormann, C.F., McPherson, J.M., Araujo, M.B., Bivand, R., Bolliger, J., et al. (2007). Methods to
account for spatial autocorrelation in the analysis of species distributional data: a
review. Ecography 30:609--628.

Elith, J., Kearney, M., and Phillips, S. (2010). The art of modelling range-shifting species. Methods
in Ecology and Evolution 1:330--342.

Elith, J., Leathwick, J.R., and Hastie, T. (2008). A working guide to boosted regression trees.
Journal of Animal Ecology 77:802--813.

Miller, M.E., Hui, S.L., and Tierney, W.M. (1991). Validation techniques for logistic regression models.
Statistics in Medicine 10:1213--1226.

Pearce, J., and Ferrier, S. (2000). Evaluating the predictive performance of habitat models
developed using logistic regression. Ecological Modelling 133:225--245.

R Development Core Team (2011). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
\section{RandomForest}
\subsection*{Input Ports}
\begin{itemize}
\item mdsFile (mandatory)
The input data set, consisting of locational data for each sample point and the values of
each predictor variable at those points. This input file is almost always generated by the
upstream steps.
\textbf{Common connections}
\begin{itemize}
\item The mdsFile can be produced by any of MDSBuilder, ModelEvaluationSplit,
ModelSelectionCrossValidation, ModelSelectionSplit, or CovariateCorrelationAndSelection.
\end{itemize}
\item makeBinMap (optional)
Indicate whether to discretize the continuous probability map into presence/absence. See the
ThresholdOptimizationMethod for how this is done. If time is a concern and many models are
to be fit and assessed, maps can be produced after model selection for only the best models
using the Select and Test the Final Model tool. Options are available for producing
Probability, Binary and MESS maps there as well.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item makeProbabilityMap (optional)
Indicate whether a map of predicted values is to be produced for the model fit.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item makeMESMap (optional)
Indicate whether to produce a multivariate environmental similarity surface (MESS) and a map
of which factor is limiting at each point; see Elith et al. 2010 for more details. If time
is a concern and many models are to be fit and assessed, maps can be produced after model
selection for only the best models using the Select and Test the Final Model tool. Options
are available for producing Probability, Binary and MESS maps there as well.
\textbf{Default value} = False (Unchecked)
\textbf{Options}
\begin{itemize}
\item True (Checked)
\item False (Unchecked)
\end{itemize}
\item ThresholdOptimizationMethod (optional)
Determines how the threshold is set in order to discretize continuous predictions into
binary. The threshold is used for evaluation metrics calculated from the confusion matrix as
well as for the binary map. The options, taken directly from the PresenceAbsence package in R,
are:
\begin{itemize}
\item 1: Threshold=0.5
\item 2: Sens=Spec (sensitivity=specificity)
\item 3: MaxSens+Spec (maximizes (sensitivity+specificity)/2)
\item 4: MaxKappa (maximizes Kappa)
\item 5: MaxPCC (maximizes PCC, percent correctly classified)
\item 6: PredPrev=Obs (predicted prevalence=observed prevalence)
\item 7: ObsPrev (threshold=observed prevalence)
\item 8: MeanProb (mean predicted probability)
\item 9: MinROCdist (minimizes distance between ROC plot and (0,1))
\end{itemize}
The threshold calculated for the training portion of the data is applied to the test portion.
If cross-validation was specified, the threshold is calculated separately for each fold from
that fold's training data and applied to the held-out data for that fold.
\textbf{Default value} = 2
\textbf{Options}
\begin{itemize}
\item Any integer from 1 to 9, inclusive
\end{itemize}
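As a rough illustration of how options 2 and 3 behave, the following Python sketch scans a grid of candidate thresholds over a small made-up training set and then applies the chosen threshold to test predictions. It is not the PresenceAbsence implementation: the function names, the data, and the candidate grid are hypothetical, and treating predictions at or above the threshold as presences is an assumption.

\begin{verbatim}
```python
def sens_spec(observed, predicted, threshold):
    """Return (sensitivity, specificity); predictions >= threshold count as presence.

    Assumes both presences (1) and absences (0) occur in `observed`.
    """
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p >= threshold)
    fn = sum(1 for o, p in pairs if o == 1 and p < threshold)
    tn = sum(1 for o, p in pairs if o == 0 and p < threshold)
    fp = sum(1 for o, p in pairs if o == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(observed, predicted, method, candidates=None):
    """Hypothetical helper: choose a threshold from training data.

    method 2 (Sens=Spec):     minimize |sensitivity - specificity|
    method 3 (MaxSens+Spec):  maximize (sensitivity + specificity) / 2
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid

    def score(t):
        sens, spec = sens_spec(observed, predicted, t)
        if method == 2:
            return -abs(sens - spec)
        if method == 3:
            return (sens + spec) / 2
        raise ValueError("only methods 2 and 3 are sketched here")

    # max() keeps the first candidate when several tie for the best score
    return max(candidates, key=score)


# Made-up training data: 5 presences, 4 absences
train_obs = [1, 1, 1, 1, 1, 0, 0, 0, 0]
train_pred = [0.9, 0.8, 0.7, 0.6, 0.3, 0.5, 0.4, 0.2, 0.1]

t_sens_eq_spec = pick_threshold(train_obs, train_pred, method=2)
t_max_sum = pick_threshold(train_obs, train_pred, method=3)

# The threshold from the training data is applied unchanged to the test data
test_pred = [0.8, 0.2, 0.55]
binary_test = [1 if p >= t_max_sum else 0 for p in test_pred]
```
\end{verbatim}

The two methods can select different thresholds from the same data, which is why the choice affects both the confusion-matrix metrics and the binary map.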
\item Seed (optional)
The random number seed used by randomForest. To reproduce results from a previous
randomForest fit, enter the random number seed reported in the textual output from that
model fit. The seed used is always reported in the textual output.
\textbf{Default value} = Randomly Generated
\textbf{Options}
\begin{itemize}
\item Any integer between -2147483647 and 2147483647
\end{itemize}
\item mTry (optional)
The number of predictors randomly sampled as candidates at each split. By default this is
optimized using the tuneRF function so that out-of-bag (OOB) error is minimized. See the
randomForest documentation on the CRAN website for more details.
\textbf{Default value} = Optimized using the tuneRF function so that out-of-bag (OOB) error is minimized.
\textbf{Options}
\begin{itemize}
\item An integer between 1 and the total number of predictors used in model fitting
\end{itemize}
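To give a feel for what tuning mTry against OOB error involves, the Python sketch below steps away from a starting value in both directions and keeps going only while the OOB error improves by a minimum relative amount. It is a simplified illustration of this kind of step search, not the randomForest source. The oob\_error argument is a hypothetical callable standing in for ``fit a forest with this mTry and return its OOB error'', the error curve is made up, and the step factor and improvement cutoff are assumed values.

\begin{verbatim}
```python
import math


def tune_mtry(oob_error, n_predictors, mtry_start, step_factor=2.0, improve=0.05):
    """Simplified step search for the mtry value with the lowest OOB error.

    oob_error: hypothetical callable mapping an mtry value to an OOB error rate.
    Returns (best_mtry, errors), where errors maps each tried mtry to its error.
    """
    errors = {mtry_start: oob_error(mtry_start)}
    for direction in ("left", "right"):
        reference = errors[mtry_start]  # simplification: restart from the start value
        mtry_cur = mtry_start
        while reference > 0:
            if direction == "left":
                mtry_next = max(1, math.ceil(mtry_cur / step_factor))
            else:
                mtry_next = min(n_predictors, math.floor(mtry_cur * step_factor))
            if mtry_next == mtry_cur:
                break  # reached 1 or the number of predictors
            mtry_cur = mtry_next
            errors[mtry_cur] = oob_error(mtry_cur)
            # stop when the relative improvement falls below the cutoff
            if 1 - errors[mtry_cur] / reference < improve:
                break
            reference = errors[mtry_cur]
    best = min(errors, key=errors.get)
    return best, errors


def fake_oob_error(mtry):
    """Made-up error curve with its minimum at mtry = 6 (for illustration only)."""
    return 0.20 + 0.01 * (mtry - 6) ** 2


n_predictors = 10
# floor(sqrt(p)) is the randomForest default mtry for classification
mtry_start = math.floor(math.sqrt(n_predictors))
best_mtry, tried = tune_mtry(fake_oob_error, n_predictors, mtry_start)
```
\end{verbatim}

With this made-up curve the search tries only four values, so a step search of this kind is cheap but can miss the true optimum between steps.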
\item nTrees (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\item nodesize (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item replace (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item maxnodes (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item importance (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item localImp (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item proximity (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item oobProx (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item normVotes (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item doTrace (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\item keepForest (optional)
See the randomForest documentation on the CRAN website for details
http://cran.r-project.org/web/packages/randomForest/index.html.
\textbf{Default value} = randomForest function default
\textbf{Options}
\begin{itemize}
\item See randomForest documentation for valid input
\end{itemize}
\end{itemize}
\subsection*{Output Ports}
\begin{itemize}
\item modelWorkspace
The R workspace where all internal details regarding the fitted model are stored. This is
used by the Select and Test the Final Model module.
\textbf{Common connections}
\begin{itemize}
\item `modelWorkspace' port of SAHMModelOutputViewerCell for viewing the aspatial model output.
\item `modelWorkspace' port of SAHMSpatialOutputViewerCell for viewing the spatial model output in a mini
GIS.
\end{itemize}
\item BinaryMap
If specified using makeBinMap=True, then a surface of binary predictions is produced by
discretizing the probability map based on the selected threshold. This map indicates
whether one could expect each site to be occupied or unoccupied based on the model.
\item ProbabilityMap
If specified using makeProbabilityMap=True, then a surface of predicted values is produced
based on the tiffs in the input .mds file and the fitted model. These can, but do not always,
indicate the probability of finding the species at a given site. For example, if model
calibration is poor, then these will not agree well with the true probabilities, though
discrimination between presences and absences might still be good.
\item ResidualsMap
Model residual plots show the spatial pattern of the model deviance residuals.
Most models assume that residuals are independent, so spatial pattern in the deviance
residuals can indicate a problem with the model fit and with inference based on the fit.
It can, for example, indicate that important predictors were not included in the model, and
the residual map can be compared with the spatial pattern of predictors that were not included.
Whether or not a significant spatial pattern exists in model residuals can at times be
difficult to assess visually. We hope to add correlograms of Moran's I soon. Unfortunately,
statistical tests based on the Moran's I statistic for residuals of binary response models
lack statistical justification and thus cannot be used to test for a significant spatial
pattern (Bivand et al. 2008). See Dormann et al. 2007 for more discussion on evaluation of model
residuals and spatial models that are appropriate for species distribution modeling.
Residual plots can also be used to determine if certain observations contribute
disproportionately to the deviance of the fitted model. For a binary response model,
deviance residuals with absolute values greater than 2 can be indicative of a problem.
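
For reference, and assuming the usual binomial definition (this text does not spell out the formula used), the deviance residual for observation $i$ with observed response $y_i \in \{0,1\}$ and fitted probability $\hat{p}_i$ is
\[
d_i = \mathrm{sign}(y_i - \hat{p}_i)\,
\sqrt{-2\left[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\right]} .
\]
Under that definition, $|d_i| > 2$ means that $-2\log q_i > 4$, where $q_i$ is the fitted probability of the outcome that was actually observed, so $q_i < e^{-2} \approx 0.135$. In other words, the model assigned less than about a 13.5\% chance to what was observed at that location.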
\item MessMap
If specified by selecting makeMESMap=True, then the MESS and MoD surfaces will be produced.
The MESS surface is the multivariate environmental similarity surface and shows how well each
point fits into the univariate ranges of the points for which the model was fit. Negative
values in this map indicate that the point is out of the range of the training data. The
MESS map takes the minimum value of a statistic calculated for each predictor and thus cannot
diagnose hidden extrapolation as one might do using a hat matrix. This surface is only
calculated for variables that are selected in the model selection step within each model
fitting algorithm, so variables that do not significantly affect the occurrence of the
organism over the range of the training data will not be included in the MESS map even
though these predictors might be influential to the organism outside the range in which the
model was fit. Random Forest never drops predictors, so if one wishes to compare the MESS
and MoD maps before and after insignificant predictors were dropped, one can compare the MESS
map of a Random Forest fit to that produced by the other model fitting algorithms, as long as
they were fit using the same dataset. See Elith et al. 2010 for details on how the MESS
map calculations are performed.
\item MoDMap
If specified by selecting makeMESMap=True, then the MESS and MoD surfaces will be produced.
The MoD map is related to the MESS map and indicates which variable was furthest from the
range over which the model was fit for each spatial location. See Elith et al. 2010 for
details on how the MESS map calculations are performed.
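
The MESS and MoD values at a single location can be sketched in Python as follows. This is a minimal illustration of the per-predictor similarity described by Elith et al. 2010, not the code SAHM runs; the function name, the data layout (dictionaries keyed by predictor name), and the example values are hypothetical.

\begin{verbatim}
```python
def mess_at_point(point, reference):
    """Return (mess, mod) for one location.

    point:     dict mapping predictor name -> value at the location
    reference: dict mapping predictor name -> list of training values
    Assumes every predictor has more than one unique training value.
    """
    similarity = {}
    for name, value in point.items():
        ref = reference[name]
        lo, hi = min(ref), max(ref)
        # f: percent of training values strictly below the value at this point
        f = 100.0 * sum(1 for r in ref if r < value) / len(ref)
        if f == 0:
            s = 100.0 * (value - lo) / (hi - lo)   # at or below the training minimum
        elif f <= 50:
            s = 2 * f
        elif f < 100:
            s = 2 * (100 - f)
        else:
            s = 100.0 * (hi - value) / (hi - lo)   # above the training maximum
        similarity[name] = s
    # MoD: the predictor with the lowest similarity; MESS: that lowest value
    mod = min(similarity, key=similarity.get)
    return similarity[mod], mod


# Made-up training values for two predictors
reference = {
    "temp": [5, 10, 15, 20, 25],
    "precip": [100, 200, 300, 400, 500],
}

inside = mess_at_point({"temp": 16, "precip": 150}, reference)
outside = mess_at_point({"temp": 30, "precip": 250}, reference)
```
\end{verbatim}

In this example the second location has a temperature above the training maximum, so its MESS value is negative and temp is reported as the MoD variable.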
\item modelEvalPlot
For binary data this will be a receiver operating characteristic (ROC) curve, which shows the
relationship between sensitivity and specificity as the threshold for discretizing
continuous predictions into presence/absence is varied. The threshold selected using the
specified ThresholdOptimizationMethod is shown. If a model selection test/training split
was specified, the ROC curve for that split will be shown in red, and if a cross-validation split
was specified, ROC curves for each cross-validation fold will be overlaid with box plots
summarizing cross-validation results. If the model fits well, the ROC curve should lie
well above the diagonal line. If there is a strong disparity between
the curve for the training data and either the testing split or cross-validation standard
deviation curves, this can be indicative of model overfitting. These plots and the
evaluation metrics based on the confusion matrix describe the model's ability to discriminate
between presence and absence points. The AUC value, or area under the ROC curve, is the
probability that the model will rank a randomly chosen presence observation higher than a
randomly chosen absence observation. For count data this display will show several standard
plots for assessment of model residuals.
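
The ranking interpretation of AUC can be checked directly on a small example. The Python sketch below compares every presence with every absence and counts how often the presence receives the higher prediction. The function name and data are hypothetical, and counting ties as one half is an assumed convention rather than something stated in this manual.

\begin{verbatim}
```python
def auc_by_ranking(observed, predicted):
    """AUC as the fraction of presence/absence pairs ranked correctly.

    Ties between a presence and an absence count as 0.5 (assumed convention).
    Assumes at least one presence (1) and one absence (0) in `observed`.
    """
    presences = [p for o, p in zip(observed, predicted) if o == 1]
    absences = [p for o, p in zip(observed, predicted) if o == 0]
    wins = 0.0
    for pres in presences:
        for absn in absences:
            if pres > absn:
                wins += 1.0
            elif pres == absn:
                wins += 0.5
    return wins / (len(presences) * len(absences))


# Made-up data: one presence (0.4) is ranked below one absence (0.6)
observed = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = auc_by_ranking(observed, predicted)  # 8 of 9 pairs ranked correctly
```
\end{verbatim}

Because only the ordering of predictions matters here, AUC measures discrimination and says nothing about whether the predicted values are well calibrated.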
\item ResponseCurves
Model response curves show the relationship between each predictor included in the model
and the fitted values, while holding all other predictors constant at their means. MARS
response curves are shown on a logit scale, thus the response axis will not necessarily be
bounded on the 0 to 1 interval. BRT response curves will show response surfaces for any
interaction terms included in the final model along with the percent relative influence.
\item Text\_Output
This file contains a summary of the model fit. The information contained here includes the
number of presence observations (counts equal to or greater than 1 for count models), the
number of absence points, and the number of covariates that were considered by the model
selection algorithm. Note that all of these can differ from the numbers in the original .mds
file due to incomplete records being deleted and predictors with only one unique value being
removed. The random number seed, if applicable, is recorded along with a summary of the
model fit, which allows completely reproducible results. Evaluation statistics are
reported for the data used to fit the model as well as for the test or cross-validation
split if applicable. References for how to interpret most of these are ubiquitous in the
literature, but it is worth mentioning that interpretation of the calibration statistics is
described by Pearce and Ferrier 2000 as well as Miller et al. 1991. Most metrics reported
here can also be found in related graphical displays.
\end{itemize}
References:
Bivand, R.S., Pebesma, E.J., and Gomez-Rubio, V. (2008). Applied Spatial Data Analysis with R.
Springer New York, NY.
Dormann, C.F., McPherson, J.M., Araujo, M.B., Bivand, R., Bolliger, J., et al. (2007). Methods to
account for spatial autocorrelation in the analysis of species distributional data: a
review. Ecography 30:609--628.
Elith, J., Kearney, M., Phillips, S. (2010). The art of modeling range-shifting species. Methods
Ecol Evol 1:330--342.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18--22.
Miller, M.E., Hui, S.L., Tierney, W.M. (1991). Validation techniques for logistic regression models.
Statistics in Medicine 10:1213--1226.
Pearce, J., and S. Ferrier. (2000). Evaluating the predictive performance of habitat models
developed using logistic regression. Ecological Modelling 133:225--245.
R Development Core Team (2011). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
\section{MAXENT}