Readme.Rmd

---
title: "The invacost R Package: Global Costs of Biological Invasions"
author: "Leroy B, Kramer AM, Vaissière AC, Kourantidou M, Courchamp F & Diagne C"
date: "`r Sys.setlocale('LC_TIME', 'C');format(Sys.time(), '%d %B, %Y')`"
output:
  github_document:
    toc: true
    toc_depth: 2
  html_document:
    df_print: kable
    toc: yes
    toc_float: true
---

```{r hiddenload, echo=FALSE, results="hide", message=FALSE, warning=FALSE}
# # knitr::opts_chunk$set(warning=FALSE,
#                       message=FALSE)
library(invacost)
data(invacost)

```

# Introduction 

The invacost R package provides the InvaCost database in R with several 
functions to analyse monetary costs of invasive species. 

There are two main methods developed in this package to estimate the costs
associated
to biological invasions.

 - *Approach based on available estimates*: the first approach consists in  using
 cost estimates, as they appear in the collected materials, that occurred over
 specific intervals of time. From this approach you can obtain the 
 cumulative and 
 average annual costs of biological invasions for different time intervals.  
 This approach is generalisable to any subset of the InvaCost database,
 regardless of data quantity. However, it does not account for the temporal dynamics
 of costs, and may thus underestimate the costs of invasions, especially for
 recent years for which we likely have incomplete data.
 
 - *Modelling approach*: the second approach consists in **estimating** the
 long-term trend in annual cost 
 values with a statistical modelling approach. Because it includes the dynamic 
 nature of cost values, it is probably the most robust approach to estimate 
 average annual costs. However, it requires more data and for data-poor cases it
 will not work or provide inadequate results. We fit several models because the
 trend of costs over time is not necessarily linear.
 This approach requires decisions about  models to keep or remove, on the basis
 of the quality of model fit, and/or of our *a priori* assumptions. Note that this
 modelling approach is not designed for extrapolations because there is no
 certainty that the underlying factors of costs will have similar trends in the
 future.
 
# Acknowledgements

Thanks a lot to Anna Turbelin for her help in correcting mistakes in this 
tutorial, and to Phillip Haubrock for agreeing to be the "Guinea pig" of 
various test versions of the invacost R package!
 
# Installation


**The package requires at least R 4.0.0.** 

The package can now be officially installed from CRAN.

``` {r install, eval = FALSE}
# install.packages("devtools")
# devtools::install_github("Farewe/invacost")
install.packages("invacost")
```

# Recent changes


<details> 
  <summary><b>Click here</b> to see a summary of recent changes</summary>
  
*February 2022*

- Updated to invacost *version 4.1*
- Added new data integrity tests / checks in the package. These checks
were used to build invacost version 4.1.

*June 2021*

- Updated the database version in the package to *version 4.0*.


*December 2020*

- *The main functions of the package have changed names.* `calculateRawAvgCosts`
is now `summarizeCosts`, and `costTrendOverTime` is now `modelCosts`. Older
function names are still functional but will throw a warning, and will no longer
be updated. Please update your scripts accordingly.

- *Default values in `summarizeCosts` (formerly `calculateRawAvgCosts`) and 
`modelCosts` (formerly `costTrendOverTime`) have changed.* First, the last 
year of analysis will, by default, be the last year of your input data. Naturally,
you can still define it manually. Second, by default, `modelCosts` will no 
longer exclude data from calibration (formerly, `costTrendOverTime` excluded
by default data from 2015 and above). If you wish to keep this threshold as
in former versions of the packages, you can still alter it by manually setting 
`incomplete.year.threshold = 2015`.

- A new function is available (`getInvaCostVersion`) to rollback the database to
former major releases. Current options are: `1.0`, `2.0`, `2.1`, `3.0`, `4.0` &
`4.1`. This is
to ensure reproducibility of analyses performed on former versions of the 
database. Naturally, any new analysis should be done on the the latest version
shipped with the package (`data(invacost)`).
</details>

For more detailed information about version of the invacost R package,
see the file "NEWS" in the package files.

# Basic steps to get started
## How to load the data in R?

```{r i2, cache = TRUE}
library(invacost)
data(invacost)
```
The database is now loaded in R memory under a very original name: `invacost`.

```{r i3, cache = TRUE}
# Number of rows (cost estimates)
nrow(invacost)
# Number of columns (database descriptors)
ncol(invacost)
```

There are `r ncol(invacost)` columns in the INVACOST database, and I strongly 
recommend you get familiarised with them by inspecting them individually and
reading the [paper describing the database (and its fields) available here]
(https://doi.org/10.1038/s41597-020-00586-z). The updated versions of the 
database (and its fields) are stored in a [figShare repository accessible here](https://doi.org/10.6084/m9.figshare.12668570).

## How to distinguish between invacost v1 and subsequent versions?

Since the first official release of InvaCost in 2019, there have been several
major releases, some of which were internal, some of which were public. To 
ensure reproducibility of analyses performed on older releases, we provide a 
function in the database to fetch older releases. we expect it will also help
to understand how and why results may change with new additions and corrections
to this living database.

As of now, we offer the possibility to roll back to version 1.0, 2.0 and 2.1. 

<details> 
  <summary> <b>Click here</b> for an example on how to roll back to V1 or V2.1</summary>
  
```{r rollback, cache = TRUE}
# # Version 1.0
# invacost_1.0 <- getInvaCostVersion("1.0")
# dim(invacost_1.0)
# 
# # Version 2.1
# invacost_2.1 <- getInvaCostVersion("2.1")
# dim(invacost_2.1)
```
</details>

If you want, the database can be automatically saved on your hard drive in csv
format, by specifying a file name with argument `destination_file`.

## Where are the monetary values I should use?

There are several columns that contain monetary values (they generally contain
`cost` in their names). We stored raw information from the papers that is often
difficult to compare because of
different currencies, different years and different periods of time over which
costs or expenditure occurred
(e.g. 1 year vs 10 years). Therefore, **to be able to choose which monetary values 
to use, you need to understand some of the basic concepts about the conversion
of monetary values** (raw costs or costs per year) into a standard currency (i.e.
US$ for the year 2017). **Hereafter, we explain the process and provide 
recommendations on which monetary values to use.**

The conversion is based on two main steps. The first includes conversion of the
cost estimates from local currencies to US$. The second includes conversion the 
cost estimate (previously converted to US\$) from the year it was estimated to 
2017 US\$. 

### Step 1: Currency conversion

For the currency conversion step (from local currency to US\$), it is possible
to use two conversion factors: 

 - [The official market exchange rate](https://data.worldbank.org/indicator/PA.NUS.FCRF?end=2017&start=1960).
 
 - The Purchasing Power Parity (PPP, local currency unit per US\$, calculated as 
 an annual average), that allows to take into account the differences in price 
 levels between countries. We used data provided [by the World Bank](https://data.worldbank.org/indicator/PA.NUS.PPP?end=2017&start=1990), or
 [by the OECD](https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm)
 when information was not retrievable through the World Bank database. 
 Occasionally, we had to deal with published costs that were expressed in currency
 that was different from the country where the costs were estimated (e.g.
 published cost in African countries expressed in US or Canadian \$). Thus, prior
 to using PPP as a conversion index, we had to perform a retro-conversion by
 multiplying the original cost estimate by the official market exchange rate 
 (local currency unit per currency unit used). Note that, it was not possible to
 perform the PPP-based standardisation for all cost estimates as PPP data
 do not exist for all countries and/or specific periods (we mentioned `NA` in the
 database when such information was missing).
 
### Step 2: Conversion of the year of currency

For the **year of currency conversion step (from US\$ from a given year to 2017 US\$)**, we use an inflation factor that reflects the evolution of the value of 
the US\$ since the year of cost estimation. The inflation factor is computed by 
dividing the [Consumer Price Index](https://data.worldbank.org/indicator/FP.CPI.TOTL?end=2017&start=1960) 
(CPI, which is a measure of the average change
over time in the prices paid by consumers for a market basket of consumer goods 
and services) of the USA for 2017 by the CPI of the USA for the year of the cost
estimation. 

The standardization of the cost estimates into 2017 \$US is thus a multi-step 
procedure for which the following formula is used:
**C~e~ = (M~V~ /C~F~) x I~F~**

with **C~e~** = Converted cost estimate (to 2017 US dollars based on exchange rate or 
Purchase Power Parity), **M~V~** = Cost estimate (either the 
`Raw_cost_estimate_original_currency` extracted from analysed paper or the
`Cost_estimate_per_year_original_currency`
transformed by us), **C~F~** = Conversion factor (either the official market exchange 
rate or the purchasing power parity, in US dollars), **I~F~** = Inflation factor since
the year of cost estimation, calculated as CPI~2017~ / CPI~y~ with CPI corresponding 
to the Consumer Price Index and **y** corresponding to the year of the cost 
estimation (`Applicable_year`).


### Choosing monetary values in InvaCost

In InvaCost, we provide several columns with monetary values expressed in
2017 USD based on the exchange rate or PPP - they contain either `exchange_rate`
or `PPP` in their names.


<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-fymr">Purchase Power Parity (PPP)</th>
    <th class="tg-fymr">Exchange Rate (ER)</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-5pln">Definitions</td>
    <td class="tg-y698"><a href="https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm" target="_blank" rel="noopener noreferrer">OECD definition of PPP</a></td>
    <td class="tg-y698"><a href="https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm" target="_blank" rel="noopener noreferrer">OECD definition of ER</a></td>
  </tr>
  <tr>
    <td class="tg-jq0u">Useful resources to choose between PPP and ER</td>
    <td class="tg-c6of" colspan="2"><a href="https://doi.org/10.1017/CBO9780511754159.005" target="_blank" rel="noopener noreferrer">- Sarno, L., Taylor, M., &amp; Frankel, J. (2003). Purchasing power parity and the real exchange rate</a>. In The Economics of Exchange Rates (pp. 51-96). Cambridge: Cambridge University Press<br><a href="https://doi.org/10.1257/0895330042632744" target="_blank" rel="noopener noreferrer">- Taylor, A. M., &amp; Taylor, M. P. (2004). The purchasing power parity debate.</a> Journal of economic perspectives, 18(4), 135-158. <br>- <a href="https://doi.org/10.1787/oecd_papers-v6-art9-en" target="_blank" rel="noopener noreferrer">OECD. (2006). A Contribution to the "Purchase Power Parity vs. Market Exchange Rates for Use in Long-Term Scenarios" Discussion.</a> OECD papers, 19pp.<br><a href="https://www.imf.org/external/pubs/ft/fandd/2007/03/basics.htm" target="_blank" rel="noopener noreferrer">- Callen T. (2007) PPP Versus the Market: Which Weight Matters? Finance and Development 44:1 </a><br><a href="https://www.worldbank.org/en/programs/icp/brief/VC_Uses" target="_blank" rel="noopener noreferrer">- PPPs for policy making: a visual guide to using data from the ICP - Uses and limitations of PPPs</a><br><a href="https://doi.org/10.1596/978-1-4648-1530-0" target="_blank" rel="noopener noreferrer">- World Bank. (2020) Purchasing Power Parities and the Size of World Economies: Results from the 2017 International Comparison Program.</a>Washington, DC: World Bank.<br><a href="https://thedocs.worldbank.org/en/doc/872661487091786258-0050022017/original/19Chapter18.pdf" target="_blank" rel="noopener noreferrer">- McCarthy, P. (2013) Extrapolating PPPs and comparing ICP benchmark results</a>. Measuring the Real Size of the World Economy, Chapter 18, 473-506<br></td>
  </tr>
  <tr>
    <td class="tg-5pln">Column to use in invacost</td>
    <td class="tg-qoao"><span style="color:#905">`Cost_estimate_per_year_2017_USD_PPP`</span></td>
    <td class="tg-qoao"><span style="color:#905">`Cost_estimate_per_year_2017_USD_exchange_rate`</span></td>
  </tr>
  <tr>
    <td class="tg-jq0u">Examples</td>
    <td class="tg-c6of">PPP are more robust when performing multiple-country comparisons, as long as all the investigated countries have available PPP data (e.g., <a href="https://doi.org/10.1038/s41467-020-15586-1" target="_blank" rel="noopener noreferrer">Schaffner et al. 2020</a>)</td>
    <td class="tg-c6of">To analyse the economic impacts of invasive alien species at the global scale, the absence of PPP information for many countries and time periods requires to rely on exchange rates (e.g., <a href="https://doi.org/10.1038/s41586-021-03405-6" target="_blank" rel="noopener noreferrer">Diagne et al. 2020</a>)</td>
  </tr>
</tbody>
</table>

Therefore, because our objective here is to illustrate the costs of biological 
invasions at the global scale, we will, onwards, in the sections that
follow, use exclusively the column: `Cost_estimate_per_year_2017_USD_exchange_rate`.

Replace the code with `Cost_estimate_per_year_2017_USD_PPP` if your study
requires using Purchase Power Parity instead of Exchange Rates.


### Filtering out estimates with missing information

There is a small number of costs for which we could not derive these estimates
(e.g. studies that lacked useful information of the periods over which
costs occurred), so we have to exclude them of our analyses.

```{r z4, cache = TRUE}
if(any(is.na(invacost$Cost_estimate_per_year_2017_USD_exchange_rate)))
{
  invacost <- invacost[-which(is.na(invacost$Cost_estimate_per_year_2017_USD_exchange_rate)), ]
}

# Number of rows (cost estimates)
nrow(invacost)
```


In addition, there can be costs for which we have **inadequate time 
period information**: some studies omitted to provide time 
periods, which can cause tremendous biases when analysing the temporal trends of
costs. Sometimes, this is not problematic as we can safely guess when costs 
occurred (see section [How do I know when costs occurred](#costtime)). 
However, other times, costs spread over 10+ and
sometimes 100+ years can be presented
as occurring over a single year or require highly speculative guesses about when 
they occurred. As a general rule, it is much easier and safer to estimate the
ending year than the starting year. 

Therefore, it is useful to verify if there are estimates that are known to occur
over more than 1 year with no information on
the starting date whatsoever in your subset of
InvaCost. **Note: we have progressively corrected the database such such entries
should no longer occur in InvaCost version 4.1 and later.**. If you find such 
estimates, we recommend to remove them, as it is not possible to safely estimate 
if they occurred over a single or multiple 
years. 

The following block of code does all of that automatically, but will do nothing 
if there are no such cases ;)

```{r filter2}
# Uncertain starting periods
uncertain.starts <- invacost[which(invacost$Time_range == "Period" &
                                     is.na(invacost$Probable_starting_year)), ]
# Number of estimates without adequate information about starting year
nrow(uncertain.starts)

# No info about whether cost was annual or over a period
unknown.periods <- invacost[which(is.na(invacost$Time_range)), ]
# Number of estimates without adequate information about period
nrow(unknown.periods) 

# Applying the filter
if(nrow(uncertain.starts) + nrow(unknown.periods) > 0)
{
  invacost <- invacost[-which(invacost$Cost_ID %in% c(uncertain.starts$Cost_ID,
                                                    unknown.periods$Cost_ID)), ]
}

# Number of rows after filtering
nrow(invacost)
```

## How do we filter out unreliable costs?

You have to be aware of the limitations of such a database. This database is
the most up-to-date and comprehensive compilation of both the published
and grey literature on the 
monetary costs of biological invasions. Therefore, it includes sound estimations
of monetary costs of invasive species, with detailed sources and reproducible 
methodology; but it also includes unsourced and irreproducible "*guestimates*",
as those are reported in the literature.
Therefore, we want to apply filters on the database to avoid as much as possible 
any such unreliable cost estimates that can not easily be reproduced or
originate from non-peer reviewed sources. Furthermore, the database also includes 
observed and potential costs, so depending on our objectives we may or may not
want to filter out potential costs.

- **Reliability**: There is a standard field in the database called `Method_reliability`,
which provides a simple yet objective evaluation of the reliability of cost 
estimates. It uses the following decision tree:
`r knitr::include_graphics("./Readme_files/figure-html/reliability.png")`
Red means categorised as unreliable source, green means categorised as reliable
source. This
`Method_reliability` descriptor has some limitations. The most important one 
is that we decided to not evaluate the methodology for peer-reviewed articles 
and official reports, assuming that this was done at an earlier stage (i.e. 
peer-review, prior to publication). We chose such an objective criterion, rather
than assessing the reliability of methods during the process of filling the 
database, because we observed divergence in reliability decisions during 
experiments among members of the 
INVACOST team. We also identified that depending on the study objective, 
different decisions about reliability could be made. Therefore, this
`Method_reliability` descriptor should be considered as a first approximation of 
cost reliability, and you should decide whether or not you want to eliminate 
papers on the basis of the lack of reproducibility of their methodologies.
To do that, take time to investigate the `Details` field (especially for cost
values that you deem suspiciously high) and potentially go back to the source to
make your decision. For an example on how to do that, take a look at the 
"Determining cost estimate reproducibility" section in 
[Bradshaw et al. 2016](https://www.nature.com/articles/ncomms12986#Sec8). 
**Important update as of InvaCost version 3.0:** The database now also includes a 
`Method_reliability_refined` descriptor (not exhaustive as of now) which 
includes an expert-based assessment of reliability. It can be used as an 
alternative or as a complementary approach to the standard `Method_reliability` 
descriptor.

```{r filt1}
unique(invacost$Method_reliability)
```

- **Observed vs. Potential costs**: The `Implementation` field in the database
documents whether the costs correspond to *Observed* or *Potential* costs. 
Choosing to keep one or both of them depends on your study objectives. 
In addition, costs can also be based on
direct *reports* or *estimations*, or can be based on *extrapolations*: this is 
documented in the `Acquisition_method` field. Extrapolation does 
not necessarily mean *potential*: some Observed costs may have been extrapolated
from a reduced spatial extent. Below is a table showing the number of cases
of *extrapolations* and *reports/estimation* for *Observed* and *Potential* costs.  As 
you can see, the majority of Observed costs are based on *reports* / 
*estimations*; yet a few are based on *extrapolations.*
```{r obs}
table(invacost$Acquisition_method, invacost$Implementation)
```


For the rest of this tutorial, we will be working only on costs categorised as
*High* in `Method_reliability` and "Observed" in `Implementation`:

```{r filter1}
invacost <- invacost[which(invacost$Method_reliability == "High"), ]
invacost <- invacost[which(invacost$Implementation == "Observed"), ]

# Number of rows after filtering
nrow(invacost)
```


## How do I know when costs occurred? {#costtime}

A crucial aspect in analysing cost values is to know the periods of time
over which costs have occurred. Indeed, knowing the period over which a cost occurred
allows to derive cumulative costs and to estimate average annual costs; it also 
enables temporal analyses of costs.  
We have stored information on the periods of time over which cost occurred
in two columns: `Probable_starting_year` and `Probable_ending_year`.

However, this information was not readily available in a
substantial portion of the papers we compiled in the database: for
`r nrow(invacost[which(is.na(invacost$Probable_starting_year) |
                      is.na(invacost$Probable_ending_year)), ])` out of 
`r nrow(invacost)` papers (`r round(nrow(invacost[which(is.na(invacost$Probable_starting_year) |
                      is.na(invacost$Probable_ending_year)), ]) / nrow(invacost), 3) * 100`
                      % of papers), this information was not available for at
least one of the two columns.
                      
Therefore, for papers for which it was not available, we made 
educated guesses on the probable starting and ending years.
These educated guesses were based on conservative rules
(e.g., if no duration information was provided,
then the impact was reported of one year only). These estimated starting and
ending years are available in
two new columns in the database called `Probable_starting_year_adjusted` and 
`Probable_ending_year_adjusted`.


As it stands now, each cost has a starting and an ending year, and schematically
it looks like this:


| Cost ID    | Species      | Annual Cost | Starting  year |  Ending year |
|------------|:--------------|--------:|-----------:|--------:|
| 1          | Sp 1         | 100    | 1998      | 2001   |
| 2          | Sp 2         | 15     | 2005      | 2007   |
| 3          | Sp 3         | 3      | 1981      | 1981   |

However, to properly analyse the temporal trends of costs, we need to *expand* 
it, to make it look like this:

| Cost ID    | Species       | Annual Cost | Year |
|------------|:--------------|------------:|-----:|
| 1          | Sp 1          | 100         | 1998 |
| 1          | Sp 1          | 100         | 1999 |
| 1          | Sp 1          | 100         | 2000 |
| 1          | Sp 1          | 100         | 2001 |
| 2          | Sp 2          | 15          | 2005 |
| 2          | Sp 2          | 15          | 2006 |
| 2          | Sp 2          | 15          | 2007 |
| 3          | Sp 3          | 3           | 1981 | 


To do this, we use the function `expandYearlyCosts`, to
which we provide the starting and ending year columns. It will store the
years over which monetary costs have occurred in a column named `Impact_year`. 
In this example, expanded database is here stored in an object called 
`db.over.time`.

```{r i2y, cache = TRUE}
# Expanding and formating the database
db.over.time <- expandYearlyCosts(invacost,
                                  startcolumn = "Probable_starting_year_adjusted",
                                  endcolumn = "Probable_ending_year_adjusted")
# Let's see some columns
head(db.over.time[, c("Cost_ID", "Species",
                      "Cost_estimate_per_year_2017_USD_exchange_rate",
                      "Impact_year")])
```

## How complete are our monetary data?

It is impossible to evaluate the absolute degree of completeness of our InvaCost
datbase - we know that we lack data for many taxonomic groups and so are data
from many places around the globe (most cost data available in InvaCost are 
located in North America and Europe and are cost data primarily available in 
English and to a lesser extent in other languages). You have to be aware of
these potential biases and remember them when interpreting data/analyses.

There is, however, a temporal bias that we can at least evaluate. Indeed, we 
can expect that there is delay between the timing at which the economic impact 
of an invasive species occurs, and the timing at which people will start 
estimating the cost of the impact, which will later be followed by publication
in a report or journal.

We can grasp an idea of this delay by looking at the difference between 
`Impact_year` and `Publication_year` in the expanded database, `db.over.time`.


```{r lag1, cache=TRUE, fig.height = 4, fig.width = 12}
# Calculating time lag
db.over.time$Publication_lag <- db.over.time$Publication_year - db.over.time$Impact_year

# Make a nice boxplot of the time lag
ggplot(db.over.time,
       aes(y = Publication_lag)) +
  geom_boxplot(outlier.alpha = .2) +
  ylab("Publication lag (in years)") + 
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.y = element_blank(),
        axis.text = element_text(size = 16),
        axis.title = element_text(size = 18)) +
  scale_y_continuous(breaks = c(-25, 0, 25, 50, 75, 100),
                     labels = c(-25, 0, 25, 50, 75, 100)) +
  xlab("") +
  coord_flip()
```

*Note that the few occurrences of publications before economic impacts (negative
lag) corresponded to planned budgets over specific periods expanding beyond 
the publication year.*

Our suggestion is to use the quartiles of publication lag as an indication of 
cost data completeness.

The first quartile indicates the delay to reach 25% completeness; 
the median indicates the delay to reach 50% completeness; and the third quartile
indicates the delay to reach 75% completeness.

Let's see these delays in practice:

```{r lag4.1, cache=TRUE}
quantiles <- quantile(db.over.time$Publication_lag, probs = c(.25, .5, .75))
quantiles
```


The median delay between impact and publication of impact was 
**`r median(db.over.time$Publication_lag)` years**. In other words, **for the last 
`r median(db.over.time$Publication_lag)` years, we can expect to have less than 50% 
of the economic impacts**. 

Likewise, we can say it takes approximately `r quantiles[1]` year to have 25% of the 
cost data for any year, and `r quantiles[3]` years to reach at least 75% of 
cost data.

Hence, any analysis on recent years will be based on incomplete data and is 
highly likely to provide an underestimation of actual costs.

It is up to you to determine how you desire to include this information in your 
analyses; here, we will provide examples of how we suggest doing that.


One question that may arise for people working on specific subsets of the 
database (e.g., only one taxon or one country) is whether you should evaluate 
the time lag on your subset. *I would recommend to avoid that, because your 
subset may be too incomplete to provide a decent estimation of the time lag.
**Therefore, we would suggest to evaluate the time lag only on the global dataset, 
as we did here.***

Last, note that there may be a correlation between the time lag and the 
magnitude of cost estimates. Indeed, direct reports of small costs tend to be
aired earlier than large estimates of invasion impacts. It is an important
issue to consider when investigating the time lag. For example, the quantiles
of time lag for all costs above US\$ 1 million annually are `r quantile(db.over.time$Publication_lag[db.over.time$Cost_estimate_per_year_2017_USD_exchange_rate > 1e6], probs = c(.25, .5, .75))` years.


# Approach based on available estimates: cumulative and average costs of invasions

## Basic usage
The first method to analyse monetary costs of invasive species consists in
calculating the observed cumulative and average costs - as they appear in the 
collected materials - over a specific period 
of time, or at different time intervals.

We implemented this method in the function `summarizeCosts`. It will 
calculate the cumulative and average costs for the entire period, as well as
for different time intervals (by default, 10-year intervals).

```{r raw1, cache=TRUE}
obs.costs <- summarizeCosts(db.over.time,
                            maximum.year = 2020)
```

We can get a summary of the results by typing the name of the object in the 
console
```{r raw2, cache=TRUE}
obs.costs
```

And we can have a graphical output with:
```{r raw3, cache=TRUE, fig.width=8, fig.height = 8}
plot(obs.costs)
```

This graph represents the observed annual costs of biological invasions over 
time. Points are total annual costs for every year (i.e., all individual costs
for a specific year are summed). Horizontal bars represent the average annual
cost for a specific interval of time (here, 10-year intervals). Notice how 
average annual costs can sometimes be driven by a limited number of high-cost 
years. The dashed line connects average annual cost of each time interval at mid-years
(black dots). The horizontal dotted line represents the average cost over the 
entire period.

The average annual cost seems to suspiciously drop for the most recent years -
most likely because of the time lag between the occurrence of a cost and its
reporting. Therefore, caution would be required in interpreting the values of 
most recent years.

A good way to illustrate this issue would be to look at the number of estimates
per year.

```{r raw3.}
ggplot(obs.costs$cost.per.year,
       aes(x = year, y = number_estimates)) +
  geom_point() +
  ylab("Number of estimates") +
  xlab("Year") +
  theme_minimal()
```

We can see that the number of estimates increases exponentially, but this
increase seems to slow down/drop after 2015.

We can refine this graph to add the relationship with annual costs:

```{r raw3..}
ggplot(obs.costs$cost.per.year,
       aes(x = year, y = number_estimates,
           size = cost)) +
  geom_point() +
  ylab("Number of estimates") +
  xlab("Year") +
  theme_minimal()
```

**Note, however, that the number of estimates in itself is not a meaningful 
index to describe the sampling effort. Indeed, a single estimate can be 
a complete synthesis over a country, and thus be complete in itself, whereas
many estimates can be direct reports of small localized costs from grey 
litterature, and altogether still represent an incomplete picture of overall
costs.** Notably, recent years have been characterized by a high number of
direct reports from the grey literature, which can result in a false sense of
completeness when only looking at the number estimates (e.g., 
[Angulo et al. 2021](https://doi.org/10.1016/j.scitotenv.2020.144441),
[Renault et al. 2021](https://doi.org/10.3897/neobiota.67.59134)). 


Use this information, in combination with the quantiles of
completeness calculated above, to decide how you want to interpret the last years,
and whether or not you want to omit them in further analyses. Remember that
no choice will be objective here, but also that objectivity does not emerge from
the absence of choice (keeping biased data is a choice in itself). 
We would recommend a transparently explained choice, along with e.g. a
sensitivity analysis.


We can access the content of the output object with
```{r raw3.1, cache=TRUE}
str(obs.costs)
```

Notice that the expanded database used to calculate costs has been stored in
the object, in a slot called `cost.data`. 
This is especially important for reproducibility: in case you
decide to publish your work, you can provide this R object which has the exact 
copy of your specific version/filters of the database.

There are also some other important elements in this object:

 - `parameters`: provides arguments you chose and basic information about your
 dataset
 - `year.breaks`: your time intervals
 - `cost.per.year`: costs for every year in the data with number of estimates
 - `average.total.cost`: contains cumulative and average annual costs for the
 entire time period
 - `average.cost.per.period`: contains cumulative and average annual costs
 for each time interval
 
You can access each element with the `$` sign; for example for the costs
for all time intervals:

```{r raw3.2, cache=TRUE}
obs.costs$average.cost.per.period
```

## Customising parameters

There are two main parameters to customize:

 - **beginning** (`minimum.year`) and **ending year** (`maximum.year`) of the entire 
 time period. For example in our
 analyses for the Nature paper ([Diagne et al. 2021](https://doi.org/10.1038/s41586-021-03405-6))
 we chose to start in 1970, because data for the
 1960s are scarce and uncertain.
```{r raw4, cache=TRUE}
obs.costs2 <- summarizeCosts(db.over.time,
                             minimum.year = 1970,
                             maximum.year = 2017)
obs.costs2
```
The function tells you how many values were removed from the 
dataset because they were outside the 1970-2017 time periods. 

 - **time intervals**: set them with the arguments `year.breaks`, where you specify
 the starting year of each interval. For example, if your specify 
 `year.breaks = c(1970, 1980, 1990, 2000, 2010, 2017)`, then intervals will be
 [1970-1979], [1980-1989], [1990-1999], [2000-2009], [2010-2017]
 
```{r raw4.1, cache=TRUE}
# let's plot 20-year intervals
obs.costs3 <- summarizeCosts(db.over.time,
                                  minimum.year = 1960,
                                  maximum.year = 2017,
                                  year.breaks = seq(1960, 2017, by = 20))
plot(obs.costs3)
```

## Customising graphs

There are two methods to customise graphs. 

- The first one is to use the 
standard ggplot produced by the package and adding things or changing 
parameters. This method is easy to implement but you cannot change everything
(e.g. adjust the colors/shapes of points is not possible). This is what we
will see here. See the help here: `?plot.invacost.costsummary`

- The second one is to make your own ggplot from the output object. It is 
more difficult to implement if you are not familiar with graphs in R - but it
will be fully customisable. 
[Take a look at the scripts from our main paper](http://borisleroy.com/invacost/global_invasion_costs_scripts.html) 
to see how to that.


There are two base plots provided with the package; you have already seen the 
default, and here is another one:

```{r raw5, cache=TRUE}
plot(obs.costs,
     plot.type = "bars")
```

You can also remove the log10 scale:

```{r raw5.1, cache=TRUE}
plot(obs.costs,
     plot.type = "bars",
     cost.transf = NULL)
```

To customise parameters using the standard ggplot produced by the package,
you will have to set the argument `graphical.parameters = "manual"`.

```{r raw5.2, cache=TRUE}
# Store the plot in object p1 to customize it afterwards
p1 <- plot(obs.costs,
           graphical.parameters = "manual")

# Show the graph in its initial state
p1
```

You see that when we specify `graphical.parameters = "manual"`, all the cosmetic choices
we made in the function are removed. You can now choose them by yourself; here 
is a starting point:
```{r raw5.3, cache=TRUE}
# Customize p1 now
p1 <- p1 +
  xlab("Year") + 
  ylab("Average annual cost of invasions in US$ millions") +
  scale_x_continuous(breaks = obs.costs$year.breaks) + # X axis breaks
  theme_bw() + # Minimal theme
  scale_y_log10(breaks = 10^(-15:15), # y axis in log 10 with pretty labels
                labels = scales::comma) +
  annotation_logticks(sides = "l") # log10 tick marks on y axis

# Let's see how it goes
p1
```

Your turn to play with graphs now!

# Modelling approach: estimating the average annual cost of invasions

The second method we provide in the package consists in estimating the long-term
trend in annual cost with different modelling techniques. In other words, we fit
a model to predict costs as a function of years. Then, we can inspect the 
different models and the shapes of cost trends over time, and grasp an idea
of dynamics of invasion costs.

This approach requires more data than the approach based on available estimates 
and for data-poor cases it
will not work or provide inadequate results. 


## Correction for data incompleteness due to publication lag

Because the average annual monetary cost of invasive species will be determined
by the trend over time, we should consider applying a correction to ‘incomplete’ 
years. 

This is no simple decision and your choice will have to be justified.

There are two different methods we can apply here, either independently or in
combination.

- The first method consists in applying a threshold of incompleteness to remove
the most incomplete years. For example, remove 
from calibration all years with < 75% of data; threshold = `r quantiles[3]` years.

- Another possibility includes weighting incomplete years to reduce their importance in
the estimation of average annual costs of invasions. 
This approach can be justified if we do not want to underestimate the average annual
cost of invasions. However, be advised that reviewers may disagree with your 
choices of weights, so justify them carefully, and consider analysing the 
difference with / without weights.


### Example of a vector of incompleteness weights 

An example
to reduce the negative impact of the incompleteness of recent years
would be to apply weights proportional to their degree of incompleteness. For
example, apply the following set of rules:
• completeness ≤ 25%: exclusion 
• 25% < completeness ≤ 50%: weight = 0.25 
• 50% < completeness ≤ 75%: weight = 0.50 
• completeness > 75%: weight = 1


Remember that we stored quantiles in the beginning of this tutorial, so we can
access them now to know to what years they correspond:

```{r incomp}
quantiles
```

In the next lines of code we will create a vector of weights for each year, which we
can provide to the function later on.

```{r lag5, cache = TRUE}
# Creating the vector of weights
year_weights <- rep(1, length(1960:2017))
names(year_weights) <- 1960:2017

# Assigning weights
# Below 25% the weight does not matter because years will be removed
year_weights[names(year_weights) >= (2017 - quantiles["25%"])] <- 0
# Between 25 and 50%, assigning 0.25 weight
year_weights[names(year_weights) >= (2017 - quantiles["50%"]) &
               names(year_weights) < (2017 - quantiles["25%"])] <- .25
# Between 50 and 75%, assigning 0.5 weight
year_weights[names(year_weights) >= (2017 - quantiles["75%"]) &
               names(year_weights) < (2017 - quantiles["50%"])] <- .5

# Let's look at it
year_weights
```


## Assumptions

As we fit several models based on different techniques, we suggest to define
rules for deciding which model(s) should be finally considered.

Here are examples of rules:

  - statistical information about the quality of the fit:
adequate error estimations for models, sensitivity to outliers, the Root Mean
Square Error (RMSE - lower is generally better but it may also 
be indicative of model overfitting), inspection of model terms and residuals
  
  - simplicity: for models with similar performance, we prefer the models with 
less assumptions
  
  - a qualitative rationale on the probable shape of trends over time: because 
of the exponential increase in the number of invasive species globally (Seebens 
et al. 2017), we expect the long-term temporal trend over time to be either 
increasing or stabilising, but not decreasing. Hence, we assume that a model 
describing a decreasing trend in recent years (i.e., for years lower than the
75% completeness threshold) would indicate an effect of the lack of data for 
recent years. 

## Models included in the ensemble modelling

There are several models included in the function. All models are calibrated
with cost data as the response variable and time as predictive variable. Note
that the monetary data have statistical issues typical to *econometric* data:
heteroskedasticity, temporal autocorrelation and outliers. Therefore, our choice of 
modelling methods was driven by methods relatively robust to such issues.

**Note: when using the package, please cite the individual model packages.
Update citations with your package versions! To do that
type for example `citation("earth")` in R**

 - **Ordinary least square regressions** (hereafter called OLS, R package `stats`,
 function `lm`).
 OLS regressions are the classical regression methods used in statistics. 
 Coefficient estimates should be relatively robust to heteroscedasticity &
 temporal autocorrelation but not to outliers. However, error estimates need to
 be corrected, so we used the Heteroskedasticity and Autocorrelation Consistent 
 covariance matrix estimations from the R package
 `sandwich` (Andrews 1991, Zeileis 2004), and we 
 used the function `coeftest` from package `lmtest` (Zeileis & Hothorn 2004) to test for the significance
 of estimates upon this corrected variance covariance matrix. We 
 implemented two OLS methods: 
    + **linear OLS regressions** 
    + **quadratic OLS regressions**
 - **Robust regressions** (R package `robustbase`,
 function `lmrob`, Maechler et al. 2020). We implemented MM-type regressions (hereafter called
 robust regressions) based on iteratively reweighted least squares since these types 
 of regressions are less sensitive to outliers than ordinary least square 
 regressions (and we do have outliers in InvaCost!) (Yohai 1987, Koller and 
 Stahel 2011). In addition, this method estimates standard errors robust to heteroskedasticity and autocorrelation as described in Croux et al. 2003. *Thanks 
 Andrew Kramer for this addition*. Likewise to OLS regressions, we implemented 
 two methods:
    + **linear robust regressions** 
    + **quadratic robust regressions**
 - **Multivariate adaptive regression splines** (Multivariate Adaptive Regression 
 Splines, MARS, R package `earth`, function `earth`, Milborrow 2018). The MARS 
 model is a non-parametric regression method which automatically models
 nonlinearities, using Generalized Cross-Validation to avoid overfitting 
 (Friedman 1991, Hastie et al. 2009). The default parameters are implemented in
 order to follow Friedman's default, as described in 
 [Milborrow (2019a).](http://www.milbo.org/doc/earth-notes.pdf)
 We account for heteroskedasticity by
 fitting a linear model on the residuals. This linear model is fitted by 
 Iteratively Reweighting Least Squares. Note, however, that we have enough
 data to only approximately model variance, as explained in 
 [Milborrow (2019b)](http://www.milbo.org/doc/earth-varmod.pdf). Therefore, MARS
 models will have more uncertainty in the prediction intervals than in the
 predictions themselves.
 - **Generalized additive models** (Generalized Additive Models, GAM, R package
 `mgcv`, function `gam`, Wood et al. 2016). GAM models are automatic flexible statistical methods 
 used to identify and characterize nonlinear regression effects 
 (Hastie et al. 2009).
 The GAM model will also show non-linear patterns in cost trends over 
 time. To account for heteroskedasticity, we used a
 location-scale method which consists in fitting two GAMs, one for the average
 trend and one for the standard deviation. We used a simple Gaussian location
 scale family because, likewise to GAMs, we have enough
 data to only approximately model variance. 
 - **Quantile regressions** (R package `quantreg`, Koenker et al. 2020).
 Contrary to previous models, 
 quantile regressions do not try to estimate the average value, they estimate
 a specific quantile. In the package we chose quantiles 0.1 (lower boundary
 of annual costs), 0.5 (median cost value) and 0.9 (upper boundary of annual costs).

**References:**

Andrews DWK (1991), Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. _Econometrica_, *59*, 817–858.

Croux, C., Dhaene, G. and Hoorelbeke, D. (2003) Robust standard errors for robust estimators, _Discussion Papers Series 03.16_, K.U. Leuven, CES.

Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines". _The Annals of Statistics._ *19* (1): 1–67.

Hastie T, Tibshirani R, Friedman J. (2009)  The elements of statistical learning: data mining, inference, and prediction. Springer, New York City, USA

Koenker, R. quantreg: Quantile Regression. R package version 5.61. Available at 
http://CRAN.R-project.org/package=quantreg (2020). 

Koller, M. and Stahel, W.A. (2011) Sharpening Wald-type inference in robust regression for small samples. Computational _Statistics & Data Analysis_ *55*(8), 2504–2515

Maechler M, Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M, Eduardo LT,  Conceicao and di Palma MA (2020). robustbase: Basic Robust Statistics R package version 0.93-6. URL http://CRAN.R-project.org/package=robustbase

Milborrow, S. earth: Multivariate Adaptive regression Splines, R package version 4.6. 3 (2018).

Milborrow S. (2019a) Notes   on   earth   package.   Vignettes   of   the R package ‘earth’, http://www.milbo.org/doc/earth-notes.pdf

Milborrow S. (2019b) Variance models in earth.   Vignettes   of   the R package ‘earth’, http://www.milbo.org/doc/earth-varmod.pdf

Wood, S.N., N. Pya and B. Saefken (2016), Smoothing parameter and model selection for general smooth models. _Journal of the American Statistical Association_ *111*, 1548-1575 

Yohai, V., Stahel, W.A. and Zamar, R. (1991) A procedure for robust estimation and inference in linear regression; in Stahel and Weisberg (eds), _Directions in Robust Statistics and Diagnostics, Part II_, Springer, New York, 365–374

Zeileis A (2004). “Econometric Computing with HC and HAC Covariance Matrix Estimators.” _Journal of Statistical Software_, *11*(10), 1-17.

Zeileis A, Hothorn T (2002). Diagnostic Checking in Regression Relationships. _R News_ *2*(3), 7-10. URL https://CRAN.R-project.org/doc/Rnews/

**A few remarks:**

Paying attention to the statistical robustness of the model is important. 
However, remember that even if we had found a model that has an ideal fit to the
data, with no statistical violations whatsoever, we should absolutely not 
consider the predictions as accurate values of the annual costs of IAS, because 
of the assumption that there is a substantial part of costs that have not been
estimated or reported in InvaCost The usefulness of these models is to 
investigate whether the costs are increasing, stabilizing, or decreasing, 
and how these trends changed over time, and to have an idea of the order of 
magnitude of annual IAS costs depending on time. 

Therefore, remember that the major uncertainty we have on estimated cost values
does not necessarily come from statistical robustness but rather from the 
unknown missing part of data.

It may not seem critical at this stage to have perfect statistics as long as 
the different methods do converge in their results. This is the general 
philosophy of fitting multiple models here, which is called an ensemble 
modelling procedure: compare if different methods converge in their results, 
so we have confidence that the result we observe is not a statistical artifact. 


Notwithstanding, we suggest to take time to carefully inspect your models, as 
we illustrate in the following sections.

## Model fitting

The function called `modelCosts` will fit all models automatically. We 
can also provide several parameters such as 

- starting year (`minimum.year`): defaults to 1960
- ending year (`maximum.year`): defaults to last year of the input data
- cost transformation (`cost.transf`): by default, a log10 transformation will
be applied
- costs in millions (`in.millions`): by default, costs are transformed in 
millions so numbers are easier to read
- threshold (`incomplete.year.threshold`) and/or weights 
(`incomplete.year.weights`) for incomplete years (nothing by default)
- number of parameters for GAM (dimension basis `gam.k`) and MARS (number of
model terms `mars.nprune`). **We suggest you should not alter these parameters,
except if you have good reasons to change them.** Notably, in some very rare 
cases, GAM models can compute forever, in which case, try to define a small 
value for gam.k (e.g. start from 10 and reduce until the GAM fits).
- the function will conveniently print the annual cost value estimated by
all models for a single year, usually the last year. You can change this by
defining `final.year` (defaults to last year of the input data). Do not worry,
values are estimated
for all years, this is mostly to provide a summary inspection in the console. 


In following example we have decided to exclude the five most recent years 
because they are likely to be the least complete years and they also contain 
obvious outliers as you will see on the graphs. You could also decide to 
manually exclude specific outliers by setting their weights to zero.

Here is an example in action:
```{r models}
global.trend <- modelCosts(
  db.over.time, # The EXPANDED database
  minimum.year = 1970, 
  maximum.year = 2020,
  incomplete.year.threshold = 2015)

# Let's see the results in the console
global.trend
```


## Model plotting


We can now look at the shape of each model with the `plot` function, which uses
ggplot2 internally in the invacost R package:

```{r models2, fig.height=7, fig.width=10}
plot(global.trend)
```

We can see that all models generally converged in their predictions, suggesting
that the global costs of biological invasions have increased over the past
50 years. Different models suggest different patterns, as illustrated by 
the linear vs. quadratic models, the GAM and MARS models.
The trend illustrated by the MARS model is non-monotonous, suggesting a decrease
between 2000 and 2010. This apparent decrease provides insight into our data.
It may either indicate an actual decrease pattern in global invasion costs,
an illustration of the high dispersion of costs after 2000, or the 
incompleteness of data after 2000. Given our prior assumptions, we may hypothesise
that the two latter hypotheses are more probable than the former one. 

**Important.** Note that error bands for MARS are 95% prediction intervals, whereas
for all other models they are 95% confidence intervals. Indeed, we cannot estimate
confidence intervals for MARS models.

Quantile regressions indicate a median (quantile 0.5) lower than the linear regression,
suggesting that most years have a lower-than-average cost of invasions, and a 
few years have very high costs. The 0.1 and 0.9 quantiles have diverging trends,
suggesting that the amplitude of costs increases over time (more discrepancy
between high-cost and low-cost years).

Remember to check your model summaries before going into more interpretations.
Some models may not be significant.


## Inspecting model outputs

We can access the content of the output object with
```{r models3, cache=TRUE}
str(global.trend)
```

There are several elements in the output object:

- `cost.data` contains total annual costs per year, upon which models were fitted
- `parameters` contains the parameters used to run the function 
- `fitted.models` contains all the objects of the fitted models. You can 
access models individually from here, look at the parameters, etc.
- `estimated.annual.costs` contains cost values predicted by all models for
each year, with confidence intervals
- `gam.predicted.variance` contains the predicted variance values for the
location-scale GAM model
- `model.summary` contains useful statistical information about models: r-squared,
adjusted r-squared, terms, significance, errors, etc.
- `RMSE` contains the root mean square error of all models, both for the 
calibration data only and for all data points
- `final.year.cost` is the cost value predicted by models for `final.year`

## How to  evaluate models?

We provide an example here where we rely on the different assumptions stated 
above.

First, we can look at model RMSE:

```{r models4}
global.trend$RMSE
```

Overall, the MARS model provided a closer fit to data points than 
other models. **Remember that RMSE is irrelevant for quantile regression
because it does not seek to fit the average trend.**

Hence, purely from a statistical point of view, we would tend to prioritize 
MARS models over other models. However, remember that the MARS model illustrated
the uncertain behaviour of our data after 2000, and we may decide that,
based on our a-priori assumptions, this is not the most adequate model, and
we will also include the other models (who all have similar RMSE) in our predictions.


Now, some models may seem to have good fit but may not be significant. This
is highly critical especially for data-poor subsets of InvaCost To verify this,
inspect the model summaries:

<details>
  <summary> <b> Click here to see the model summary</b> </summary>

```{r models8}
global.trend$model.summary
```
</details>

This is quite long because of the many models we fitted! Inspecting these
outputs requires some statistical knowledge, so do not hesitate to read the
help files and papers mentioned in the **Models included** section.

If we start reading from the top, we can see that for OLS regressions, 
coefficients of the linear model are significant, but coefficients of the
quadratic model are not! So we should exclude this model from further analysis.

Looking at robust regressions, coefficients seem significant for linear
terms, whereas they could not be calculated for quadratic terms in our example.
The model outputs also indicate us that some outliers
were found in both cases (their weight was set to zero). This is useful if your
dataset has a lot of outliers!

The GAM output states that both the mean and sd model were significant. It is 
more difficult to interpret GAMs solely based on this output, and you should
probably inspect model residuals visually etc. However, remember that we probably
do not have enough data to fit adequate GAMs, so be cautious with the interpretation.

The MARS output illustrates the different hinge functions. We can see the
years that serve as "breaking" point in the model, as well as the coefficients
of slopes after these years. It also shows the coefficients used to measure 
model performance, such as the Generalized Cross-Validation (GCV) which is the
residual sum of squares penalized by the effective number of model parameters,
or the GRSq, which is the R² normalizing the GCV (i.e., the higher the GRSq,
the better it is, and if you have a negative GRSq, it is very bad news). For more
information, see sections 13.6 and 13.11 in the 
[earth notes](http://www.milbo.org/doc/earth-notes.pdf).


## Including model summary in your paper

It may be difficult to report these vastly different model summaries in your
paper, so we have made a convenient function to summarize the important 
aspects of model summaries into a single table. You can then write this table
as a CSV file for example, and format it to include in your paper.

<details>
  <summary> <b> Click here to see the pretty summary</b> </summary>

```{r models8.1}
summarized.summary <- prettySummary(global.trend)

summarized.summary
```
</details>


## Customising graphs


Customising plots for the modelling approach similar to the approach based on 
available estimates: 
there are two options. 


- The first one is to use the 
standard ggplot produced by the package and adding things or changing 
parameters. This method is easy to implement but you cannot change everything
(e.g. adjust the colors/shapes of points is not possible). This is what we
will see here. See the help here: `?plot.invacost.costmodel`

- The second one is to make your own ggplot from the output object. It is 
more difficult to implement if you are not familiar with graphs in R - but it
will be fully customisable. 
[Take a look at the scripts from our main paper](http://borisleroy.com/invacost/global_invasion_costs_scripts.html)  to see how to that.


### Simple customisation

There are two base plots provided with the package; you have already seen the 
default, and here is another one where all models are on a single facet:

```{r models5}
plot(global.trend,
     plot.type = "single")
```


Remember that we found that the quadratic models were not significant or not
working properly in our examples?

We can exclude them from model outputs:


```{r models52, fig.width = 16, fig.height = 9}
plot(global.trend,
     models = c("ols.linear", "robust.linear", "gam",
    "mars", "quantile"))
```


### Complex customisation

Likewise to the approach based on available estimates, if you want to customise ggplot parameters,
you have to set `graphical.parameters = "manual"`.

```{r models6, cache=TRUE}
# Store the plot in object p2 to customize it afterwards
p2 <- plot(global.trend,
           graphical.parameters = "manual")

# Show the graph in its initial state
p2
```

Ugly isn't it? That's mostly because the y scale is not in log10 scale.

You can now set all parameters by yourself; here is a starting point:
```{r models7, cache=TRUE}
# Customize p2 now
p2 <- p2 +
  xlab("Year") + 
  ylab("Average annual cost of invasions in US$ millions") +
  scale_x_continuous(breaks = obs.costs$year.breaks) + # X axis breaks - notice I used the breaks from obs.costs
  theme_bw() + # Minimal theme
  scale_y_log10(breaks = 10^(-15:15), # y axis in log 10 with pretty labels
                labels = scales::comma) +
  annotation_logticks(sides = "l") # log10 tick marks on y axis

# Let's see how it goes
p2
```

Tadaaam!

# Example on a specific subset: mammals of North America

This is just an illustration to show you how to make it work on a subset of the
database. We are not going to analyse it in details.

We assume here that you have run the code on the first sections of this 
tutorial: database filtering and completeness.

First, let's inspect the geographic regions of InvaCost:
```{r example1.1}
unique(invacost$Geographic_region)
```

There are seven different regions in INVACOST, and sometimes cost are spread over 
different regions. Indeed, cost estimates in publications and reports often
correspond to data aggregated over several regions, several taxa, several types
of cost etc. Most of the time, it is not possible to 
split these costs into their respective subsets. Therefore, we have to omit them
if we want to focus on a single region. Here, we focus on North America only:

```{r example1.2}
invacost.NA <- invacost[which(invacost$Geographic_region == "North America"), ]

# Number of rows
nrow(invacost.NA)
```

We have `r nrow(invacost.NA)` lines in the North America subset of InvaCost 

```{r example1.3}
# Let's see the content of the Class column
unique(invacost.NA$Class)
# Subset the NA InvaCost database
invacost.NA.mammals <- invacost.NA[which(invacost.NA$Class == "Mammalia"), ]

# Number of rows
nrow(invacost.NA.mammals)
```

Once again, there are studies involving multiple taxa, and here we focus only on
mammals.

There are only 63 rows in this subset, which may not be 
sufficient to run the modelling approach. Let's confirm this by starting with
the approach based on available estimates:


```{r example1.4}
# Expand the subset
NAmammals.over.time <- expandYearlyCosts(invacost.NA.mammals,
                                         startcolumn = "Probable_starting_year_adjusted",
                                         endcolumn = "Probable_ending_year_adjusted")


obs.NAmammals <- summarizeCosts(NAmammals.over.time,
                                    minimum.year = 1970)

obs.NAmammals

plot(obs.NAmammals)
```

Indeed, looking at the graph it would be ill-advised to calibrate models on this
subset of InvaCost (feel free to try it!).
We should rather focus on the cumulative cost over time, 
which in our case amounts to 
US$ `r scales::comma(obs.NAmammals$average.total.cost$total_cost)` millions for 
the 1970-2017 time period.


# Example on many subsets: all taxa/species in the database

This is a more complex situation where we want to derive a single estimate for 
all species/taxon in the database.

First, we need to inspect the taxonomic columns of the database to decide whether
we want to apply changes before running the script.

```{r allsp1}
# Here we just show the first 25
unique(invacost$Species)[1:25]
```

As you can see, there are many cases where multiple species are studied together.
These cases will be difficult to implement/analyse, but we can decide to merge
some of them together. For example, rats (genus *Rattus*) are often not
identified at the species level, or multiple species were reported concomitantly.
Therefore, we can decide to merge them together:


```{r allsp2}
# If we modify species we change them in a new column called sp.list
invacost$sp.list <- invacost$Species
# If we modify genera we change them in a new column called genus.list
invacost$genus.list <- invacost$Genus

# There are many species of Rattus which are sometimes together, sometimes not, 
# and oftentimes not identified to the species level. Therefore, we can decide
# to merge them together in a single species complex
# Species column
invacost$sp.list[which(invacost$Genus == "Rattus")] <- "Rattus sp."
```

We can also do that for other taxa; here are the changes we applied for the
Nature paper ([Diagne et al. 2021](https://doi.org/10.1038/s41586-021-03405-6)):

```{r allsp3}
invacost$sp.list[which(invacost$Genus == "Aedes")] <- "Aedes spp."
invacost$sp.list[which(invacost$Genus == "Canis")] <- "Canis lupus spp."
```

Note that there are many other taxonomic peculiarities that would need *ad hoc*
corrections. Ensure that you have analysed your list of species (`unique(invacost$Species)`) 
and genera (`unique(invacost$Genus)`) before you proceed with the rest of the 
analysis.


Now that our taxon group list is set up, we still need to create a unique identifier 
for each taxon group. Indeed, if we use directly the `Species` column, then
all taxa that have the generic name *Diverse/Unspecified* will be considered 
in the same group. However, even if species names were unspecified, we still 
have information at high taxonomic levels (e.g., mammals, plants, etc.). 
Therefore, we will use all the taxonomic information to make sure that we do not
mix costs from e.g. plants with costs of mammals even when the species is 
marked as *Diverse/Unspecified*.

So we will create a 
unique identifier which integrates taxonomic data to avoid mixing together 
different kingdoms/phyla/classes etc.

```{r allsp4}
# Unique identifier
invacost$unique.sp.id <- do.call("paste", invacost[, c("Kingdom", "Phylum", "Class", "Family", "genus.list", "sp.list")])
```

Finally, we will write a loop that will cycle through all these unique groups,
and for each group, calculate the cumulative cost and average annual cost
for the 1970-2017 time period. 

```{r allsp5, warning=FALSE, cache = TRUE}
# First we expand the database
db.over.time <- expandYearlyCosts(invacost,
                                  startcolumn = "Probable_starting_year_adjusted",
                                  endcolumn = "Probable_ending_year_adjusted")

# We subset this data to the 1970 - 2020 time period
db.over.time <- db.over.time[which(db.over.time$Impact_year >= 1970 &
                                     db.over.time$Impact_year <= 2020), ]


# Then we prepare a data.frame in which we will store our results
species.summary <- data.frame()
# We will cycle the loop through all unique identifiers
for(sp in unique(db.over.time$unique.sp.id))
{
  # We subset the database for our current species
  cur.db <- db.over.time[which(db.over.time$unique.sp.id %in% sp), ]
  
  # We apply the raw cost function
  cur.raw <- summarizeCosts(cur.db)
  
  
  # And from the cur.raw object we extract the specific information we are looking for
  species.summary <- rbind.data.frame(species.summary,
                                      data.frame(
                                        Kingdom = cur.db$Kingdom[1],
                                        Phylum = cur.db$Phylum[1],
                                        Class = cur.db$Class[1],
                                        Family = cur.db$Family[1],
                                        Genus = cur.db$Genus[1],
                                        Species = cur.db$sp.list[1],
                                        Average.annual.cost = cur.raw$average.total.cost$annual_cost,
                                        Cumulated.cost = cur.raw$average.total.cost$total_cost,
                                        Number.estimates = cur.raw$average.total.cost$number_estimates,
                                        Number.year.values = cur.raw$average.total.cost$number_year_values
                                      ))
}

# To make the summary dataframe nicer, we can sort by cost to have the highest groups first
species.summary <- species.summary[order(species.summary$Cumulated.cost, decreasing = TRUE), ]


# Have a look at the first groups
species.summary[1:10, ]
```

Of course, many lines in this table are not interesting because they correspond 
to all studies covering multiple taxa. 


[For a more detailed example, please look at the scripts we provided with the
main paper here - note however that these were performed on an earlier version 
of the database, and corrections happened since.](http://borisleroy.com/invacost/global_invasion_costs_scripts.html).


# Improving the package

If something is not clear, or missing, or if you would like to request a 
feature, please open an issue on
[our GitHub repository](https://github.com/Farewe/invacost/issues).

However, remember that we need to find a good balance
between generalisability and specificity: not enough parameters and users are not
happy; too many parameters and users are lost in the function usage. Therefore,
if you have a very specific request that will not be useful to other users, do
not hesitate to duplicate the source code and adapt the function to your needs.
On the contrary, if you think of a new thing that could be beneficial to many
users, please do not hesitate and become an official contributor to the
package!


# Citation

If you found the package and/or the tutorial useful, please do not hesitate to
cite the package (in addition to the papers) as an acknowledgement for the
time spent in writing the package and this tutorial. Like all R packages, 
to know how to cite it, type:
```{r i4}
citation("invacost")
```