06-programming.Rmd

# R programming {#progchapter}

The tools in Chapters \@ref(thebasics)-\@ref(messychapter) will allow you to manipulate, summarise and visualise your data in all sorts of ways. But what if you need to compute some statistic that there isn't a function for? What if you need automatic checks of your data and results? What if you need to repeat the same analysis for a large number of files? This is where the programming tools you'll learn about in this chapter, like loops and conditional statements, come in handy. And this is where you take the step from being able to use R for routine analyses to being able to use R for _any_ analysis.

After working with the material in this chapter, you will be able to use R to:

* Write your own R functions,
* Use several new pipe operators,
* Use conditional statements to perform different operations depending on whether or not a condition is satisfied,
* Iterate code operations multiple times using loops,
* Iterate code operations multiple times using functionals,
* Measure the performance of your R code.


## Functions
Suppose that we wish to compute the mean of a vector `x`. One way to do this would be to use `sum` and `length`:

```{r eval=FALSE}
x <- 1:100
# Compute mean:
sum(x)/length(x)
```

Now suppose that we wish to compute the mean of several vectors. We could do this by repeated use of `sum` and `length`:

```{r eval=FALSE}
x <- 1:100
y <- 1:200
z <- 1:300

# Compute means:
sum(x)/length(x)
sum(y)/length(y)
sum(z)/length(x)
```

But wait! I made a mistake when I copied the code to compute the mean of `z` - I forgot to change `length(x)` to `length(z)`! This is an easy mistake to make when you repeatedly copy and paste code. In addition, repeating the same code multiple times just doesn't look good. It would be much more convenient to have a single function for computing the means. Fortunately, such a function exists - `mean`:

```{r eval=FALSE}
# Compute means
mean(x)
mean(y)
mean(z)
```

As you can see, using `mean` makes the code shorter and easier to read and reduces the risk of errors induced by copying and pasting code (we only have to change the argument of one function instead of two).

You've already used a ton of different functions in R: functions for computing means, manipulating data, plotting graphics, and more. All these functions have been written by somebody who thought that they needed to repeat a task (e.g. computing a mean or plotting a bar chart) over and over again. And in such cases, it is much more convenient to have a function that does that task than to have to write or copy code every time you want to do it. This is true also for your own work - whenever you need to repeat the same task several times, it is probably a good idea to write a function for it. It will reduce the amount of code you have to write and lessen the risk of errors caused by copying and pasting old code. In this section, you will learn how to write your own functions.

### Creating functions {#createfunctions}
For the sake of the example, let's say that we wish to compute the mean of several vectors but that the function `mean` doesn't exist. We would therefore like to write our own function for computing the mean of a vector. An R function takes some variables as input (arguments or parameters) and returns an object. Functions are defined using `function`\index{function!define}\index{\texttt{function}}\index{\texttt{return}}. The definition follows a particular format:

```{r eval=FALSE}
function_name <- function(argument1, argument2, ...)
{
      # ...
      # Some rows with code that creates some_object
      # ...
      return(some_object)
}
```

In the case of our function for computing a mean, this could look like:

```{r eval=FALSE}
average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}
```

This defines a function called `average`, that takes an object called `x` as input. It computes the sum of the elements of `x`, divides that by the number of elements in `x`, and returns the resulting mean.

If we now make a call to `average(x)`, our function will compute the mean value of the vector `x`. Let's try it out, to see that it works:

```{r eval=FALSE}
x <- 1:100
y <- 1:200
average(x)
average(y)
```


### Local and global variables
Note that despite the fact that the vector was called `x` in the code we used to define the function, `average` works regardless of whether the input is called `x` or `y`. This is because R distinguishes between _global variables_ and _local variables_\index{variable!global}\index{variable!local}. A global variable is created in the _global environment_ outside a function, and is available to all functions (these are the variables that you can see in the Environment panel in RStudio). A local variable is created in the _local environment_ inside a function, and is only available to that particular function. For instance, our `average` function creates a variable called `avg`, yet when we attempt to access `avg` after running `average` this variable doesn't seem to exist:

```{r eval=FALSE}
average(x)
avg
```

Because `avg` is a local variable, it is only available inside of the `average` function. Local variables take precedence over global variables inside the functions to which they belong. Because we named the argument used in the function `x`, `x` becomes the name of a local variable in `average`. As far as `average` is concerned, there is only one variable named `x`, and that is whatever object that was given as input to the function, regardless of what its original name was. Any operations performed on the local variable `x` won't affect the global variable `x` at all.

Functions can access global variables:
```{r eval=FALSE}
y_squared <- function()
{
      return(y^2)
}

y <- 2
y_squared()
```

But operations performed on global variables inside functions won't affect the global variable:

```{r eval=FALSE}
add_to_y <- function(n)
{
      y <- y + n
}

y <- 1
add_to_y(1)
y
```

Suppose you really need to change a global variable inside a function^[Do you _really_?]. In that case, you can use an alternative assignment operator, `<<-`\index{\texttt{<<-}}, which assigns a value to the variable in the _parent environment_ to the current environment. If you use `<<-` for assignment inside a function that is called from the global environment, this means that the assignment takes place in the global environment. But if you use `<<-` in a function (function 1) that is called by another function (function 2), the assignment will take place in the environment for function 2, thus affecting a local variable in function 2. Here is an example of a global assignment using `<<-`:

```{r eval=FALSE}
add_to_y_global <- function(n)
{
      y <<- y + n
}

y <- 1
add_to_y_global(1)
y
```

### Will your function work? {#willfunctionwork}
It is always a good idea to test if your function works as intended, and to try to figure out what can cause it to break. Let's return to our `average` function:

```{r eval=FALSE}
average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}
```

We've already seen that it seems to work when the input `x` is a numeric vector. But what happens if we input something else instead?

```{r eval=FALSE}
average(c(1, 5, 8)) # Numeric input
average(c(TRUE, TRUE, FALSE)) # Logical input
average(c("Lady Gaga", "Tool", "Dry the River")) # Character input
average(data.frame(x = c(1, 1, 1), y = c(2, 2, 1))) # Numeric df
average(data.frame(x = c(1, 5, 8), y = c("A", "B", "C"))) # Mixed type
```

The first two of these render the desired output (the `logical` values being represented by 0's and 1's), but the rest don't. Many R functions include checks that the input is of the correct type, or checks to see which method should be applied depending on what data type the input is. We'll learn how to perform such checks in Section \@ref(conditions).

As a side note, it is possible to write functions that don't end with `return`. In that case, the output (i.e. what would be written in the Console if you'd run the code there) from the last line of the function will automatically be returned. I prefer to (almost) always use `return` though, as it is easy to accidentally make the function return nothing by finishing it with a line that yields no output. Below are two examples of how we could have written `average` without a call to `return`. The first doesn't work as intended, because the function's final (and only) line doesn't give any output.

```{r eval=FALSE}
average_bad <- function(x)
{
      avg <- sum(x)/length(x)
}

average_ok <- function(x)
{
      sum(x)/length(x)
}

average_bad(c(1, 5, 8))
average_ok(c(1, 5, 8))
```

### More on arguments
It is possible to create functions with as many arguments as you like, but it will become quite unwieldy if the user has to supply too many arguments to your function. It is therefore common to provide default values to arguments, which is done by setting a value in the `function` call\index{function!default value of argument}. Here is an example of a function that computes $x^n$, using $n=2$ as the default:

```{r eval=FALSE}
power_n <- function(x, n = 2)
{
      return(x^n)
}
```

If we don't supply `n`, `power_n` uses the default `n = 2`:

```{r eval=FALSE}
power_n(3)
```

But if we supply an `n`, `power_n` will use that instead:

```{r eval=FALSE}
power_n(3, 1)
power_n(3, 3)
```

For clarity, you can specify which value corresponds to which argument:

```{r eval=FALSE}
power_n(x = 2, n = 5)
```

...and can then even put the arguments in the wrong order:

```{r eval=FALSE}
power_n(n = 5, x = 2)
```

However, if we only supply `n` we get an error, because there is no default value for `x`:

```{r eval=FALSE}
power_n(n = 5)
```

It is possible to pass a function as an argument\index{function!function as argument}. Here is a function that takes a vector and a function as input, and applies the function to the first two elements of the vector:

```{r eval=FALSE}
apply_to_first2 <- function(x, func)
{
      result <- func(x[1:2])
      return(result)
}
```

By supplying different functions to `apply_to_first2`, we can make it perform different tasks:

```{r eval=FALSE}
x <- c(4, 5, 6)
apply_to_first2(x, sqrt)
apply_to_first2(x, is.character)
apply_to_first2(x, power_n)
```

But what if the function that we supply requires additional arguments? Using `apply_to_first2` with `sum` and the vector `c(4, 5, 6)` works fine:

```{r eval=FALSE}
apply_to_first2(x, sum)
```

But if we instead use the vector `c(4, NA, 6)` the function returns `NA` :

```{r eval=FALSE}
x <- c(4, NA, 6)
apply_to_first2(x, sum)
```

Perhaps we'd like to pass `na.rm = TRUE` to `sum` to ensure that we get a `numeric` result, if at all possible. This can be done by adding `...` to the list of arguments for both functions\index{function!\texttt{...} argument}\index{\texttt{...}}, which indicates additional parameters (to be supplied by the user) that will be passed to ` func`:

```{r eval=FALSE}
apply_to_first2 <- function(x, func, ...)
{
      result <- func(x[1:2], ...)
      return(result)
}

x <- c(4, NA, 6)
apply_to_first2(x, sum)
apply_to_first2(x, sum, na.rm = TRUE)
```

$$\sim$$

```{exercise, label="ch6exc1"}
Write a function that converts temperature measurements in degrees Fahrenheit to degrees Celsius, and apply it to the `Temp` column of the `airquality` data.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions1)


<br>

```{exercise, label="ch6exc2"}
Practice writing functions by doing the following:

1. Write a function that takes a vector as input and returns a vector containing its minimum and maximum, without using `min` and `max`.

2. Write a function that computes the mean of the squared values of a vector using `mean`, and that takes additional arguments that it passes on to `mean` (e.g. `na.rm`).

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2)


### Namespaces
It is possible, and even likely, that you will encounter functions in packages with the same name as functions in other packages. Or, similarly, that there are functions in packages with the same names as those you have written yourself. This is of course a bit of a headache, but it's actually something that can be overcome without changing the names of the functions. Just like variables can live in different environments, R functions live in _namespaces_\index{namespace}\index{function!namespace}, usually corresponding to either the global environment or the package they belong to. By specifying which namespace to look for the function in, you can use multiple functions that all have the same name.

For example, let's create a function called `sqrt`. There is already such a function in the `base`\index{\texttt{base}} package^[`base` is automatically loaded when you start R, and contains core functions such as `sqrt`.] (see `?sqrt`), but let's do it anyway:

```{r eval=FALSE}
sqrt <- function(x)
{
      return(x^10)
}
```

If we now apply `sqrt` to an object, the function that we just defined will be used:

```{r eval=FALSE}
sqrt(4)
```

But if we want to use the `sqrt` from `base`, we can specify that by writing the namespace (which almost always is the package name) followed by `::`\index{\texttt{::}} and the function name:

```{r eval=FALSE}
base::sqrt(4)
```

The `::` notation can also be used to call a function\index{package!use function from without loading} or object from a package without loading the package's namespace:

```{r eval=FALSE}
msleep # Doesn't work if ggplot2 isn't loaded
ggplot2::msleep # Works, without loading the ggplot2 namespace!
```

When you call a function, R will look for it in all active namespaces, following a particular order. To see the order of the namespaces, you can use `search`\index{\texttt{search}}:

```{r eval=FALSE}
search()
```

Note that the global environment is first in this list - meaning that the functions that you define always will be preferred to functions in packages.

All this being said, note that it is bad practice to give your functions and variables the same names as common functions. Don't name them `mean`, `c` or `sqrt`. Nothing good can ever come from that sort of behaviour.

Nothing.

### Sourcing other scripts
If you want to reuse a function that you have written in a new script, you can of course copy it into that script. But if you then make changes to your function, you will quickly end up with several different versions of it. A better idea can therefore be to put the function in a separate script, which you then can call in each script where you need the function. This is done using `source`\index{\texttt{source}}. If, for instance, you have code that defines some functions in a file called `helper-functions.R` in your working directory, you can run it (thus defining the functions) when the rest of your code is run by adding `source("helper-functions.R")` to your code.

Another option is to create an R package containing the function, but that is beyond the scope of this book. Should you choose to go down that route, I highly recommend reading [_R Packages_](https://r-pkgs.org/index.html) by Wickham and Bryan.


## More on pipes {#morepipes}
We have seen how the `magrittr` pipe `%>%` can be used to chain functions together. But there are also other pipe operators that are useful. In this section we'll look at some of them, and see how you can create functions using pipes.

### _Ce ne sont pas non plus des pipes_
Although `%>%` is the most used pipe operator\index{\texttt{pipe}}, the `magrittr` package provides a number of other pipes that are useful in certain situations.

One example is when you want to pass variables rather than an entire dataset to the next function. This is needed for instance if you want to use `cor` to compute the correlation between two variables, because `cor` takes two vectors as input instead of a data frame. You can do it using ordinary `%>%` pipes:

```{r eval=FALSE}
library(magrittr)
airquality %>%
      subset(Temp > 80) %>%
      {cor(.$Temp, .$Wind)}
```

However, the curly brackets `{}` and the dots `.` makes this a little awkward and difficult to read. A better option is to use the `%$%`\index{\texttt{\%\$\%}} pipe, which passes on the names of all variables in your data frame instead: 

```{r eval=FALSE}
airquality %>% 
      subset(Temp > 80) %$% 
      cor(Temp, Wind)
```

If you want to modify a variable using a pipe, you can use the _compound assignment_ pipe `%<>%`\index{\texttt{\%<>\%}}. The following three lines all yield exactly the same result:

```{r eval=FALSE}
x <- 1:8;   x <- sqrt(x);        x
x <- 1:8;   x %>% sqrt -> x;     x   
x <- 1:8;   x %<>% sqrt;         x
```

As long as the first pipe in the pipeline is the compound assignment operator `%<>%`, you can combine it with other pipes:

```{r eval=FALSE}
x <- 1:8
x %<>% subset(x > 5) %>% sqrt
x
```

Sometimes you want to do something in the middle of a pipeline, like creating a plot, before continuing to the next step in the chain. The _tee_ operator `%T>%`\index{\texttt{\%T>\%}}\index{pipe!plot in chain} can be used to execute a function without passing on its output (if any). Instead, it passes on the output to its left. Here is an example:

```{r eval=FALSE}
airquality %>% 
      subset(Temp > 80) %T>%
      plot %$% 
      cor(Temp, Wind)
```

Note that if we'd used an ordinary pipe `%>%` instead, we'd get an error:

```{r eval=FALSE}
airquality %>% 
      subset(Temp > 80) %>% 
      plot %$% 
      cor(Temp, Wind)
```

The reason is that `cor` looks for the variables `Temp` and `Wind` in the plot object, and not in the data frame. The tee operator takes care of this by passing on the data from its left side.

Remember that if you have a function where data only appears within parentheses, you need to wrap the function in curly brackets:

```{r eval=FALSE}
airquality %>% 
      subset(Temp > 80) %T>%
      {cat("Number of rows in data:", nrow(.), "\n")} %$% 
      cor(Temp, Wind)
```

When using the tee operator, this is true also for call to `ggplot`, where you additionally need to wrap the plot object in a call to `print`\index{\texttt{print}}:

```{r eval=FALSE}
library(ggplot2)
airquality %>% 
      subset(Temp > 80) %T>%
      {print(ggplot(., aes(Temp, Wind)) + geom_point())} %$% 
      cor(Temp, Wind)
```

### Writing functions with pipes
If you will be reusing the same pipeline multiple times, you may want to create a function for it. Let's say that you have a data frame containing only `numeric` variables, and that you want to create a scatterplot matrix (which can be done using `plot`) and compute the correlations between all variables (using `cor`). As an example, you could do this for `airquality` as follows:

```{r eval=FALSE}
airquality %T>% plot %>% cor
```

To define a function for this combination of operators, we simply write\index{pipe!inside function}:

```{r eval=FALSE}
plot_and_cor <- . %T>% plot %>% cor
```

Note that we don't have to write `function(...)` when defining functions with pipes!

We can now use this function just like any other:

```{r eval=FALSE}
# With the airquality data:
airquality %>% plot_and_cor
plot_and_cor(airquality)

# With the bookstore data:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
bookstore <- data.frame(age, purchase)
bookstore %>% plot_and_cor
```

$$\sim$$
```{exercise, label="ch6exc2b"}
Write a function that takes a data frame as input and uses pipes to print the number of `NA` values in the data, remove all rows with `NA` values and return a summary of the remaining data.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2b)


<br>

```{exercise, label="ch6exc2c"}
Pipes are operators, that is, functions that take two variables as input and can be written without parentheses (other examples of operators are `+` and `*`). You can define your own operators just as you would any other function\index{function!operator}. For instance, we can define an operator called `quadratic` that takes two numbers `a` and `b` as input and computes the quadratic expression $(a+b)^2$:
  
```{r eval=FALSE}
`%quadratic%` <- function(a, b) { (a + b)^2 }
2 %quadratic% 3
```

Create an operator called `%against%` that takes two vectors as input and draws a scatterplot of them.

`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2c)


## Checking conditions {#conditions}
Sometimes you'd like your code to perform different operations depending on whether or not a certain condition is fulfilled. Perhaps you want it to do something different if there is missing data, if the input is a `character` vector, or if the largest value in a `numeric` vector is greater than some number. In Section \@ref(conditionsintro) you learned how to filter data using conditions. In this section, you'll learn how to use conditional statements for a number of other tasks.

### `if` and `else`
The most important functions for checking whether a condition is fulfilled are `if` and `else`\index{\texttt{if}}\index{\texttt{else}}\index{if statements}. The basic syntax is

```{r eval=FALSE}
if(condition) { do something } else { do something else }
```

The condition should return a single `logical` value, so that it evaluates to either `TRUE` or `FALSE`. If the condition is fulfilled, i.e. if it is `TRUE`, the code inside the first pair of curly brackets will run, and if it's not (`FALSE`), the code within the second pair of curly brackets will run instead.

As a first example, assume that you want to compute the reciprocal of $x$, $1/x$, unless $x=0$, in which case you wish to print an error message:

```{r eval=FALSE}
x <- 2
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```

Now try running the same code with `x` set to `0`:

```{r eval=FALSE}
x <- 0
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```

Alternatively, we could check if $x\neq 0$ and then change the order of the segments within the curly brackets:

```{r eval=FALSE}
x <- 0
if(x != 0) { 1/x } else { cat("Error! Division by zero.") }
```

You don't have to write all of the code on the same line, but you must make sure that the `else` part is on the same line as the first `}`:

```{r eval=FALSE}
if(x == 0)
{ 
    cat("Error! Division by zero.")
} else
{ 
    1/x
}
```

You can also choose not to have an `else` part at all. In that case, the code inside the curly brackets will run if the condition is satisfied, and if not, nothing will happen:

```{r eval=FALSE}
x <- 0
if(x == 0) { cat("x is 0.") }

x <- 2
if(x == 0) { cat("x is 0.") }
```

Finally, if you need to check a number of conditions one after another, in order to list different possibilities, you can do so by repeated use of `if` and `else`:

```{r eval=FALSE}
if(x == 0)
{ 
    cat("Error! Division by zero.")
} else if(is.infinite((x)))
{
    cat("Error! Divison by infinity.")
} else if(is.na((x)))
{
    cat("Error! Divison by NA.")
} else
{ 
    1/x
}
```

### `&` & `&&`
Just as when we used conditions for filtering in Sections \@ref(conditionsintro) and \@ref(conditions2), it is possible to combine several conditions into one using `&` (AND) and `|` (OR). However, the `&` and `|` operators are vectorised, meaning that they will return a vector of `logical` values whenever possible. This is not desirable in conditional statements, where the condition must evaluate to a single value. Using a condition that returns a vector results in a warning message:

```{r eval=FALSE}
if(c(1, 2) == 2) { cat("The vector contains the number 2.\n") }
if(c(2, 1) == 2) { cat("The vector contains the number 2.\n") }
```

As you can see, only the first element of the `logical` vector is evaluated by `if`. Usually, if a condition evaluates to a vector, it is because you've made an error in your code. Remember, if you really need to evaluate a condition regarding the elements in a vector, you can collapse the resulting `logical` vector to a single value using `any` or `all`.

Some texts recommend using the operators `&&` and `||`\index{\texttt{\&\&}}\index{\texttt{$\mid\mid$}} instead of `&` and `|` in conditional statements. These work almost like `&` and `|`, but force the condition to evaluate to a single `logical`. I prefer to use `&` and `|`, because I want to be notified if my condition evaluates to a vector - once again, that likely means that there is an error somewhere in my code!

There is, however, one case where I much prefer `&&` and `||`. `&` and `|` always evaluate all the conditions that you're combining, while `&&` and `||` don't: `&&` stops as soon as it encounters a `FALSE` and `||` stops as soon as it encounters a `TRUE`. Consequently, you can put the conditions you wish to combine in a particular order to make sure that they can be evaluated. For instance, you may want first to check that a variable exists, and then check a property. This can be done using `exists`\index{\texttt{exists}} to check whether or not it exists - note that the variable name must be written within quotes:

```{r eval=FALSE}
# a is a variable that doesn't exist

# Using && works:
if(exists("a") && a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is negative.") }

# But using & doesn't, because it attempts to evaluate a>0
# even though a doesn't exist:
if(exists("a") & a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is negative.") }
```

### `ifelse`
It is common that you want to assign different values to a variable depending on whether or not a condition is satisfied:

```{r eval=FALSE}
x <- 2

if(x == 0)
{ 
    reciprocal <- "Error! Division by zero."
} else
{ 
    reciprocal <- 1/x
}

reciprocal
```

In fact, this situation is so common that there is a special function for it: `ifelse`\index{\texttt{ifelse}}:

```{r eval=FALSE}
reciprocal <- ifelse(x == 0, "Error! Division by zero.", 1/x)
```

`ifelse` evaluates a condition and then returns different answers depending on whether the condition is `TRUE` or `FALSE`. It can also be applied to vectors, in which case it checks the condition for each element of the vector and returns an answer for each element:

```{r eval=FALSE}
x <- c(-1, 1, 2, -2, 3)
ifelse(x > 0, "Positive", "Negative")
```

### `switch`
For the sake of readability, it is usually a good idea to try to avoid chains of the type `if() {} else if() {} else if() {} else {}`. One function that can be useful for this is `switch`\index{\texttt{switch}}, which lets you list a number of possible results, either by position (a number) or by name:

```{r eval=FALSE}
position <- 2
switch(position,
      "First position",
      "Second position",
      "Third position")

name <- "First"
switch(name,
      First = "First name",
      Second = "Second name",
      Third = "Third name")
```

You can for instance use this to decide what function should be applied to your data:

```{r eval=FALSE}
x <- 1:3
y <- c(3, 5, 4)
method <- "nonparametric2"
cor_xy <- switch(method,
      parametric = cor(x, y, method = "pearson"),
      nonparametric1 = cor(x, y, method = "spearman"),
      nonparametric2 = cor(x, y, method = "kendall"))
cor_xy
```

### Failing gracefully
Conditional statements are useful for ensuring that the input to a function you've written is of the correct type. In Section \@ref(willfunctionwork) we saw that our `average` function failed if we applied it to a `character` vector:

```{r eval=FALSE}
average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}

average(c("Lady Gaga", "Tool", "Dry the River"))
```

By using a conditional statement, we can provide a more informative error message. We can check that the input is `numeric` and, if it's not, stop the function and print an error message, using `stop`:

```{r eval=FALSE}
average <- function(x)
{
      if(is.numeric(x))
      {
            avg <- sum(x)/length(x)
            return(avg)
      } else
      { 
            stop("The input must be a numeric vector.")
      }
}

average(c(1, 5, 8))
average(c("Lady Gaga", "Tool", "Dry the River"))
```

$$\sim$$

```{exercise, label="ch6exc3"}
Which of the following conditions are `TRUE`? First think about the answer, and then check it using R.

```{r eval=FALSE}
x <- 2
y <- 3
z <- -3
```

1. `x > 2`

2. `x > y | x > z`

3. `x > y & x > z`

4. `abs(x*z) >= y`

`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions3)


<br>

```{exercise, label="ch6exc4"}
Fix the errors in the following code:

```{r eval=FALSE}
x <- c(1, 2, pi, 8)

# Only compute square roots if x exists
# and contains positive values:
if(exists(x)) { if(x > 0) { sqrt(x) } }
```

`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions4)


## Iteration using loops {#loopsection}
We have already seen how you can use functions to make it easier to repeat the same task over and over. But there is still a part of the puzzle missing - what if, for example, you wish to apply a function to each column of a data frame? What if you want to apply it to data from a number of files, one at a time? The solution to these problems is to use _iteration_\index{iteration}. In this section, we'll explore how to perform iteration using _loops_\index{loop}.

### `for` loops {#forloops}
`for` loops can be used to run the same code several times, with different settings, e.g. different data, in each iteration. Their use is perhaps best explained by some examples. We create the loop using `for`\index{\texttt{for}}\index{loop!\texttt{for}}, give the name of a _control variable_ and a vector containing its values (the control variable controls how many iterations to run) and then write the code that should be repeated in each iteration of the loop. In each iteration, a new value of the control variable is used in the code, and the loop stops when all values have been used.

As a first example, let's write a `for` loop that runs a block of code five times, where the block prints the current iteration number:

```{r eval=FALSE}
for(i in 1:5)
{
    cat("Iteration", i, "\n")
}
```

This is equivalent to writing:

```{r eval=FALSE}
cat("Iteration", 1, "\n")
cat("Iteration", 2, "\n")
cat("Iteration", 3, "\n")
cat("Iteration", 4, "\n")
cat("Iteration", 5, "\n")
```

The upside is that we didn't have to copy and edit the same code multiple times - and as you can imagine, this benefit becomes even more pronounced if you have more complicated code blocks.

The values for the control variable are given in a vector, and the code block will be run once for each element in the vector - we say the we _loop over the values in the vector_. The vector doesn't have to be `numeric` - here is an example with a `character` vector:

```{r eval=FALSE}
for(word in c("one", "two", "five hundred and fifty five"))
{
    cat("Iteration", word, "\n")
}
```

Of course, loops are used for so much more than merely printing text on the screen. A common use is to perform some computation and then store the result in a vector. In this case, we must first create an empty vector to store the result in, e.g. using `vector`\index{\texttt{vector}}, which creates an empty vector of a specific type and length:

```{r eval=FALSE}
squares <- vector("numeric", 5)

for(i in 1:5)
{
    squares[i] <- i^2
}
squares
```

In this case, it would have been both simpler and computationally faster to compute the squared values by running `(1:5)^2`. This is known as a _vectorised_ solution, and is very important in R. We'll discuss vectorised solutions in detail in Section \@ref(vectorloops).

When creating the values used for the control variable, we often wish to create different sequences of numbers. Two functions that are very useful for this are `seq`\index{\texttt{seq}}, which creates sequences, and `rep`\index{\texttt{rep}}, which repeats patterns:

```{r eval=FALSE}
seq(0, 100)
seq(0, 100, by = 10)
seq(0, 100, length.out = 21)

rep(1, 4)
rep(c(1, 2), 4)
rep(c(1, 2), c(4, 2))
```

Finally, `seq_along`\index{\texttt{seq\_along}} can be used to create a sequence of indices for a vector of a data frame, which is useful if you wish to iterate some code for each element of a vector or each column of a data frame:

```{r eval=FALSE}
seq_along(airquality) # Gives the indices of all column of the data 
                      # frame
seq_along(airquality$Temp) # Gives the indices of all elements of the
                           # vector
```

Here is an example of how to use `seq_along` to compute the mean of each column of a data frame:

```{r eval=FALSE}
# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))

# Loop over the variables in airquality:
for(i in seq_along(airquality))
{
      means[i] <- mean(airquality[[i]], na.rm = TRUE)
}

# Check that the results agree with those from the colMeans function:
means
colMeans(airquality, na.rm = TRUE)
```

The line inside the loop could have read `means[i] <- mean(airquality[,i], na.rm = TRUE)`, but that would have caused problems if we'd used it with a `data.table` or `tibble` object; see Section \@ref(subsetusingcn).

Finally, we can also change the values of the data in each iteration of the loop. Some machine learning methods require that the data is _standardised_\index{data!standardise}, i.e. that all columns have mean 0 and standard deviation 1. This is achieved by subtracting the mean from each variable and then dividing each variable by its standard deviation. We can write a function for this that uses a loop, changing the values of a column in each iteration:

```{r eval=FALSE}
standardise <- function(df, ...)
{
      for(i in seq_along(df))
      {
          df[[i]] <- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
      }
      return(df)
}

# Try it out:
aqs <- standardise(airquality, na.rm = TRUE)
colMeans(aqs, na.rm = TRUE) # Non-zero due to floating point
                            # arithmetics!
sd(aqs$Wind)
```

$$\sim$$

```{exercise, label="ch6exc5"}
Practice writing `for` loops by doing the following:
  
1. Compute the mean temperature for each month in the `airquality` dataset using a loop rather than an existing function.

2. Use a `for` loop to compute the maximum and minimum value of each column of the `airquality` data frame, storing the results in a data frame.

3. Make your solution to the previous task reusable by writing a function that returns the maximum and minimum value of each column of a data frame.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions5)

<br>

```{exercise, label="ch6exc5b"}
Use `rep` or `seq` to create the following vectors:
  
1. `0.25 0.5 0.75 1`

2. `1 1 1 2 2 5`

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions5b)

<br>

```{exercise, label="ch6exc6"}
As an alternative to `seq_along(airquality)` and `seq_along(airquality$Temp)`, we could create the same sequences using `1:ncol(airquality)` and `1:length(airquality$Temp)`. Use `x <- c()` to create a vector of length zero. Then create loops that use `seq_along(x)` and `1:length(x)` as values for the control variable. How many iterations are the two loops run? Which solution is preferable?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions6)


<br>

```{exercise, label="ch6exc7"}
An alternative to standardisation is _normalisation_, where all `numeric` variables are rescaled so that their smallest value is 0 and their largest value is 1. Write a function that normalises the variables in a data frame containing `numeric` columns.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions7)

<br>

```{exercise, label="ch6exc10"}
The function `list.files`\index{\texttt{list.files}} can be used to create a vector containing the names of all files in a folder. The `pattern` argument can be used to supply a regular expression describing a file name pattern. For instance, if `pattern = "\\.csv$"` is used, only `.csv` files will be listed.

Create a loop that goes through all `.csv` files in a folder and prints the names of the variables for each file.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions10)


### Loops within loops {#nestedloops}
In some situations, you'll want to put a loop inside another loop. Such loops are said to be _nested_\index{loop!nested}. An example is if we want to compute the correlation between all pairs of variables in `airquality`, and store the result in a matrix:

```{r eval=FALSE}
cor_mat <- matrix(NA, nrow = ncol(airquality),
                  ncol = ncol(airquality))
for(i in seq_along(airquality))
{
    for(j in seq_along(airquality))
    {
        cor_mat[i, j] <- cor(airquality[[i]], airquality[[j]],
                             use = "pairwise.complete")
    }
}

# Element [i, j] of the matrix now contains the correlation between
# variables i and j:
cor_mat
```

Once again, there is a vectorised solution to this problem, given by `cor(airquality, use = "pairwise.complete")`. As we will see in Section \@ref(measureperformance), vectorised solutions like this can be several times faster than solutions that use nested loops. In general, solutions involving nested loops tend to be fairly slow - but on the other hand, they are often easy and straightforward to implement.

### Keeping track of what's happening {#beepr}
Sometimes each iteration of your loop takes a long time to run, and you'll want to monitor its progress. This can be done using printed messages or a progress bar in the Console panel, or sound notifications. We'll showcase each of these using a loop containing a call to `Sys.sleep`\index{\texttt{Sys.sleep}}, which pauses the execution of R commands for a short time (determined by the user).

First, we can use `cat` to print a message describing the progress. Adding `\r` to the end of a string allows us to print all messages on the same line, with each new message replacing the old one:

```{r eval=FALSE}
# Print each message on a new same line:
for(i in 1:5)
{
    cat("Step", i, "out of 5\n")
    Sys.sleep(1) # Sleep for 1 second
}

# Replace the previous message with the new one:
for(i in 1:5)
{
    cat("Step", i, "out of 5\r")
    Sys.sleep(1) # Sleep for one second
}
```

Adding a progress bar is a little more complicated, because we must first start the bar by using `txtProgressBar`\index{\texttt{txtProgressBar}} and the update it using `setTxtProgressBar`\index{\texttt{setTxtProgressBar}}:

```{r eval=FALSE}
sequence <- 1:5
pbar <- txtProgressBar(min = 0, max = max(sequence), style = 3)
for(i in sequence)
{
    Sys.sleep(1) # Sleep for 1 second
    setTxtProgressBar(pbar, i)
}
close(pbar)
```

Finally, the `beepr`\index{\texttt{beepr}} package^[Arguably the best add-on package for R.] can be used to play sounds, with the function `beep`\index{\texttt{beep}}:

```{r eval=FALSE}
install.packages("beepr")

library(beepr)
# Play all 11 sounds available in beepr:
for(i in 1:11)
{
    beep(sound = i)
    Sys.sleep(2) # Sleep for 2 seconds
}
```

### Loops and lists {#lists}
In our previous examples of loops, it has always been clear from the start how many iterations the loop should run and what the length of the output vector (or data frame) should be. This isn't always the case. To begin with, let's consider the case where the length of the output is unknown or difficult to know in advance. Let's say that we want to go through the `airquality` data to find days that are extreme in the sense that at least one variable attains its maximum on those days. That is, we wish to find the index of the maximum of each variable, and store them in a vector. Because several days can have the same temperature or wind speed, there may be more than one such maximal index for each variable. For that reason, we don't know the length of the output vector in advance.

In such cases, it is usually a good idea to store the result from each iteration in a `list` (Section \@ref(lists2)), and then collect the elements from the list once the loop has finished. We can create an empty list with one element for each variable in `airquality` using `vector`:

```{r eval=FALSE}
# Create an empty list with one element for each variable in
# airquality:
max_list <- vector("list", ncol(airquality))

# Naming the list elements will help us see which variable the maximal
# indices belong to:
names(max_list) <- names(airquality)

# Loop over the variables to find the maxima:
for(i in seq_along(airquality))
{
      # Find indices of maximum values:
      max_index <- which(airquality[[i]] == max(airquality[[i]],
                                                na.rm = TRUE))
      
      # Add indices to list:
      max_list[[i]] <- max_index
}

# Check results:
max_list

# Collapse to a vector:
extreme_days <- unlist(max_list)
```

(In this case, only the variables `Month` and `Days` have duplicate maximum values.)


### `while` loops {#whileloop}
In some situations, we want to run a loop until a certain condition is met, meaning that we don't know in advance how many iterations we'll need. This is more common in numerical optimisation and simulation, but sometimes also occurs in data analyses.

When we don't know in advance how many iterations that are needed, we can use `while` loops\index{\texttt{while}}\index{loop!\texttt{while}}. Unlike `for` loops, that iterate a fixed number of times, `while` loops keep iterating as long as some specified condition is met. Here is an example where the loop keeps iterating until `i` squared is greater than 100:

```{r eval=FALSE}
i <- 1

while(i^2 <= 100)
{
      cat(i,"squared is", i^2, "\n")
      i <- i +1
}

i
```

The code block inside the loop keeps repeating until the condition `i^2 <= 100` no longer is satisfied. We have to be a little bit careful with this condition - if we set it in such a way that it is possible that the condition _always_ will be satisfied, the loop will just keep running and running - creating what is known as an _infinite loop_\index{infinite loop|see {infinite loop}}. If you've accidentally created an infinite loop, you can break it by pressing the Stop button at the top of the Console panel in RStudio.

In Section \@ref(rle) we saw how `rle` can be used to find and compute the lengths of runs of equal values in a vector. We can use nested `while` loops to create something similar.  `while` loops are a good choice here, because we don't know how many runs are in the vector in advance. Here is an example, which you'll study in more detail in Exercise \@ref(exr:ch6exc8):

```{r eval=FALSE}
# Create a vector of 0's and 1's:
x <- rep(c(0, 1, 0, 1, 0), c(5, 1, 4, 2, 7))

# Create empty vectors where the results will be stored:
run_values <- run_lengths <- c()

# Set the initial condition:
i <- 1

# Iterate over the entire vector:
while(i < length(x))
{
    # A new run starts:
    run_length <- 1
    cat("A run starts at i =", i, "\n")
    
    # Check how long the run continues:
    while(x[i+1] == x[i] & i < length(x))
    {
          run_length <- run_length + 1
          i <- i + 1
    }
    
    i <- i + 1
    
    # Save results:
    run_values <- c(run_values, x[i-1])
    run_lengths <- c(run_lengths, run_length)
}

# Present the results:
data.frame(run_values, run_lengths)
```

$$\sim$$

```{exercise, label="ch6exc8"}
Consider the nested `while` loops in the run length example above. Go through the code and think about what happens in each step. What happens when `i` is `1`? When it is `5`? When it is `6`? Answer the following questions:

1. What does the condition for the outer while loop check? Why is it needed?
  
2. What does the condition for the inner while loop check? Why is it needed?
  
3. What does the line `run_values <- c(run_values, x[i-1])` do?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions8)


<br>

```{exercise, label="ch6exc9"}
The _control statements_ `break`\index{\texttt{break}} and `next`\index{\texttt{next}} can be used inside both `for` and `while` loops to control their behaviour further. `break` stops a loop, and `next` skips to the next iteration of it. Use these functions to modify the following piece of code so that the loop skips to the next iteration if `x[i]` is `0`, and breaks if `x[i]` is `NA`:
  
```{r eval=FALSE}
x <- c(1, 5, 8, 0, 20, 0, 3, NA, 18, 2)

for(i in seq_along(x))
{
      cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}
```

`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions9)

<br>

```{exercise, label="ch6exc9b"}
Using the `cor_mat` computation from Section \@ref(nestedloops), write a function that computes all pairwise correlations in a data frame, and uses `next` to only compute correlations for `numeric` variables. Test your function by applying it to the `msleep` data from `ggplot2`. Could you achieve the same thing without using `next`?
  
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions9b)


## Iteration using vectorisation and functionals {#vectorloops}
Many operators and functions in R take vectors as input and handle them in a highly efficient way, usually by passing the vector on to an optimised function written in the C programming language^[Unlike R, C is a low-level language that allows the user to write highly specialised (and complex) code to perform operations very quickly.]. So if we want to compute the squares of the numbers in a vector, we don't need to write a loop:

```{r eval=FALSE}
squares <- vector("numeric", 5)

for(i in 1:5)
{
    squares[i] <- i^2
}
squares
```

Instead, we can simply apply the `^` operator, which uses fast C code to compute the squares:

```{r eval=FALSE}
squares <- (1:5)^2
```

These types of functions and operators are called _vectorised_\index{function!vectorised}. They take a vector as input and apply a function to all its elements, meaning that we can avoid slower solutions utilising loops in R^[The vectorised functions often use loops, but loops written in C, which are much faster.]. Try to use vectorised solutions rather than loops whenever possible - it makes your code both easier to read and faster to run.

A related concept is _functionals_\index{functional}, which are functions that contain a `for` loop. Instead of writing a `for` loop, you can use a functional, supplying data, a function that should be applied in each iteration of the loop, and a vector to loop over. This won't necessarily make your loop run faster, but it does have other benefits:

* _Shorter code_: functionals allow you to write more concise code. Some would argue that they also allow you to write code that is easier to read, but that is obviously a matter of taste.
* _Efficient_: functionals handle memory allocation and other small tasks efficiently, meaning that you don't have to worry about creating a vector of an appropriate size to store the result.
* _No changes to your environment_: because all operations now take place in the local environment of the functional, you don't run the risk of accidentally changing variables in your global environment.
* _No left-overs_: `for` leaves the control variable (e.g. `i`) in the environment, functionals do not.
* _Easy to use with pipes_: because the loop has been wrapped in a function, it lends itself well to being used in a `%>%` pipeline.

Explicit loops are preferable when:

* You think that they are easier to read and write.
* Your functions take data frames or other non-vector objects as input.
* Each iteration of your loop depends on the results from previous iterations.

In this section, we'll see how we can apply functionals to obtain elegant alternatives to (explicit) loops.

### A first example with `apply`
The prototypical functional is `apply`\index{\texttt{apply}}, which loops over either the rows or the columns of a data frame^[Actually, over the rows or columns of a matrix - `apply` converts the data frame to a `matrix` object.]. The arguments are a dataset, the margin to loop over (`1` for rows, `2` for columns) and then the function to be applied.

In Section \@ref(forloops) we wrote a `for` loop for computing the mean value of each column in a data frame:

```{r eval=FALSE}
# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))

# Loop over the variables in airquality:
for(i in seq_along(airquality))
{
      means[i] <- mean(airquality[[i]], na.rm = TRUE)
}
```

Using apply, we can reduce this to a single line. We wish to use the `airquality` data, loop over the columns (margin `2`) and apply the function `mean` to each column:

```{r eval=FALSE}
apply(airquality, 2, mean)
```

Rather elegant, don't you think?

Additional arguments can be passed to the function inside apply by adding them to the end of the function call:

```{r eval=FALSE}
apply(airquality, 2, mean, na.rm = TRUE)
```

$$\sim$$

```{exercise, label="ch6exc11"}
Use `apply` to compute the maximum and minimum value of each column of the `airquality` data frame. Can you write a function that allows you to compute both with a single call to `apply`?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions11)


### Variations on a theme
There are several variations of `apply` that are tailored to specific problems:

* `lapply`\index{\texttt{lapply}}: takes a function and vector/list as input, and returns a list.
* `sapply`\index{\texttt{sapply}}: takes a function and vector/list as input, and returns a vector or matrix.
* `vapply`\index{\texttt{vapply}}: a version of `sapply` with additional checks of the format of the output.
* `tapply`\index{\texttt{tapply}}: for looping over groups, e.g. when computing grouped summaries.
* `rapply`\index{\texttt{rapply}}: a recursive version of `tapply`.
* `mapply`\index{\texttt{mapply}}: for applying a function to multiple arguments; see Section \@ref(multiiter).
* `eapply`\index{\texttt{eapply}}: for applying a function to all objects in an environment.

We have already seen several ways to compute the mean temperature for different months in the `airquality` data (Sections \@ref(grouped) and \@ref(grouped2), and Exercise \@ref(exr:ch6exc5)). The `*apply` family offer several more:

```{r eval=FALSE}
# Create a list:
temps <- split(airquality$Temp, airquality$Month)

lapply(temps, mean)
sapply(temps, mean)
vapply(temps, mean, vector("numeric", 1))
tapply(airquality$Temp, airquality$Month, mean)
```

There is, as that delightful proverb goes, more than one way to skin a cat.

$$\sim$$

```{exercise, label="ch6exc12"}
Use an `*apply` function to simultaneously compute the monthly maximum and minimum temperature in the `airquality` data frame.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions12)


<br>

```{exercise, label="ch6exc13"}
Use an `*apply` function to simultaneously compute the monthly maximum and minimum temperature and windspeed in the `airquality` data frame.

Hint: start by writing a function that simultaneously computes the maximum and minimum temperature and windspeed for a data frame containing data from a single month.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions13)


### `purrr` {#purrr}
If you feel enthusiastic about ~~skinning cats~~ using functionals instead of loops, the tidyverse package `purrr`\index{\texttt{purrr}} is a great addition to your toolbox. It contains a number of specialised alternatives to the `*apply` functions. More importantly, it also contains certain shortcuts that come in handy when working with functionals. For instance, it is fairly common to define a short function inside your functional, which is useful for instance when you don't want the function to take up space in your environment. This can be done a little more elegantly with `purrr` functions using a shortcut denoted by `~`\index{\texttt{\textasciitilde}}. Let's say that we want to standardise all variables in `airquality`. The `map` function is the `purrr` equivalent of `lapply`. We can use it with or with the shortcut, and with or without pipes (we mention the use of pipes now because it will be important in what comes next):

```{r eval=FALSE}
# Base solution:
lapply(airquality, function(x) { (x-mean(x))/sd(x) })

# Base solution with pipe:
library(magrittr)
airquality %>% lapply(function(x) { (x-mean(x))/sd(x) })

# purrr solution:
library(purrr)
map(airquality, function(x) { (x-mean(x))/sd(x) })

# We can make the purrr solution less verbose using a shortcut:
map(airquality, ~(.-mean(.))/sd(.))

# purr solution with pipe and shortcut:
airquality %>% map(~(.-mean(.))/sd(.))
```

Where this shortcut really shines is if you need to use multiple functionals. Let's say that we want to standardise the `airquality` variables, compute a `summary` and then extract columns `2` and `5` from the summary (which contains the 1st and 3rd quartile of the data):

```{r eval=FALSE}
# Impenetrable base solution:
lapply(lapply(lapply(airquality,
                     function(x) { (x-mean(x))/sd(x) }),
              summary),
      function(x) { x[c(2, 5)] })

# Base solution with pipe:
airquality %>% 
      lapply(function(x) { (x-mean(x))/sd(x) }) %>%
      lapply(summary) %>%
      lapply(function(x) { x[c(2, 5)] })

# purrr solution:
airquality %>%
      map(~(.-mean(.))/sd(.)) %>%
      map(summary) %>%
      map(~.[c(2, 5)])
```

Once you know the meaning of `~`, the `purrr` solution is a lot cleaner than the base solutions.

### Specialised functions
So far, it may seem like `map` is just like `lapply` but with a shortcut for defining functions. Which is more or less true. But `purrr` contains a lot more functionals that you can use, each tailored to specific problems.

For instance, if you need to specify that the output should be a vector of a specific type, you can use:

* `map_dbl(data, function)`\index{\texttt{map\_dbl}} instead of `vapply(data, function, vector("numeric", length))`,
* `map_int(data, function)`\index{\texttt{map\_int}} instead of `vapply(data, function, vector("integer", length))`, 
* `map_chr(data, function)`\index{\texttt{map\_chr}} instead of `vapply(data, function, vector("character", length))`, 
* `map_lgl(data, function)`\index{\texttt{map\_lgl}} instead of `vapply(data, function, vector("logical", length))`.

If you need to specify that the output should be a data frame, you can use:

* `map_dfr(data, function)`\index{\texttt{map\_dfr}} instead of `sapply(data, function)`.

The `~` shortcut for functions is available for all these `map_*` functions. In case you need to pass additional arguments to the function inside the functional, just add them at the end of the functional call:

```{r eval=FALSE}
airquality %>% map_dbl(max)
airquality %>% map_dbl(max, na.rm = TRUE)
```

Another specialised function is the `walk` function\index{\texttt{walk}}. It works just like `map`, but doesn't return anything. This is useful if you want to apply a function with no output, such as `cat` or `read.csv`:

```{r eval=FALSE}
# Returns a list of NULL values:
airquality %>% map(~cat("Maximum:", max(.), "\n"))

# Returns nothing:
airquality %>% walk(~cat("Maximum:", max(.), "\n")) 
```

$$\sim$$

```{exercise, label="ch6exc13b"}
Use a `map_*` function to simultaneously compute the monthly maximum and minimum temperature in the `airquality` data frame, returning a vector.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions13b)


### Exploring data with functionals
Functionals are great for creating custom summaries of your data. For instance, if you want to check the data type and number of unique values of each variable in your dataset, you can do that with a functional:

```{r eval=FALSE}
library(ggplot2)
diamonds %>% map_dfr(~(data.frame(unique_values = length(unique(.)),
                                  class = class(.))))
```

You can of course combine `purrr` functionals with functions from other packages, e.g. to replace `length(unique(.))` with a function from your favourite data manipulation package:

```{r eval=FALSE}
# Using uniqueN from data.table:
library(data.table)
dia <- as.data.table(diamonds)
dia %>% map_dfr(~(data.frame(unique_values = uniqueN(.),
                             class = class(.))))

# Using n_distinct from dplyr:
library(dplyr)
diamonds %>% map_dfr(~(data.frame(unique_values = n_distinct(.),
                                  class = class(.))))
```

When creating summaries it can often be useful to be able to loop over both the elements of a vector and their indices. In `purrr`, this is done using the usual `map*` functions, but with an `i` (for index) in the beginning of their names, e.g.  `imap`\index{\texttt{imap}} and `iwalk`\index{\texttt{iwalk}}:

```{r eval=FALSE}
# Returns a list of NULL values:
imap(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))  

# Returns nothing:
iwalk(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = "")) 
```

Note that `.x` is used to denote the variable, and that `.y` is used to denote the _name_ of the variable. If `i*` functions are used on vectors without element names, indices are used instead. The names of elements of vectors can be set using `set_names`\index{\texttt{set\_names}}:

```{r eval=FALSE}
# Without element names:
x <- 1:5
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))

# Set element names:
x <- set_names(x, c("exp(1)", "exp(2)", "exp(3)", "exp(4)", "exp(5)"))
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))
```

$$\sim$$

```{exercise, label="ch6exc13c"}
Write a function that takes a data frame as input and returns the following information about each variable in the data frame: variable name, number of unique values, data type and number of missing values. The function should, as you will have guessed, use a functional.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions13c)


<br>

```{exercise, label="ch6exc13d"}
In Exercise \@ref(exr:ch6exc10) you wrote a function that printed the names and variables for all `.csv` files in a folder given by `folder_path`. Use `purrr` functionals to do the same thing.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions13d)


### Keep calm and carry on
Another neat feature of `purrr` is the `safely`\index{\texttt{safely}} function, which can be used to wrap a function that will be used inside a functional, and makes sure that the functional returns a result even if there is an error. For instance, let's say that we want to compute the logarithm of all variables in the `msleep` data:

```{r eval=FALSE}
library(ggplot2)
msleep
```

Note that some columns are `character` vectors, which will cause `log` to throw an error:

```{r eval=FALSE}
log(msleep$name)
log(msleep)
lapply(msleep, log)
map(msleep, log)
```

Note that the error messages we get from `lapply` and `map` here don't give any information about which variable caused the error, making it more difficult to figure out what's gone wrong.

If first we wrap `log` with `safely`, we get a list containing the correct output for the numeric variables, and error messages for the non-numeric variables:

```{r eval=FALSE}
safe_log <- safely(log)
lapply(msleep, safe_log)
map(msleep, safe_log)
```

Not only does this tell us where the errors occur, but it also returns the logarithms for all variables that `log` actually could be applied to.

If you'd like your functional to return some default value, e.g. `NA`, instead of an error message, you can use `possibly`\index{\texttt{possibly}} instead of `safely`:

```{r eval=FALSE}
pos_log <- possibly(log, otherwise = NA)
map(msleep, pos_log)
```


### Iterating over multiple variables {#multiiter}
A final important case is when you want to iterate over more than one variable. This is often the case when fitting statistical models that should be used for prediction, as you'll see in Section \@ref(multipredict). Another example is when you wish to create plots for several subsets in your data. For instance, we could create a plot of `carat` versus `price` for each combination of `color` and `cut` in the `diamonds` data. To do this for a single combination, we'd use something like this:

```{r eval=FALSE}
library(ggplot2)
library(dplyr)

diamonds %>% filter(cut == "Fair",
                    color == "D") %>% 
             ggplot(aes(carat, price)) +
                 geom_point() +
                 ggtitle("Fair, D")
```

To create such a plot for all combinations of `color` and `cut`, we must first create a data frame containing all unique combinations, which can be done using the `distinct`\index{\texttt{distinct}} function from `dplyr`:

```{r eval=FALSE}
combos <- diamonds %>% distinct(cut, color)
cuts <- combos$cut
colours <- combos$color
```

`map2`\index{\texttt{map2}} and `walk2`\index{\texttt{walk2}} from `purrr` loop over the elements of two vectors, `x` and `y`, say. They combine the first element of `x` with the first element of `y`, the second element of `x` with the second element of `y`, and so on - meaning that they won't automatically loop over all combinations of elements. That is the reason why we use `distinct` above to create two vectors where each pair (`x[i]`, `y[i]`) correspond to a combination. Apart from the fact that we add a second vector to the call, `map2` and `walk2` work just like `map` and `walk`:

```{r eval=FALSE}
# Print all pairs:
walk2(cuts, colours, ~cat(.x, .y, "\n"))

# Create a plot for each pair:
combos_plots <- map2(cuts, colours, ~{
                     diamonds %>% filter(cut == .x,
                                         color == .y) %>% 
                      ggplot(aes(carat, price)) +
                             geom_point() +
                             ggtitle(paste(.x, .y, sep =", "))})

# View some plots:
combos_plots[[1]]
combos_plots[[30]]

# Save all plots in a pdf file, with one plot per page:
pdf("all_combos_plots.pdf", width = 8, height = 8)
combos_plots
dev.off()
```

The base function `mapply` could also have been used here. If you need to iterate over more than two vectors, you can use `pmap`\index{\texttt{pmap}} or `pwalk`\index{\texttt{pwalk}}, which work analogously to `map2` and `walk2`.

$$\sim$$

```{exercise, label="ch6exc13e"}
Using the `gapminder` data from the `gapminder` package, create scatterplots of `pop` and `lifeExp` for each combination of `continent` and `year`. Save each plot as a separate `.png` file.

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions13e)


## Measuring code performance {#measureperformance}
There are probably as many ideas about what good code is as there are programmers. Some prefer readable code; others prefer concise code. Some prefer to work with separate functions for each task, while others would rather continue to combine a few basic functions in new ways. Regardless of what you consider to be _good code_, there are a few objective measures that can be used to assess the quality of your code. In addition to writing code that works and is bug-free, you'd like your code to be:

* _Fast_: meaning that it runs quickly. Some tasks can take anything from second to weeks, depending on what code you write for them. Speed is particularly important if you're going to run your code many times. 
* _Memory efficient_: meaning that it uses as little of your computer's memory as possible. Software running on your computer uses its memory - its RAM - to store data. If you're not careful with RAM, you may end up with a full memory and a sluggish or frozen computer. Memory efficiency is critical if you're working with big datasets, that take up a lot of RAM to begin with.

In this section we'll have a look at how you can measure the speed and memory efficiency of R functions. A caveat is that while speed and memory efficiency are important, the most important thing is to come up with a solution that works in the first place. You should almost always start by solving a problem, and then worry about speed and memory efficiency, not the other way around. The reasons for this is that efficient code often is more difficult to write, read, and debug, which can slow down the process of writing it considerably.

Note also, that speed and memory usage is system dependent - the clock frequency and architecture of your processor and speed and size of your RAM will affect how your code performs, as will what operating system you use and what other programs you are running at the same time. That means that if you wish to compare how two functions perform, you need to compare them on the same system under the same conditions.

As a side note, a great way to speed up functions that use either loops or functionals is parallelisation. We cover that topic in Section \@ref(parallel).

### Timing functions
To measure how long a piece of code takes to run, we can use `system.time`\index{\texttt{system.time}} as follows:

```{r eval=FALSE}
rtime <- system.time({
      x <- rnorm(1e6)
      mean(x)
      sd(x)
})

# elapsed is the total time it took to execute the code:
rtime
```

This isn't the best way of measuring computational time though, and doesn't allow us to compare different functions easily. Instead, we'll use the `bench`\index{\texttt{bench}} package, which contains a function called `mark` that is very useful for measuring the execution time of functions and blocks of code. Let's start by installing it:

```{r eval=FALSE}
install.packages("bench")
```

In Section \@ref(createfunctions) we wrote a function for computing the mean of a vector:

```{r eval=FALSE}
average <- function(x)
{
      return(sum(x)/length(x))
}
```

Is this faster or slower than `mean`? We can use `mark`\index{\texttt{mark}}\index{function!benchmark} to apply both functions to a vector multiple times, and measure how long each execution takes:

```{r eval=FALSE}
library(bench)
x <- 1:100
bm <- mark(mean(x), average(x))
bm # Or use View(bm) if you don't want to print the results in the
   # Console panel.
```

`mark` has executed both function `n_itr` times each, and measured how long each execution took to perform. The execution time varies - in the output you can see the shortest (`min`) and median (`median`) execution times, as well as the number of iterations per second (`itr/sec`). Be a little wary of the units for the execution times so that you don't get them confused - a millisecond (ms, $10^{-3}$ seconds) is 1,000 microseconds (µs, 1 µs is $10^{-6}$ seconds), and 1 microsecond is 1,000 nanoseconds (ns, 1 ns is $10^{-9}$ seconds).

The result here may surprise you - it appears that `average` is faster than `mean`! The reason is that `mean` does a lot of things that `average` does not: it checks the data type and gives error messages if the data is of the wrong type (e.g. `character`), and then traverses the vector twice to lower the risk of errors due to floating point arithmetics. All of this takes time, and makes the function slower (but safer to use).

We can plot the results using the `ggbeeswarm`\index{\texttt{ggbeeswarm}} package:

```{r eval=FALSE}
install.packages("ggbeeswarm")

plot(bm)
```

It is also possible to place blocks of code inside curly brackets, `{ }`, in `mark`. Here is an example comparing a vectorised solution for computing the squares of a vector with a solution using a loop:

```{r eval=FALSE}
x <- 1:100
bm <- mark(x^2,
    {
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        y
    })
bm
plot(bm)
```

Although the above code works, it isn't the prettiest, and the `bm` table looks a bit confusing because of the long expression for the code block. I prefer to put the code block inside a function instead:

```{r eval=FALSE}
squares <- function(x)
{
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        return(y)
}

x <- 1:100
bm <- mark(x^2, squares(x))
bm
plot(bm)
```

Note that `squares(x)` is faster than the original code block:

```{r eval=FALSE}
bm <- mark(squares(x),
    {
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        y
    })
bm
```

Functions in R are _compiled_ the first time they are run, which often makes them run faster than the same code would have outside of the function. We'll discuss this further next.

### Measuring memory usage - and a note on compilation
`mark` also shows us how much memory is allocated when running different code blocks, in the `mem_alloc` column of the output^[But only if your version of R has been _compiled with memory profiling_. If you are using a standard build of R, i.e. have downloaded the base R binary from R-project.org, you should be good to go. You can check that memory profiling is enabled by checking that `capabilities("profmem")` returns `TRUE`. If not, you may need to reinstall R if you wish to enable memory profiling.].

Unfortunately, measuring memory usage is a little tricky. To see why, restart R (yes, really - this is important!), and then run the following code to benchmark `x^2` versus `squares(x)`:

```{r eval=FALSE}
library(bench)

squares <- function(x)
{
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        return(y)
}

x <- 1:100
bm <- mark(x^2, squares(x))
bm
```

Judging from the `mem_alloc` column, it appears that the `squares(x)` loop not only is slower, but also uses more memory. But wait! Let's run the code again, just to be sure of the result:

```{r eval=FALSE}
bm <- mark(x^2, squares(x))
bm
```

This time out, both functions use less memory, and `squares` now uses _less_ memory than `x^2`. What's going on?

Computers can't read code written in R or most other programming languages directly. Instead, the code must be translated to _machine code_ that the computer's processor uses, in a process known as _compilation_. R uses _just-in-time compilation_ of functions and loops^[Since R 3.4.], meaning that it translates the R code for new functions and loops to machine code _during execution_. Other languages, such as C, use ahead-of-time compilation, translating the code _prior to execution_. The latter can make the execution much faster, but some flexibility is lost, and the code needs to be run through a compiler ahead of execution, which also takes time. When doing the just-in-time compilation, R needs to use some of the computer's memory, which causes the memory usage to be greater the first time the function is run. However, if an R function is run again, it has already been compiled, meaning R doesn't have to allocate memory for compilation.

In conclusion, if you want to benchmark the memory usage of functions, make sure to run them once before benchmarking. Alternatively, if your function takes a long time to run, you can compile it without running it using the `cmpfun`\index{\texttt{compfun}}\index{function!compile} function from the `compiler` package\index{\texttt{compiler}}:

```{r eval=FALSE}
library(compiler)
squares <- cmpfun(squares)
squares(1:10)
```

$$\sim$$

```{exercise, label="ch6exc14"}
Write a function for computing the mean of a vector using a `for` loop. How much slower than `mean` is it? Which function uses more memory?

```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions14)


<br>

```{exercise, label="ch6exc15"}
We have seen three different ways of filtering a data frame to only keep rows that fulfil a condition: using base R, `data.table` and `dplyr`. Suppose that we want to extract all flights from 1 January from the `flights` data in the `nycflights13` package:

```{r eval=FALSE}
library(data.table)
library(dplyr)
library(nycflights13)
# Read about the data:
?flights

# Make a data.table copy of the data:
flights.dt <- as.data.table(flights)

# Filtering using base R:
flights0101 <- flights[flights$month == 1 & flights$day == 1,]
# Filtering using data.table:
flights0101 <- flights.dt[month == 1 & day == 1,]
# Filtering using dplyr:
flights0101 <- flights %>% filter(month ==1, day == 1)
```

Compare the speed and memory usage of these three approaches. Which has the best performance?

`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions15)


\vfill