This repository has been archived by the owner on Aug 27, 2024. It is now read-only.
# R programming {#progchapter}
The tools in Chapters \@ref(thebasics)-\@ref(messychapter) will allow you to manipulate, summarise and visualise your data in all sorts of ways. But what if you need to compute some statistic that there isn't a function for? What if you need automatic checks of your data and results? What if you need to repeat the same analysis for a large number of files? This is where the programming tools you'll learn about in this chapter, like loops and conditional statements, come in handy. And this is where you take the step from being able to use R for routine analyses to being able to use R for _any_ analysis.
After working with the material in this chapter, you will be able to use R to:
* Write your own R functions,
* Use several new pipe operators,
* Use conditional statements to perform different operations depending on whether or not a condition is satisfied,
* Iterate code operations multiple times using loops,
* Iterate code operations multiple times using functionals,
* Measure the performance of your R code.
## Functions
Suppose that we wish to compute the mean of a vector `x`. One way to do this would be to use `sum` and `length`:
```{r eval=FALSE}
x <- 1:100
# Compute mean:
sum(x)/length(x)
```
Now suppose that we wish to compute the mean of several vectors. We could do this by repeated use of `sum` and `length`:
```{r eval=FALSE}
x <- 1:100
y <- 1:200
z <- 1:300
# Compute means:
sum(x)/length(x)
sum(y)/length(y)
sum(z)/length(x)
```
But wait! I made a mistake when I copied the code to compute the mean of `z` - I forgot to change `length(x)` to `length(z)`! This is an easy mistake to make when you repeatedly copy and paste code. In addition, repeating the same code multiple times just doesn't look good. It would be much more convenient to have a single function for computing the means. Fortunately, such a function exists - `mean`:
```{r eval=FALSE}
# Compute means
mean(x)
mean(y)
mean(z)
```
As you can see, using `mean` makes the code shorter and easier to read and reduces the risk of errors induced by copying and pasting code (we only have to change the argument of one function instead of two).
You've already used a ton of different functions in R: functions for computing means, manipulating data, plotting graphics, and more. All these functions have been written by somebody who thought that they needed to repeat a task (e.g. computing a mean or plotting a bar chart) over and over again. And in such cases, it is much more convenient to have a function that does that task than to have to write or copy code every time you want to do it. This is true also for your own work - whenever you need to repeat the same task several times, it is probably a good idea to write a function for it. It will reduce the amount of code you have to write and lessen the risk of errors caused by copying and pasting old code. In this section, you will learn how to write your own functions.
### Creating functions {#createfunctions}
For the sake of the example, let's say that we wish to compute the mean of several vectors but that the function `mean` doesn't exist. We would therefore like to write our own function for computing the mean of a vector. An R function takes some variables as input (arguments or parameters) and returns an object. Functions are defined using `function`\index{function!define}\index{\texttt{function}}\index{\texttt{return}}. The definition follows a particular format:
```{r eval=FALSE}
function_name <- function(argument1, argument2, ...)
{
# ...
# Some rows with code that creates some_object
# ...
return(some_object)
}
```
In the case of our function for computing a mean, this could look like:
```{r eval=FALSE}
average <- function(x)
{
avg <- sum(x)/length(x)
return(avg)
}
```
This defines a function called `average` that takes an object called `x` as input. It computes the sum of the elements of `x`, divides that by the number of elements in `x`, and returns the resulting mean.
If we now make a call to `average(x)`, our function will compute the mean value of the vector `x`. Let's try it out, to see that it works:
```{r eval=FALSE}
x <- 1:100
y <- 1:200
average(x)
average(y)
```
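As a quick sanity check, we could compare `average` with the built-in `mean` (which the example above pretends doesn't exist). This is only a sketch of one way to test the function:

```{r eval=FALSE}
average <- function(x)
{
    avg <- sum(x)/length(x)
    return(avg)
}

x <- 1:100
y <- 1:200

# all.equal compares numbers while allowing for tiny rounding differences:
all.equal(average(x), mean(x))
all.equal(average(y), mean(y))
```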
### Local and global variables
Note that despite the fact that the vector was called `x` in the code we used to define the function, `average` works regardless of whether the input is called `x` or `y`. This is because R distinguishes between _global variables_ and _local variables_\index{variable!global}\index{variable!local}. A global variable is created in the _global environment_ outside a function, and is available to all functions (these are the variables that you can see in the Environment panel in RStudio). A local variable is created in the _local environment_ inside a function, and is only available to that particular function. For instance, our `average` function creates a variable called `avg`, yet when we attempt to access `avg` after running `average` this variable doesn't seem to exist:
```{r eval=FALSE}
average(x)
avg
```
Because `avg` is a local variable, it is only available inside the `average` function. Local variables take precedence over global variables inside the functions to which they belong. Because we named the argument used in the function `x`, `x` becomes the name of a local variable in `average`. As far as `average` is concerned, there is only one variable named `x`, and that is whatever object was given as input to the function, regardless of what its original name was. Any operations performed on the local variable `x` won't affect the global variable `x` at all.
Functions can access global variables:
```{r eval=FALSE}
y_squared <- function()
{
return(y^2)
}
y <- 2
y_squared()
```
But operations performed on global variables inside functions won't affect the global variable:
```{r eval=FALSE}
add_to_y <- function(n)
{
y <- y + n
}
y <- 1
add_to_y(1)
y
```
Suppose you really need to change a global variable inside a function^[Do you _really_?]. In that case, you can use an alternative assignment operator, `<<-`\index{\texttt{<<-}}, which searches the _parent environments_ of the current environment (that is, the environments in which the function was defined) for an existing variable with the given name, and modifies the first one it finds. If no such variable is found, the assignment is made in the global environment. If you use `<<-` inside a function that was defined in the global environment, the assignment therefore takes place in the global environment. But if you use `<<-` in a function (function 1) that is defined inside another function (function 2), and function 2 has a local variable with that name, the assignment will affect that local variable in function 2 instead. What matters is where the function was defined, not where it was called from. Here is an example of a global assignment using `<<-`:
```{r eval=FALSE}
add_to_y_global <- function(n)
{
y <<- y + n
}
y <- 1
add_to_y_global(1)
y
```
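To illustrate how `<<-` behaves in a function that is defined inside another function, here is a minimal sketch. The names `make_counter`, `count` and `add_one` are made up for this example. Because `count` exists in the environment where `add_one` was defined, `<<-` modifies that variable rather than creating a global one:

```{r eval=FALSE}
make_counter <- function()
{
    count <- 0                # Local variable in make_counter
    add_one <- function()
    {
        count <<- count + 1   # Modifies count in make_counter's environment
        return(count)
    }
    return(add_one)
}

counter <- make_counter()
counter()
counter()

# No global variable called count has been created:
exists("count", envir = globalenv(), inherits = FALSE)
```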
### Will your function work? {#willfunctionwork}
It is always a good idea to test if your function works as intended, and to try to figure out what can cause it to break. Let's return to our `average` function:
```{r eval=FALSE}
average <- function(x)
{
avg <- sum(x)/length(x)
return(avg)
}
```
We've already seen that it seems to work when the input `x` is a numeric vector. But what happens if we input something else instead?
```{r eval=FALSE}
average(c(1, 5, 8)) # Numeric input
average(c(TRUE, TRUE, FALSE)) # Logical input
average(c("Lady Gaga", "Tool", "Dry the River")) # Character input
average(data.frame(x = c(1, 1, 1), y = c(2, 2, 1))) # Numeric df
average(data.frame(x = c(1, 5, 8), y = c("A", "B", "C"))) # Mixed type
```
The first two of these render the desired output (the `logical` values being represented by 0's and 1's), but the rest don't. Many R functions include checks that the input is of the correct type, or checks to see which method should be applied depending on what data type the input is. We'll learn how to perform such checks in Section \@ref(conditions).
As a side note, it is possible to write functions that don't end with `return`. In that case, the value of the last expression evaluated in the function (i.e. what would be written in the Console if you'd run that line there) will automatically be returned. I prefer to (almost) always use `return` though, as it is easy to accidentally make the function appear to return nothing by finishing it with a line that yields no visible output. Below are two examples of how we could have written `average` without a call to `return`. The first doesn't work as intended: its final (and only) line is an assignment, which returns its value invisibly, so nothing is printed when the function is called.
```{r eval=FALSE}
average_bad <- function(x)
{
avg <- sum(x)/length(x)
}
average_ok <- function(x)
{
sum(x)/length(x)
}
average_bad(c(1, 5, 8))
average_ok(c(1, 5, 8))
```
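To see what `average_bad` actually returns, we can store the result of the call in a variable, or wrap the call in parentheses to force printing. A small sketch:

```{r eval=FALSE}
average_bad <- function(x)
{
    avg <- sum(x)/length(x)
}

result <- average_bad(c(1, 5, 8))
result                       # The value was returned, just not printed
(average_bad(c(1, 5, 8)))    # Parentheses make the invisible value visible
```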
### More on arguments
It is possible to create functions with as many arguments as you like, but it will become quite unwieldy if the user has to supply too many arguments to your function. It is therefore common to provide default values to arguments, which is done by setting a value in the `function` call\index{function!default value of argument}. Here is an example of a function that computes $x^n$, using $n=2$ as the default:
```{r eval=FALSE}
power_n <- function(x, n = 2)
{
return(x^n)
}
```
If we don't supply `n`, `power_n` uses the default `n = 2`:
```{r eval=FALSE}
power_n(3)
```
But if we supply an `n`, `power_n` will use that instead:
```{r eval=FALSE}
power_n(3, 1)
power_n(3, 3)
```
For clarity, you can specify which value corresponds to which argument:
```{r eval=FALSE}
power_n(x = 2, n = 5)
```
...and can then even put the arguments in the wrong order:
```{r eval=FALSE}
power_n(n = 5, x = 2)
```
However, if we only supply `n` we get an error, because there is no default value for `x`:
```{r eval=FALSE}
power_n(n = 5)
```
It is possible to pass a function as an argument\index{function!function as argument}. Here is a function that takes a vector and a function as input, and applies the function to the first two elements of the vector:
```{r eval=FALSE}
apply_to_first2 <- function(x, func)
{
result <- func(x[1:2])
return(result)
}
```
By supplying different functions to `apply_to_first2`, we can make it perform different tasks:
```{r eval=FALSE}
x <- c(4, 5, 6)
apply_to_first2(x, sqrt)
apply_to_first2(x, is.character)
apply_to_first2(x, power_n)
```
But what if the function that we supply requires additional arguments? Using `apply_to_first2` with `sum` and the vector `c(4, 5, 6)` works fine:
```{r eval=FALSE}
apply_to_first2(x, sum)
```
But if we instead use the vector `c(4, NA, 6)`, the function returns `NA`:
```{r eval=FALSE}
x <- c(4, NA, 6)
apply_to_first2(x, sum)
```
Perhaps we'd like to pass `na.rm = TRUE` to `sum` to ensure that we get a `numeric` result, if at all possible. This can be done by adding `...` both to the argument list of `apply_to_first2` and to the call to `func` inside it\index{function!\texttt{...} argument}\index{\texttt{...}}, which indicates additional parameters (to be supplied by the user) that will be passed on to `func`:
```{r eval=FALSE}
apply_to_first2 <- function(x, func, ...)
{
result <- func(x[1:2], ...)
return(result)
}
x <- c(4, NA, 6)
apply_to_first2(x, sum)
apply_to_first2(x, sum, na.rm = TRUE)
```
$$\sim$$
```{exercise, label="ch6exc1"}
Write a function that converts temperature measurements in degrees Fahrenheit to degrees Celsius, and apply it to the `Temp` column of the `airquality` data.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions1)
<br>
```{exercise, label="ch6exc2"}
Practice writing functions by doing the following:
1. Write a function that takes a vector as input and returns a vector containing its minimum and maximum, without using `min` and `max`.
2. Write a function that computes the mean of the squared values of a vector using `mean`, and that takes additional arguments that it passes on to `mean` (e.g. `na.rm`).
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2)
### Namespaces
It is possible, and even likely, that you will encounter functions in packages with the same name as functions in other packages. Or, similarly, that there are functions in packages with the same names as those you have written yourself. This is of course a bit of a headache, but it's actually something that can be overcome without changing the names of the functions. Just like variables can live in different environments, R functions live in _namespaces_\index{namespace}\index{function!namespace}, usually corresponding to either the global environment or the package they belong to. By specifying which namespace to look for the function in, you can use multiple functions that all have the same name.
For example, let's create a function called `sqrt`. There is already such a function in the `base`\index{\texttt{base}} package^[`base` is automatically loaded when you start R, and contains core functions such as `sqrt`.] (see `?sqrt`), but let's do it anyway:
```{r eval=FALSE}
sqrt <- function(x)
{
return(x^10)
}
```
If we now apply `sqrt` to an object, the function that we just defined will be used:
```{r eval=FALSE}
sqrt(4)
```
But if we want to use the `sqrt` from `base`, we can specify that by writing the namespace (which almost always is the package name) followed by `::`\index{\texttt{::}} and the function name:
```{r eval=FALSE}
base::sqrt(4)
```
The `::` notation can also be used to call a function\index{package!use function from without loading} or object from a package without first attaching the package with `library` (R loads the package's namespace behind the scenes, but doesn't add the package to the search path):
```{r eval=FALSE}
msleep # Doesn't work if ggplot2 hasn't been attached with library
ggplot2::msleep # Works, without attaching ggplot2!
```
When you call a function, R will look for it in all active namespaces, following a particular order. To see the order of the namespaces, you can use `search`\index{\texttt{search}}:
```{r eval=FALSE}
search()
```
Note that the global environment is first in this list - meaning that the functions that you define will always be preferred to functions in packages.
All this being said, note that it is bad practice to give your functions and variables the same names as common functions. Don't name them `mean`, `c` or `sqrt`. Nothing good can ever come from that sort of behaviour.
Nothing.
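If you have followed along and defined your own `sqrt`, you probably want the original back. A minimal way to do this, assuming that your function was defined in the global environment as above, is to remove it with `rm`, after which R once again finds the version in `base`:

```{r eval=FALSE}
sqrt <- function(x)
{
    return(x^10)
}
sqrt(4)      # Uses our function

rm(sqrt)     # Removes our function
sqrt(4)      # Uses base::sqrt again
```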
### Sourcing other scripts
If you want to reuse a function that you have written in a new script, you can of course copy it into that script. But if you then make changes to your function, you will quickly end up with several different versions of it. A better idea can therefore be to put the function in a separate script, which you then can call in each script where you need the function. This is done using `source`\index{\texttt{source}}. If, for instance, you have code that defines some functions in a file called `helper-functions.R` in your working directory, you can run it (thus defining the functions) when the rest of your code is run by adding `source("helper-functions.R")` to your code.
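As a self-contained illustration, the sketch below writes a one-function script to a temporary file and then sources it. The temporary file simply stands in for `helper-functions.R`; in practice you would of course write the script by hand.

```{r eval=FALSE}
# Create a small script file (a stand-in for helper-functions.R):
helper_file <- tempfile(fileext = ".R")
writeLines("average <- function(x) { return(sum(x)/length(x)) }",
           helper_file)

# Run the script, which defines average in the global environment:
source(helper_file)
average(c(1, 5, 8))
```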
Another option is to create an R package containing the function, but that is beyond the scope of this book. Should you choose to go down that route, I highly recommend reading [_R Packages_](https://r-pkgs.org/index.html) by Wickham and Bryan.
## More on pipes {#morepipes}
We have seen how the `magrittr` pipe `%>%` can be used to chain functions together. But there are also other pipe operators that are useful. In this section we'll look at some of them, and see how you can create functions using pipes.
### _Ce ne sont pas non plus des pipes_
Although `%>%` is the most used pipe operator\index{\texttt{pipe}}, the `magrittr` package provides a number of other pipes that are useful in certain situations.
One example is when you want to pass variables rather than an entire dataset to the next function. This is needed for instance if you want to use `cor` to compute the correlation between two variables, because `cor` takes two vectors as input instead of a data frame. You can do it using ordinary `%>%` pipes:
```{r eval=FALSE}
library(magrittr)
airquality %>%
subset(Temp > 80) %>%
{cor(.$Temp, .$Wind)}
```
However, the curly brackets `{}` and the dots `.` make this a little awkward and difficult to read. A better option is to use the `%$%`\index{\texttt{\%\$\%}} pipe, which passes on the names of all variables in your data frame instead:
```{r eval=FALSE}
airquality %>%
subset(Temp > 80) %$%
cor(Temp, Wind)
```
If you want to modify a variable using a pipe, you can use the _compound assignment_ pipe `%<>%`\index{\texttt{\%<>\%}}. The following three lines all yield exactly the same result:
```{r eval=FALSE}
x <- 1:8; x <- sqrt(x); x
x <- 1:8; x %>% sqrt -> x; x
x <- 1:8; x %<>% sqrt; x
```
As long as the first pipe in the pipeline is the compound assignment operator `%<>%`, you can combine it with other pipes:
```{r eval=FALSE}
x <- 1:8
x %<>% subset(x > 5) %>% sqrt
x
```
Sometimes you want to do something in the middle of a pipeline, like creating a plot, before continuing to the next step in the chain. The _tee_ operator `%T>%`\index{\texttt{\%T>\%}}\index{pipe!plot in chain} can be used to execute a function without passing on its output (if any). Instead, it passes on the output to its left. Here is an example:
```{r eval=FALSE}
airquality %>%
subset(Temp > 80) %T>%
plot %$%
cor(Temp, Wind)
```
Note that if we'd used an ordinary pipe `%>%` instead, we'd get an error:
```{r eval=FALSE}
airquality %>%
subset(Temp > 80) %>%
plot %$%
cor(Temp, Wind)
```
The reason is that `cor` looks for the variables `Temp` and `Wind` in the plot object, and not in the data frame. The tee operator takes care of this by passing on the data from its left side.
Remember that if you have a function where data only appears within parentheses, you need to wrap the function in curly brackets:
```{r eval=FALSE}
airquality %>%
subset(Temp > 80) %T>%
{cat("Number of rows in data:", nrow(.), "\n")} %$%
cor(Temp, Wind)
```
When using the tee operator, this is true also for calls to `ggplot`, where you additionally need to wrap the plot object in a call to `print`\index{\texttt{print}}:
```{r eval=FALSE}
library(ggplot2)
airquality %>%
subset(Temp > 80) %T>%
{print(ggplot(., aes(Temp, Wind)) + geom_point())} %$%
cor(Temp, Wind)
```
### Writing functions with pipes
If you will be reusing the same pipeline multiple times, you may want to create a function for it. Let's say that you have a data frame containing only `numeric` variables, and that you want to create a scatterplot matrix (which can be done using `plot`) and compute the correlations between all variables (using `cor`). As an example, you could do this for `airquality` as follows:
```{r eval=FALSE}
airquality %T>% plot %>% cor
```
To define a function for this combination of operators, we simply write\index{pipe!inside function}:
```{r eval=FALSE}
plot_and_cor <- . %T>% plot %>% cor
```
Note that we don't have to write `function(...)` when defining functions with pipes!
We can now use this function just like any other:
```{r eval=FALSE}
# With the airquality data:
airquality %>% plot_and_cor
plot_and_cor(airquality)
# With the bookstore data:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
bookstore <- data.frame(age, purchase)
bookstore %>% plot_and_cor
```
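The same technique works for any chain of functions. As a further small sketch (the name `sum_of_roots` is made up for this example), here is a function that takes square roots and then sums them:

```{r eval=FALSE}
library(magrittr)

# Equivalent to function(x) { sum(sqrt(x)) }:
sum_of_roots <- . %>% sqrt %>% sum

sum_of_roots(c(4, 9, 16))
c(4, 9, 16) %>% sum_of_roots
```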
$$\sim$$
```{exercise, label="ch6exc2b"}
Write a function that takes a data frame as input and uses pipes to print the number of `NA` values in the data, remove all rows with `NA` values and return a summary of the remaining data.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2b)
<br>
```{exercise, label="ch6exc2c"}
Pipes are operators, that is, functions that take two variables as input and can be written without parentheses (other examples of operators are `+` and `*`). You can define your own operators just as you would any other function\index{function!operator}. For instance, we can define an operator called `%quadratic%` that takes two numbers `a` and `b` as input and computes the quadratic expression $(a+b)^2$:
```{r eval=FALSE}
`%quadratic%` <- function(a, b) { (a + b)^2 }
2 %quadratic% 3
```
Create an operator called `%against%` that takes two vectors as input and draws a scatterplot of them.
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions2c)
## Checking conditions {#conditions}
Sometimes you'd like your code to perform different operations depending on whether or not a certain condition is fulfilled. Perhaps you want it to do something different if there is missing data, if the input is a `character` vector, or if the largest value in a `numeric` vector is greater than some number. In Section \@ref(conditionsintro) you learned how to filter data using conditions. In this section, you'll learn how to use conditional statements for a number of other tasks.
### `if` and `else`
The most important functions for checking whether a condition is fulfilled are `if` and `else`\index{\texttt{if}}\index{\texttt{else}}\index{if statements}. The basic syntax is
```{r eval=FALSE}
if(condition) { do something } else { do something else }
```
The condition should return a single `logical` value, so that it evaluates to either `TRUE` or `FALSE`. If the condition is fulfilled, i.e. if it is `TRUE`, the code inside the first pair of curly brackets will run, and if it's not (`FALSE`), the code within the second pair of curly brackets will run instead.
As a first example, assume that you want to compute the reciprocal of $x$, $1/x$, unless $x=0$, in which case you wish to print an error message:
```{r eval=FALSE}
x <- 2
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```
Now try running the same code with `x` set to `0`:
```{r eval=FALSE}
x <- 0
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
```
Alternatively, we could check if $x\neq 0$ and then change the order of the segments within the curly brackets:
```{r eval=FALSE}
x <- 0
if(x != 0) { 1/x } else { cat("Error! Division by zero.") }
```
You don't have to write all of the code on the same line, but you must make sure that the `else` part is on the same line as the first `}`:
```{r eval=FALSE}
if(x == 0)
{
cat("Error! Division by zero.")
} else
{
1/x
}
```
You can also choose not to have an `else` part at all. In that case, the code inside the curly brackets will run if the condition is satisfied, and if not, nothing will happen:
```{r eval=FALSE}
x <- 0
if(x == 0) { cat("x is 0.") }
x <- 2
if(x == 0) { cat("x is 0.") }
```
Finally, if you need to check a number of conditions one after another, in order to list different possibilities, you can do so by repeated use of `if` and `else`:
```{r eval=FALSE}
# Check for NA first: x == 0 is NA when x is NA,
# and if() cannot handle an NA condition
if(is.na(x))
{
    cat("Error! Division by NA.")
} else if(x == 0)
{
    cat("Error! Division by zero.")
} else if(is.infinite(x))
{
    cat("Error! Division by infinity.")
} else
{
    1/x
}
```
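Chains like this often end up inside functions. Here is a sketch of how the checks could be wrapped in a function, where `safe_reciprocal` is a made-up name and returning `NA` for problematic input is just one possible design choice. Note that the check for `NA` comes first, as `x == 0` cannot be used as a condition when `x` is `NA`:

```{r eval=FALSE}
safe_reciprocal <- function(x)
{
    if(is.na(x))
    {
        cat("x is NA.\n")
        return(NA)
    } else if(x == 0)
    {
        cat("Division by zero.\n")
        return(NA)
    } else
    {
        return(1/x)
    }
}

safe_reciprocal(4)
safe_reciprocal(0)
safe_reciprocal(NA)
```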
### `&` & `&&`
Just as when we used conditions for filtering in Sections \@ref(conditionsintro) and \@ref(conditions2), it is possible to combine several conditions into one using `&` (AND) and `|` (OR). However, the `&` and `|` operators are vectorised, meaning that they will return a vector of `logical` values whenever possible. This is not desirable in conditional statements, where the condition must evaluate to a single value. In R 4.2.0 and later, using a condition that returns a vector with more than one element results in an error; in earlier versions, it results in a warning message:
```{r eval=FALSE}
if(c(1, 2) == 2) { cat("The vector contains the number 2.\n") }
if(c(2, 1) == 2) { cat("The vector contains the number 2.\n") }
```
In versions of R prior to 4.2.0, only the first element of the `logical` vector is evaluated by `if`, and a warning is issued; in R 4.2.0 and later, both lines result in an error. Usually, if a condition evaluates to a vector, it is because you've made an error in your code. Remember, if you really need to evaluate a condition regarding the elements in a vector, you can collapse the resulting `logical` vector to a single value using `any` or `all`.
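As a brief sketch of how `any` and `all` can be used for this, the conditions below are collapsed to single values and therefore work regardless of the length of `x`:

```{r eval=FALSE}
x <- c(1, 2)

# TRUE if at least one element equals 2:
if(any(x == 2)) { cat("The vector contains the number 2.\n") }

# TRUE only if every element equals 2:
if(all(x == 2))
{
    cat("All elements are equal to 2.\n")
} else
{
    cat("Not all elements are equal to 2.\n")
}
```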
Some texts recommend using the operators `&&` and `||`\index{\texttt{\&\&}}\index{\texttt{$\mid\mid$}} instead of `&` and `|` in conditional statements. These work almost like `&` and `|`, but are meant for conditions consisting of a single `logical` value. I prefer to use `&` and `|`, because I want to be notified if my condition evaluates to a vector - once again, that likely means that there is an error somewhere in my code! (In R 4.3.0 and later, `&&` and `||` also throw an error when given an input with more than one element; older versions used only the first element.)
There is, however, one case where I much prefer `&&` and `||`. `&` and `|` always evaluate all the conditions that you're combining, while `&&` and `||` don't: `&&` stops as soon as it encounters a `FALSE` and `||` stops as soon as it encounters a `TRUE`. Consequently, you can put the conditions you wish to combine in a particular order to make sure that they can be evaluated. For instance, you may want first to check that a variable exists, and then check a property. This can be done using `exists`\index{\texttt{exists}} to check whether or not it exists - note that the variable name must be written within quotes:
```{r eval=FALSE}
# a is a variable that doesn't exist
# Using && works:
if(exists("a") && a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is not positive.") }
# But using & doesn't, because it attempts to evaluate a>0
# even though a doesn't exist:
if(exists("a") & a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is not positive.") }
```
### `ifelse`
It is common that you want to assign different values to a variable depending on whether or not a condition is satisfied:
```{r eval=FALSE}
x <- 2
if(x == 0)
{
reciprocal <- "Error! Division by zero."
} else
{
reciprocal <- 1/x
}
reciprocal
```
In fact, this situation is so common that there is a special function for it: `ifelse`\index{\texttt{ifelse}}:
```{r eval=FALSE}
reciprocal <- ifelse(x == 0, "Error! Division by zero.", 1/x)
```
`ifelse` evaluates a condition and then returns different answers depending on whether the condition is `TRUE` or `FALSE`. It can also be applied to vectors, in which case it checks the condition for each element of the vector and returns an answer for each element:
```{r eval=FALSE}
x <- c(-1, 1, 2, -2, 3)
ifelse(x > 0, "Positive", "Negative")
```
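Returning to the reciprocal example, a vectorised version could look as follows. This is a sketch in which `NA` is used to mark the elements where division by zero would occur, rather than an error message:

```{r eval=FALSE}
x <- c(2, 0, 4)

# The condition is checked separately for each element of x:
reciprocals <- ifelse(x == 0, NA, 1/x)
reciprocals
```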
### `switch`
For the sake of readability, it is usually a good idea to try to avoid chains of the type `if() {} else if() {} else if() {} else {}`. One function that can be useful for this is `switch`\index{\texttt{switch}}, which lets you list a number of possible results, either by position (a number) or by name:
```{r eval=FALSE}
position <- 2
switch(position,
"First position",
"Second position",
"Third position")
name <- "First"
switch(name,
First = "First name",
Second = "Second name",
Third = "Third name")
```
You can for instance use this to decide what function should be applied to your data:
```{r eval=FALSE}
x <- 1:3
y <- c(3, 5, 4)
method <- "nonparametric2"
cor_xy <- switch(method,
parametric = cor(x, y, method = "pearson"),
nonparametric1 = cor(x, y, method = "spearman"),
nonparametric2 = cor(x, y, method = "kendall"))
cor_xy
```
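When switching on a name, it is often useful to have a fallback for names that don't match any of the alternatives. In `switch`, an unnamed final argument plays that role. The function name `cor_method` below is made up for this sketch:

```{r eval=FALSE}
cor_method <- function(method)
{
    switch(method,
           parametric = "pearson",
           nonparametric1 = "spearman",
           nonparametric2 = "kendall",
           "unknown")   # Unnamed last value: used when nothing matches
}

cor_method("nonparametric1")
cor_method("bayesian")
```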
### Failing gracefully
Conditional statements are useful for ensuring that the input to a function you've written is of the correct type. In Section \@ref(willfunctionwork) we saw that our `average` function failed if we applied it to a `character` vector:
```{r eval=FALSE}
average <- function(x)
{
avg <- sum(x)/length(x)
return(avg)
}
average(c("Lady Gaga", "Tool", "Dry the River"))
```
By using a conditional statement, we can provide a more informative error message. We can check that the input is `numeric` and, if it's not, stop the function and print an error message, using `stop`:
```{r eval=FALSE}
average <- function(x)
{
if(is.numeric(x))
{
avg <- sum(x)/length(x)
return(avg)
} else
{
stop("The input must be a numeric vector.")
}
}
average(c(1, 5, 8))
average(c("Lady Gaga", "Tool", "Dry the River"))
```
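One thing to be aware of is that `is.numeric` returns `FALSE` for `logical` vectors, so the function above now rejects input like `c(TRUE, TRUE, FALSE)`, which the original version handled. If you want to allow both types, one option (sketched here under the made-up name `average2`) is to combine two checks using `||`:

```{r eval=FALSE}
average2 <- function(x)
{
    if(is.numeric(x) || is.logical(x))
    {
        avg <- sum(x)/length(x)
        return(avg)
    } else
    {
        stop("The input must be a numeric or logical vector.")
    }
}

average2(c(1, 5, 8))
average2(c(TRUE, TRUE, FALSE))
```

Calling `average2` with a `character` vector still stops with the error message.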
$$\sim$$
```{exercise, label="ch6exc3"}
Which of the following conditions are `TRUE`? First think about the answer, and then check it using R.
```{r eval=FALSE}
x <- 2
y <- 3
z <- -3
```
1. `x > 2`
2. `x > y | x > z`
3. `x > y & x > z`
4. `abs(x*z) >= y`
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions3)
<br>
```{exercise, label="ch6exc4"}
Fix the errors in the following code:
```{r eval=FALSE}
x <- c(1, 2, pi, 8)
# Only compute square roots if x exists
# and contains positive values:
if(exists(x)) { if(x > 0) { sqrt(x) } }
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions4)
## Iteration using loops {#loopsection}
We have already seen how you can use functions to make it easier to repeat the same task over and over. But there is still a part of the puzzle missing - what if, for example, you wish to apply a function to each column of a data frame? What if you want to apply it to data from a number of files, one at a time? The solution to these problems is to use _iteration_\index{iteration}. In this section, we'll explore how to perform iteration using _loops_\index{loop}.
### `for` loops {#forloops}
`for` loops can be used to run the same code several times, with different settings, e.g. different data, in each iteration. Their use is perhaps best explained by some examples. We create the loop using `for`\index{\texttt{for}}\index{loop!\texttt{for}}, give the name of a _control variable_ and a vector containing its values (the control variable controls how many iterations to run) and then write the code that should be repeated in each iteration of the loop. In each iteration, a new value of the control variable is used in the code, and the loop stops when all values have been used.
As a first example, let's write a `for` loop that runs a block of code five times, where the block prints the current iteration number:
```{r eval=FALSE}
for(i in 1:5)
{
cat("Iteration", i, "\n")
}
```
This is equivalent to writing:
```{r eval=FALSE}
cat("Iteration", 1, "\n")
cat("Iteration", 2, "\n")
cat("Iteration", 3, "\n")
cat("Iteration", 4, "\n")
cat("Iteration", 5, "\n")
```
The upside is that we didn't have to copy and edit the same code multiple times - and as you can imagine, this benefit becomes even more pronounced if you have more complicated code blocks.
The values for the control variable are given in a vector, and the code block will be run once for each element in the vector - we say that we _loop over the values in the vector_. The vector doesn't have to be `numeric` - here is an example with a `character` vector:
```{r eval=FALSE}
for(word in c("one", "two", "five hundred and fifty five"))
{
cat("Iteration", word, "\n")
}
```
Of course, loops are used for so much more than merely printing text on the screen. A common use is to perform some computation and then store the result in a vector. In this case, we must first create an empty vector to store the result in, e.g. using `vector`\index{\texttt{vector}}, which creates an empty vector of a specific type and length:
```{r eval=FALSE}
squares <- vector("numeric", 5)
for(i in 1:5)
{
squares[i] <- i^2
}
squares
```
In this case, it would have been both simpler and computationally faster to compute the squared values by running `(1:5)^2`. This is known as a _vectorised_ solution, and is very important in R. We'll discuss vectorised solutions in detail in Section \@ref(vectorloops).
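As a quick check, we can verify that the loop and the vectorised expression give exactly the same result. This is a small self-contained sketch that repeats the loop from above:

```{r eval=FALSE}
squares <- vector("numeric", 5)
for(i in 1:5)
{
    squares[i] <- i^2
}
# Compare the loop result with the vectorised solution:
identical(squares, (1:5)^2)
```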
When creating the values used for the control variable, we often wish to create different sequences of numbers. Two functions that are very useful for this are `seq`\index{\texttt{seq}}, which creates sequences, and `rep`\index{\texttt{rep}}, which repeats patterns:
```{r eval=FALSE}
seq(0, 100)
seq(0, 100, by = 10)
seq(0, 100, length.out = 21)
rep(1, 4)
rep(c(1, 2), 4)
rep(c(1, 2), c(4, 2))
```
Finally, `seq_along`\index{\texttt{seq\_along}} can be used to create a sequence of indices for a vector or a data frame, which is useful if you wish to iterate some code for each element of a vector or each column of a data frame:
```{r eval=FALSE}
seq_along(airquality) # Gives the indices of all columns of the data
# frame
seq_along(airquality$Temp) # Gives the indices of all elements of the
# vector
```
Here is an example of how to use `seq_along` to compute the mean of each column of a data frame:
```{r eval=FALSE}
# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))
# Loop over the variables in airquality:
for(i in seq_along(airquality))
{
means[i] <- mean(airquality[[i]], na.rm = TRUE)
}
# Check that the results agree with those from the colMeans function:
means
colMeans(airquality, na.rm = TRUE)
```
The line inside the loop could have read `means[i] <- mean(airquality[,i], na.rm = TRUE)`, but that would have caused problems if we'd used it with a `data.table` or `tibble` object; see Section \@ref(subsetusingcn).
Finally, we can also change the values of the data in each iteration of the loop. Some machine learning methods require that the data is _standardised_\index{data!standardise}, i.e. that all columns have mean 0 and standard deviation 1. This is achieved by subtracting the mean from each variable and then dividing each variable by its standard deviation. We can write a function for this that uses a loop, changing the values of a column in each iteration:
```{r eval=FALSE}
standardise <- function(df, ...)
{
for(i in seq_along(df))
{
df[[i]] <- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
}
return(df)
}
# Try it out:
aqs <- standardise(airquality, na.rm = TRUE)
colMeans(aqs, na.rm = TRUE) # Non-zero due to floating point
# arithmetic!
sd(aqs$Wind)
```
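Base R also offers the function `scale`, which centres and rescales the columns of a numeric matrix or data frame and ignores `NA` values when computing the means and standard deviations. As a sanity check, the sketch below compares its output with that of our loop-based function. The comparison ignores attributes, because `scale` returns a matrix with some extra attributes rather than a data frame:

```{r eval=FALSE}
standardise <- function(df, ...)
{
    for(i in seq_along(df))
    {
        df[[i]] <- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
    }
    return(df)
}
aqs <- standardise(airquality, na.rm = TRUE)
aqs_scale <- scale(airquality)
# Compare the values, ignoring attributes such as dimension names:
all.equal(as.matrix(aqs), aqs_scale, check.attributes = FALSE)
```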
$$\sim$$
```{exercise, label="ch6exc5"}
Practice writing `for` loops by doing the following:
1. Compute the mean temperature for each month in the `airquality` dataset using a loop rather than an existing function.
2. Use a `for` loop to compute the maximum and minimum value of each column of the `airquality` data frame, storing the results in a data frame.
3. Make your solution to the previous task reusable by writing a function that returns the maximum and minimum value of each column of a data frame.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions5)
<br>
```{exercise, label="ch6exc5b"}
Use `rep` or `seq` to create the following vectors:
1. `0.25 0.5 0.75 1`
2. `1 1 1 2 2 5`
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions5b)
<br>
```{exercise, label="ch6exc6"}
As an alternative to `seq_along(airquality)` and `seq_along(airquality$Temp)`, we could create the same sequences using `1:ncol(airquality)` and `1:length(airquality$Temp)`. Use `x <- c()` to create a vector of length zero. Then create loops that use `seq_along(x)` and `1:length(x)` as values for the control variable. How many iterations does each loop run? Which solution is preferable?
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions6)
<br>
```{exercise, label="ch6exc7"}
An alternative to standardisation is _normalisation_, where all `numeric` variables are rescaled so that their smallest value is 0 and their largest value is 1. Write a function that normalises the variables in a data frame containing `numeric` columns.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions7)
<br>
```{exercise, label="ch6exc10"}
The function `list.files`\index{\texttt{list.files}} can be used to create a vector containing the names of all files in a folder. The `pattern` argument can be used to supply a regular expression describing a file name pattern. For instance, if `pattern = "\\.csv$"` is used, only `.csv` files will be listed.
Create a loop that goes through all `.csv` files in a folder and prints the names of the variables for each file.
```
`r if (knitr::is_html_output()) '[(Click here to go to the solution.)]' else '[]'`(#ch6solutions10)
### Loops within loops {#nestedloops}
In some situations, you'll want to put a loop inside another loop. Such loops are said to be _nested_\index{loop!nested}. An example is if we want to compute the correlation between all pairs of variables in `airquality`, and store the result in a matrix:
```{r eval=FALSE}
cor_mat <- matrix(NA, nrow = ncol(airquality),
ncol = ncol(airquality))
for(i in seq_along(airquality))
{
for(j in seq_along(airquality))
{
cor_mat[i, j] <- cor(airquality[[i]], airquality[[j]],
use = "pairwise.complete")
}
}
# Element [i, j] of the matrix now contains the correlation between
# variables i and j:
cor_mat
```
Once again, there is a vectorised solution to this problem, given by `cor(airquality, use = "pairwise.complete")`. As we will see in Section \@ref(measureperformance), vectorised solutions like this can be several times faster than solutions that use nested loops. In general, solutions involving nested loops tend to be fairly slow - but on the other hand, they are often easy and straightforward to implement.
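Because the correlation between variables `i` and `j` is the same as that between `j` and `i`, roughly half of the calls to `cor` in the nested loop are redundant. The sketch below shows one possible way to exploit this (the name `cor_sym` is introduced here for illustration): the inner loop only runs over `j >= i`, and each result is written to both `[i, j]` and `[j, i]`:

```{r eval=FALSE}
cor_sym <- matrix(NA, nrow = ncol(airquality),
                  ncol = ncol(airquality))
for(i in seq_along(airquality))
{
    # Only loop over the columns from i onwards:
    for(j in i:ncol(airquality))
    {
        r <- cor(airquality[[i]], airquality[[j]],
                 use = "pairwise.complete.obs")
        cor_sym[i, j] <- r
        cor_sym[j, i] <- r
    }
}
# Compare with the vectorised solution (after dropping its names):
all.equal(cor_sym,
          unname(cor(airquality, use = "pairwise.complete.obs")))
```

This reduces the number of calls to `cor`, but the vectorised solution remains the simpler choice when it is available.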
### Keeping track of what's happening {#beepr}
Sometimes each iteration of your loop takes a long time to run, and you'll want to monitor its progress. This can be done using printed messages or a progress bar in the Console panel, or sound notifications. We'll showcase each of these using a loop containing a call to `Sys.sleep`\index{\texttt{Sys.sleep}}, which pauses the execution of R commands for a short time (determined by the user).
First, we can use `cat` to print a message describing the progress. Adding `\r` to the end of a string allows us to print all messages on the same line, with each new message replacing the old one:
```{r eval=FALSE}
# Print each message on a new line:
for(i in 1:5)
{
cat("Step", i, "out of 5\n")
Sys.sleep(1) # Sleep for 1 second
}
# Replace the previous message with the new one:
for(i in 1:5)
{
cat("Step", i, "out of 5\r")
Sys.sleep(1) # Sleep for one second
}
```
Adding a progress bar is a little more complicated, because we must first start the bar by using `txtProgressBar`\index{\texttt{txtProgressBar}} and then update it using `setTxtProgressBar`\index{\texttt{setTxtProgressBar}}:
```{r eval=FALSE}
sequence <- 1:5
pbar <- txtProgressBar(min = 0, max = max(sequence), style = 3)
for(i in sequence)
{
Sys.sleep(1) # Sleep for 1 second
setTxtProgressBar(pbar, i)
}
close(pbar)
```
Finally, the `beepr`\index{\texttt{beepr}} package^[Arguably the best add-on package for R.] can be used to play sounds, with the function `beep`\index{\texttt{beep}}:
```{r eval=FALSE}
install.packages("beepr")
library(beepr)
# Play all 11 sounds available in beepr:
for(i in 1:11)
{
beep(sound = i)
Sys.sleep(2) # Sleep for 2 seconds
}
```
### Loops and lists {#lists}
In our previous examples of loops, it has always been clear from the start how many iterations the loop should run and what the length of the output vector (or data frame) should be. This isn't always the case. To begin with, let's consider the case where the length of the output is unknown or difficult to know in advance. Let's say that we want to go through the `airquality` data to find days that are extreme in the sense that at least one variable attains its maximum on those days. That is, we wish to find the index of the maximum of each variable, and store them in a vector. Because several days can have the same temperature or wind speed, there may be more than one such maximal index for each variable. For that reason, we don't know the length of the output vector in advance.
In such cases, it is usually a good idea to store the result from each iteration in a `list` (Section \@ref(lists2)), and then collect the elements from the list once the loop has finished. We can create an empty list with one element for each variable in `airquality` using `vector`:
```{r eval=FALSE}
# Create an empty list with one element for each variable in
# airquality:
max_list <- vector("list", ncol(airquality))
# Naming the list elements will help us see which variable the maximal
# indices belong to:
names(max_list) <- names(airquality)
# Loop over the variables to find the maxima:
for(i in seq_along(airquality))
{
# Find indices of maximum values:
max_index <- which(airquality[[i]] == max(airquality[[i]],
na.rm = TRUE))
# Add indices to list:
max_list[[i]] <- max_index
}
# Check results:
max_list
# Collapse to a vector:
extreme_days <- unlist(max_list)
```
(In this case, only the variables `Month` and `Day` have duplicate maximum values.)
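To see how many maximal indices were found for each variable, we can count the elements of each list entry using `lengths`. The sketch below rebuilds the list in a compact way using `lapply`, which is used here only to keep the example self-contained; the loop above produces the same list:

```{r eval=FALSE}
max_list <- lapply(airquality,
                   function(v) which(v == max(v, na.rm = TRUE)))
# Number of maximal indices found for each variable:
lengths(max_list)
```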
### `while` loops {#whileloop}
In some situations, we want to run a loop until a certain condition is met, meaning that we don't know in advance how many iterations we'll need. This is more common in numerical optimisation and simulation, but sometimes also occurs in data analyses.
When we don't know in advance how many iterations are needed, we can use `while` loops\index{\texttt{while}}\index{loop!\texttt{while}}. Unlike `for` loops, which iterate a fixed number of times, `while` loops keep iterating as long as some specified condition is met. Here is an example where the loop keeps iterating until `i` squared is greater than 100:
```{r eval=FALSE}
i <- 1
while(i^2 <= 100)
{
cat(i, "squared is", i^2, "\n")
i <- i + 1
}
i
```
The code block inside the loop keeps repeating until the condition `i^2 <= 100` is no longer satisfied. We have to be a little bit careful with this condition - if we set it in such a way that the condition can _always_ be satisfied, the loop will just keep running and running - creating what is known as an _infinite loop_\index{infinite loop|see {infinite loop}}. If you've accidentally created an infinite loop, you can break it by pressing the Stop button at the top of the Console panel in RStudio.
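A common safeguard against infinite loops is to add a cap on the number of iterations to the condition. The following is a minimal sketch of that idea; the names `iter` and `max_iter`, the tolerance `1e-6` and the cap of 1000 iterations are all arbitrary choices made for this illustration:

```{r eval=FALSE}
x <- 1
iter <- 0
max_iter <- 1000
# Keep halving x until it is small enough, but never run
# more than max_iter iterations:
while(x > 1e-6 && iter < max_iter)
{
    x <- x/2
    iter <- iter + 1
}
iter
x
```

If the loop ends with `iter` equal to `max_iter`, the cap was reached before the main condition was fulfilled, which is usually a sign that something needs a closer look.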
In Section \@ref(rle) we saw how `rle` can be used to find and compute the lengths of runs of equal values in a vector. We can use nested `while` loops to create something similar. `while` loops are a good choice here, because we don't know how many runs are in the vector in advance. Here is an example, which you'll study in more detail in Exercise \@ref(exr:ch6exc8):
```{r eval=FALSE}
# Create a vector of 0's and 1's:
x <- rep(c(0, 1, 0, 1, 0), c(5, 1, 4, 2, 7))
# Create empty vectors where the results will be stored:
run_values <- run_lengths <- c()
# Set the initial condition:
i <- 1
# Iterate over the entire vector:
while(i < length(x))