diff --git a/02-starting-with-data.Rmd b/02-starting-with-data.Rmd index 044a9097..8125b07a 100644 --- a/02-starting-with-data.Rmd +++ b/02-starting-with-data.Rmd @@ -24,14 +24,14 @@ source("setup.R") ------------ -## Presentation of the Survey Data +## Loading the survey data ```{r, echo=FALSE, purl=TRUE} -### Presentation of the survey data +### Loading the survey data ``` -We are investigating the animal species diversity and weights found within plots at our study -site. The dataset is stored as a comma separated value (CSV) file. +We are investigating the animal species diversity and weights found within plots +at our study site. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent: | Column | Description | @@ -50,66 +50,102 @@ Each row holds information for a single animal, and the columns represent: | taxon | e.g. Rodent, Reptile, Bird, Rabbit | | plot\_type | type of plot | +### Downloading the data + We are going to use the R function `download.file()` to download the CSV file -that contains the survey data from Figshare, and we will use `read.csv()` to -load into memory the content of the CSV file as an object of class `data.frame`. -Inside the download.file command, the first entry is a character string with the source URL -("https://ndownloader.figshare.com/files/2292169"). This source URL downloads a CSV file from -figshare. The text after the comma ("data_raw/portal_data_joined.csv") is the destination of the -file on your local machine. You'll need to have a folder on your machine called "data_raw" where -you'll download the file. So this command downloads a file from Figshare, names it -"portal_data_joined.csv" and adds it to a preexisting folder named "data_raw". +that contains the survey data from Figshare, and we will use `read_csv()` to +load the content of the CSV file into R. 
+ +Inside the `download.file` command, the first entry is a character string with the +source URL ("https://ndownloader.figshare.com/files/2292169"). +This source URL downloads a CSV file from figshare. The text after the comma +("data_raw/portal_data_joined.csv") is the destination of the file on your local +machine. You'll need to have a folder on your machine called "data_raw" where +you'll download the file. So this command downloads a file from Figshare, names +it "portal_data_joined.csv" and adds it to a preexisting folder named "data_raw". ```{r, eval=FALSE, purl=TRUE} download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "data_raw/portal_data_joined.csv") ``` + +### Reading the data into R + The file has now been downloaded to the destination you specified, but R has not yet loaded the data from the file into memory. To do this, we can use the -`read.csv()` function: +`read_csv()` function from the **`tidyverse`** package. + +Packages in R are basically sets of additional functions that let you do more +stuff. The functions we've been using so far, like `round()`, `sqrt()`, or `c()`, +come built into R; packages give you access to additional functions. +Before you use a package for the first time you need to install it on your +machine, and then you should import it in every subsequent R session when you +need it. + +To install the **`tidyverse`** package, we can type +`install.packages("tidyverse")` straight into the console. In fact, it's better +to write this in the console than in our script for any package, as there's no +need to re-install packages every time we run the script. +Then, to load the package type: + +```{r, message = FALSE, purl = FALSE} +## load the tidyverse packages, incl. dplyr +library(tidyverse) +``` + +Now we can use the functions from the **`tidyverse`** package. 
+Let's use `read_csv()` to read the data into a data frame +(we will learn more about data frames later): ```{r, eval=TRUE, purl=FALSE} -surveys <- read.csv("data_raw/portal_data_joined.csv") +surveys <- read_csv("data_raw/portal_data_joined.csv") ``` -This statement doesn't produce any output because, as you might recall, -assignments don't display anything. If we want to check that our data has been -loaded, we can see the contents of the data frame by typing its name: `surveys`. +You will see the message `Parsed with column specification`, followed by each +column name and its data type. +When you execute `read_csv` on a data file, it looks through the first 1000 rows +of each column and guesses its data type. For example, in this dataset, +`read_csv()` reads `weight` as `col_double` (a numeric data type), and `species` +as `col_character`. You have the option to specify the data type for a column +manually by using the `col_types` argument in `read_csv`. + +We can see the contents of the first few lines of the data by typing its +name: `surveys`. By default, this will show you as many rows and columns of +the data as fit on your screen. +If you wanted to see the first 50 rows, you could type `print(surveys, n = 50)`. -Wow... that was a lot of output. At least it means the data loaded -properly. Let's check the top (the first 6 lines) of this data frame using the -function `head()`: +We can also extract the first few lines of this data using the function +`head()`: ```{r, results='show', purl=FALSE} head(surveys) ``` -There is a similar function which lets you view the last few lines of the data -set. It is called (you might have guessed it) `tail()`. +Unlike the `print()` function, `head()` returns the extracted data. You could +use it to assign the first 100 rows of `surveys` to an object using +`surveys_sample <- head(surveys, 100)`. 
This can be useful if you want to try +out complex computations on a subset of your data before you apply them to the +whole data set. +There is a similar function that lets you extract the last few lines of the data +set. It is called (you might have guessed it) `tail()`. -To open the data frame in RStudio's Data Viewer, use the `View()` function: +To open the dataset in RStudio's Data Viewer, use the `view()` function: ```{r, eval = FALSE, purl = FALSE} -View(surveys) # note the capital V! +view(surveys) ``` > ### Note > -> `read.csv` assumes that fields are delineated by commas, however, in several +> `read_csv()` assumes that fields are delineated by commas, however, in several > countries, the comma is used as a decimal separator and the semicolon (;) is > used as a field delineator. If you want to read in this type of files in R, -> you can use the `read.csv2` function. It behaves exactly like `read.csv` but -> uses different parameters for the decimal and the field separators. If you are -> working with another format, they can be both specified by the user. Check out -> the help for `read.csv()` by typing `?read.csv` to learn more. There is also the `read.delim()` for -> in tab separated data files. It is important to note that all of these functions -> are actually wrapper functions for the main `read.table()` function with different arguments. -> As such, the surveys data above could have also been loaded by using `read.table()` -> with the separation argument as `,`. The code is as follows: -> `surveys <- read.table(file="data_raw/portal_data_joined.csv", sep=",", header=TRUE)`. -> The header argument has to be set to TRUE to be able to read the headers as -> by default `read.table()` has the header argument set to FALSE. +> you can use the `read_csv2()` function. It behaves like `read_csv()` but +> uses different parameters for the decimal and the field separators. 
+> There is also the `read_tsv()` function for tab-separated data files and `read_delim()` +> for less common formats. +> Check out the help for `read_csv()` by typing `?read_csv` to learn more. > > In addition to the above versions of the csv format, you should develop the habits > of looking at and record some parameters of your csv files. For instance, @@ -120,12 +156,15 @@ View(surveys) # note the capital V! ## What are data frames? +When we loaded the data into R, it got stored as an object of class `tibble`, +which is a special kind of data frame (the difference is not important for our +purposes, but you can learn more about tibbles +[here](https://tibble.tidyverse.org/)). Data frames are the _de facto_ data structure for most tabular data, and what we use for statistics and plotting. - -A data frame can be created by hand, but most commonly they are generated by the -functions `read.csv()` or `read.table()`; in other words, when importing -spreadsheets from your hard drive (or the web). +Data frames can be created by hand, but most commonly they are generated by +functions like `read_csv()`; in other words, when importing +spreadsheets from your hard drive or the web. A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are @@ -134,16 +173,14 @@ factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector. ![](./img/data-frame.svg) - - -We can see this when inspecting the structure of a data frame +We can also see this when inspecting the structure of a data frame with the function `str()`: ```{r, purl=FALSE} str(surveys) ``` - -## Inspecting `data.frame` Objects + +## Inspecting data frames We already saw how the functions `head()` and `str()` can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of @@ -180,7 +217,6 @@ objects besides `data.frame`. 
> > * What is the class of the object `surveys`? > * How many rows and how many columns are in this object? -> * How many species have been recorded during these surveys? > > ```{r, answer=TRUE, results="markup", purl=FALSE} > @@ -188,7 +224,6 @@ objects besides `data.frame`. > > ## * class: data frame > ## * how many rows: 34786, how many columns: 13 -> ## * how many species: 48 > > ``` @@ -199,7 +234,6 @@ objects besides `data.frame`. ## Based on the output of `str(surveys)`, can you answer the following questions? ## * What is the class of the object `surveys`? ## * How many rows and how many columns are in this object? -## * How many species have been recorded during these surveys? ``` @@ -324,14 +358,34 @@ In RStudio, you can use the autocompletion feature to get the full and correct n When we did `str(surveys)` we saw that several of the columns consist of integers. The columns `genus`, `species`, `sex`, `plot_type`, ... however, are -of a special class called `factor`. Factors are very useful and actually -contribute to making R particularly well suited to working with data. So we are -going to spend a little time introducing them. +of the class `character`. +Arguably, these columns contain categorical data, that is, they can only take on +a limited number of values. -Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings. +R has a special class for working with categorical data, called `factor`. +Factors are very useful and actually contribute to making R particularly well +suited to working with data. So we are going to spend a little time introducing +them. Once created, factors can only contain a pre-defined set of values, known as -*levels*. By default, R always sorts levels in alphabetical order. For +*levels*. 
+Factors are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings. + +When importing a data frame with `read_csv()`, the columns that contain text are not automatically coerced (=converted) into the `factor` data type, but once we have +loaded the data we can do the conversion using the `factor()` function: + +```{r, purl=FALSE} +surveys$sex <- factor(surveys$sex) +``` + +We can see that the conversion has worked by using the `summary()` +function again. This produces a table with the counts for each factor level: + +```{r, purl=FALSE} +summary(surveys$sex) +``` + +By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels: ```{r, purl=TRUE} @@ -366,6 +420,40 @@ be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset). + +> ### Challenge +> +> 1 Change the columns `taxa` and `genus` in the `surveys` data frame into a +> factor. +> +> 2 Using the functions you learned before, can you find out... +> +> * How many rabbits were observed? +> * How many different genera are in the `genus` column? +> +> ```{r, answer=TRUE, purl=FALSE} +> surveys$genus <- factor(surveys$genus) +> summary(surveys) +> nlevels(surveys$genus) +> +> ## * how many genera: There are 26 unique genera in the `genus` column. +> ## * how many rabbits: There are 75 rabbits in the `taxa` column. +> ``` + +```{r, echo=FALSE, purl=TRUE} +### Challenges: +### +### 1 Change the columns `taxa` and `genus` in the `surveys` data frame into a +### factor. +### +### 2 Using the functions you learned before, can you find out... +### +### * How many rabbits were observed? 
+### * How many different genera are in the `genus` column? + +``` + + ### Converting factors If you need to convert a factor to a character vector, you use @@ -411,25 +499,32 @@ the experiment: ```{r, purl=TRUE} ## bar plot of the number of females and males captured during the experiment: -plot(as.factor(surveys$sex)) +plot(surveys$sex) ``` -In addition to males and females, there are about 1700 individuals for which the -sex information hasn't been recorded. Additionally, for these individuals, -there is no label to indicate that the information is missing or undetermined. Let's rename this -label to something more meaningful. Before doing that, we're going to pull out -the data on sex and work with that data, so we're not modifying the working copy -of the data frame: +However, as we saw when we used `summary(surveys$sex)`, there are about 1700 +individuals for which the sex information hasn't been recorded. To show them in +the plot, we can turn the missing values into a factor level with the +`addNA()` function. We will also have to give the new factor level a label. +We are going to work with a copy of the `sex` column, so we're not modifying the +working copy of the data frame: ```{r, results=TRUE, purl=FALSE} -sex <- factor(surveys$sex) -head(sex) +sex <- surveys$sex levels(sex) -levels(sex)[1] <- "undetermined" +sex <- addNA(sex) +levels(sex) +head(sex) +levels(sex)[3] <- "undetermined" levels(sex) head(sex) ``` +Now we can plot the data again, using `plot(sex)`. + +```{r echo=FALSE, purl=FALSE, results=TRUE} +plot(sex) +``` > ### Challenge > @@ -438,7 +533,7 @@ head(sex) > barplot such that "undetermined" is last (after "male")? 
> > ```{r, answer=TRUE, purl=FALSE} -> levels(sex)[2:3] <- c("female", "male") +> levels(sex)[1:2] <- c("female", "male") > sex <- factor(sex, levels = c("female", "male", "undetermined")) > plot(sex) > ``` @@ -453,32 +548,9 @@ head(sex) ``` -### Using `stringsAsFactors = FALSE` - -By default, when building or importing a data frame, the columns that contain -characters (i.e. text) are coerced (= converted) into factors. Depending on what you want to do with the data, you may want to keep these -columns as `character`. To do so, `read.csv()` and `read.table()` have an -argument called `stringsAsFactors` which can be set to `FALSE`. - -In most cases, it is preferable to set `stringsAsFactors = FALSE` when importing -data and to convert as a factor only the columns that require this data -type. - - -```{r, eval=FALSE, purl=TRUE} -## Compare the difference between our data read as `factor` vs `character`. -surveys <- read.csv("data_raw/portal_data_joined.csv", stringsAsFactors = TRUE) -str(surveys) -surveys <- read.csv("data_raw/portal_data_joined.csv", stringsAsFactors = FALSE) -str(surveys) -## Convert the column "plot_type" into a factor -surveys$plot_type <- factor(surveys$plot_type) -``` - - > ### Challenge > -> 1. We have seen how data frames are created when using `read.csv()`, but +> 1. We have seen how data frames are created when using `read_csv()`, but > they can also be created by hand with the `data.frame()` function. There are > a few mistakes in this hand-crafted `data.frame`. Can you spot and fix them? > Don't hesitate to experiment! @@ -505,7 +577,6 @@ surveys$plot_type <- factor(surveys$plot_type) > 2. Can you predict the class for each of the columns in the following example? > Check your guesses using `str(country_climate)`: > * Are they what you expected? Why? Why not? -> * What would have been different if we had added `stringsAsFactors = FALSE` when creating the data frame? 
> * What would you need to change to ensure that each column had the accurate data type? > > ```{r, eval=FALSE, purl=FALSE} @@ -524,8 +595,6 @@ surveys$plot_type <- factor(surveys$plot_type) > ## example? > ## Check your guesses using `str(country_climate)`: > ## * Are they what you expected? Why? why not? -> ## * What would have been different if we had added `stringsAsFactors = FALSE` -> ## when we created this data frame? > ## * What would you need to change to ensure that each column had the > ## accurate data type? > country_climate <- data.frame(country = c("Canada", "Panama", "South Africa", "Australia"), @@ -540,9 +609,8 @@ surveys$plot_type <- factor(surveys$plot_type) > * missing one entry in the `feel` column (probably for one of the furry animals) > * missing one comma in the `weight` column > * `country`, `climate`, `temperature`, and `northern_hemisphere` are -> factors; `has_kangaroo` is numeric -> * using `stringsAsFactors = FALSE` would have made character vectors instead of -> factors +> characters; `has_kangaroo` is numeric +> * using `factor()`, one could replace character columns with factor columns > * removing the quotes in `temperature` and `northern_hemisphere` and replacing 1 > by TRUE in the `has_kangaroo` column would give what was probably > intended @@ -557,7 +625,7 @@ entry (for instance, a letter in a column that should only contain numbers). Learn more in this [RStudio tutorial](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio) -## Formatting Dates +## Formatting dates One of the most common issues that new (and experienced!) 
R users have is converting date and time information into a variable that is appropriate and diff --git a/03-dplyr.Rmd b/03-dplyr.Rmd index 33788b1a..7f2d73b0 100644 --- a/03-dplyr.Rmd +++ b/03-dplyr.Rmd @@ -32,21 +32,14 @@ suppressWarnings(surveys$date <- lubridate::ymd(paste(surveys$year, ------------ -# Data Manipulation using **`dplyr`** and **`tidyr`** +# Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. Enter **`dplyr`**. **`dplyr`** is a package for making tabular data manipulation easier. It pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis. -Packages in R are basically sets of additional functions that let you do more -stuff. The functions we've been using so far, like `str()` or `data.frame()`, -come built into R; packages give you access to more of them. Before you use a -package for the first time you need to install it on your machine, and then you -should import it in every subsequent R session when you need it. You should -already have installed the **`tidyverse`** package. This is an -"umbrella-package" that installs several packages useful for data analysis which -work together well such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. - +The **`tidyverse`** package is an +"umbrella-package" that installs **`tidyr`**, **`dplyr`**, and several other packages useful for data analysis, such as **`ggplot2`**, **`tibble`**, etc. The **`tidyverse`** package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R: @@ -57,19 +50,8 @@ doing data analysis with some of the functions that come with R: 3. Hidden arguments, having default operations that new learners are not aware of. 
-We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the `factor` data type. We had to set **`stringsAsFactors`** to **`FALSE`** to avoid this hidden argument to convert our data type. - -This time we will use the **`tidyverse`** package to read the data and avoid having to set **`stringsAsFactors`** to **`FALSE`** - -If we haven't already done so, we can type `install.packages("tidyverse")` straight into the console. In fact, it's better to write this in the console than in our script for any package, as there's no need to re-install packages every time we run the script. - -Then, to load the package type: - - -```{r, message = FALSE, purl = FALSE} -## load the tidyverse packages, incl. dplyr -library(tidyverse) -``` +You should already have installed and loaded the **`tidyverse`** package. +If you haven't already done so, you can type `install.packages("tidyverse")` straight into the console. Then, to load the package, type `library(tidyverse)`. ## What are **`dplyr`** and **`tidyr`**? @@ -99,19 +81,14 @@ To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want [handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf). -We'll read in our data using the `read_csv()` function, from the tidyverse package **`readr`**, instead of `read.csv()`. +As before, we'll read in our data using the `read_csv()` function from the +tidyverse package **`readr`**. ```{r, results = 'hide', purl = FALSE} surveys <- read_csv("data_raw/portal_data_joined.csv") ``` -You will see the message `Parsed with column specification`, followed by each column name and its data type. 
-When you execute `read_csv` on a data file, it looks through the first 1000 rows of each column and -guesses the data type for each column as it reads it into R. For example, in this dataset, `read_csv` -reads `weight` as `col_double` (a numeric data type), and `species` as `col_character`. You have the -option to specify the data type for a column manually by using the `col_types` argument in `read_csv`. - ```{r, results = 'hide', purl = FALSE} ## inspect the data str(surveys) @@ -119,22 +96,10 @@ str(surveys) ```{r, eval=FALSE, purl=FALSE} ## preview the data -View(surveys) +view(surveys) ``` -Notice that the class of the data is now `tbl_df` - -This is referred to as a "tibble". Tibbles tweak some of the behaviors of the data frame objects we -introduced in the previous episode. The data structure is very similar to a data frame. For our -purposes the only differences are that: - -1. In addition to displaying the data type of each column under its name, it - only prints the first few rows of data and only as many columns as fit on one - screen. -2. Columns of class `character` are never converted into factors. 
- - -We're going to learn some of the most common **`dplyr`** functions: +Next, we're going to learn some of the most common **`dplyr`** functions: - `select()`: subset columns - `filter()`: subset rows on conditions diff --git a/reference.md b/reference.md index 09254602..4c008f0b 100644 --- a/reference.md +++ b/reference.md @@ -18,10 +18,10 @@ Cheat sheet of functions used in the lessons ## Lesson 2 -- Starting with data * `download.file() ` # download files from the internet to your computer - * `read.csv() ` # load CSV file into R memory + * `read_csv() ` # load CSV file into R memory * `head() ` # shows the first 6 rows - * `View()` # invoke a spreadsheet-style data viewer - * `read.table()` # load a file in table format into R memory + * `view()` # invoke a spreadsheet-style data viewer + * `read_delim()` # load a file in table format into R memory * `str() ` # check structure of the object and information about the class, length and content of each column * `dim() ` # check dimension of data frame * `nrow() ` # returns the number of rows @@ -38,15 +38,15 @@ Cheat sheet of functions used in the lessons * `as.numeric(as.character(x))` # convert factors where the levels appear as characters to a numeric vector * `as.numeric(levels(x))[x]` # convert factors where the levels appear as numbers to a numeric vector * `plot()` # plot an object + * `addNA()` # convert NA into a factor level * `data.frame()` # create a data.frame object * `ymd()` # convert a vector representing year, month, and day to a Date vector * `paste()` # concatenate vectors after converting to character ## Lesson 3 -- Manipulating, analyzing and exporting data with tidyverse - * `read_csv()` # load a csv formatted file into R memory * `str()` # check structure of the object and information about the class, length and content of each column - * `View()` # invoke a spreadsheet-style data viewer + * `view()` # invoke a spreadsheet-style data viewer * `select() ` # select columns of a data frame * 
`filter() ` # allows you to select a subset of rows in a data frame * `%>% ` # pipes to select and filter at the same time