data-structure-basics.Rmd

# Vectors, lists, and tibbles

```{r message = FALSE, warning = FALSE}
library(tidyverse)
```

In R, vectors are the most common data structure. In this book, we'll often represent vectors like this:

```{r echo=FALSE}
knitr::include_graphics(
  "images/data-structure-basics/vector.png", dpi = image_dpi
)
```

Each orange cell represents one element of the vector. As you'll see, different kinds of vectors can hold different kinds of elements.

There are two kinds of vectors: _atomic vectors_ and _lists_. _Tibbles_ are a specific kind of list.

```{r echo=FALSE}
knitr::include_graphics(
  "images/data-structure-basics/vector-atomic-list-tibble.png", 
  dpi = image_dpi
)
```

In this chapter, we'll cover these three data structures, explaining how they differ and showing you how to manipulate each one.

## Atomic vectors

Atomic vectors are the "atoms" of R---the simple building blocks upon which all else is built. There are four types of atomic vector that are important for data analysis:

* __integer__ vectors (`<int>`) contain integers.
* __double__ vectors (`<dbl>`) contain real numbers. 
* __character__ vectors (`<chr>`) contain strings made with `""`.
* __logical__ vectors (`<lgl>`) contain `TRUE` or `FALSE`.

Integer atomic vectors contain only integers, double atomic vectors contain only doubles, and so on. Together, integer and double vectors are known as _numeric_ vectors. All vectors can also contain the missing value `NA`. 

In R, single numbers, logicals, and strings are just atomic vectors of length 1, so

```{r}
x <- "a"
```

creates a character vector. Likewise, 

```{r}
y <- 1.1
```

creates a double vector.

To create atomic vectors with more than one element, use `c()` to combine values.

```{r}
logical <- c(TRUE, FALSE, FALSE)

double <- c(1.5, 2.8, pi)

character <- c("this", "is", "a character", "vector")
```

To create an integer vector by hand, you'll need to add `L` to the end of each number.

```{r}
integer <- c(1L, 2L, 3L)
```

Without the `L`s, R will create a vector of doubles.

### Properties

Vectors (both atomic vectors and lists) all have two key properties: _type_ and _length_.

You can check the type of any vector with `typeof()`.

```{r}
typeof(c(1.5, 2.8, pi))

typeof(c(1L, 3L, 4L))
```

Use `length()` to find a vector's length.

```{r}
length(c("a", "b", "c"))
```

Vector can also have named elements. 

```{r}
v_named <- c(guava = 2, pineapple = 4, dragonfruit = 1)
v_named
```

You can access a vector's names with `names()`.

```{r}
names(v_named)
```

### Subsetting

`v` is an atomic vector of doubles. 

```{r}
v <- c(1, 2.7, 4, 5)
```

You can _subset_ `v` to create a new vector with selected elements, ignoring the others.

#### The `[` operator {-}

Here are four ways to subset `v` using the `[` operator.

##### Positive integers {-}

Subset with a vector of positive integers to extract elements by position. 

```{r}
v[c(1, 2)]
```

Note that, in R, indices start at 1, not 0, so the above code extracts the first two elements of `v`.

You can also use `:` to create a vector of adjacent integers. The following select the first three elements of `v`.

```{r}
v[1:3]
```

##### Negative integers {-}

Subset with a vector of negative integers to exclude elements. The following code removes the first and third elements of `v`.

```{r}
v[-c(1, 3)]
```

##### Names {-}

If a vector has names, you can subset with a character vector. 

```{r}
v_named[c("guava", "dragonfruit")]
```

##### Logical vectors {-}

If you supply a vector of `TRUE`s and `FALSE`s, `[` will select the elements that correspond to the `TRUE`s. 

```{r echo=FALSE}
knitr::include_graphics("images/vector-subset.png", dpi = image_dpi)
```

The following extracts just the first and third elements. 

```{r}
v[c(TRUE, FALSE, TRUE, FALSE)]
```

You'll rarely subset by typing out `TRUE`s and `FALSE`s. Instead, you'll typically create a logical vector with a function or condition.

For example, the following code selects just the elements of `v` greater than 2.

```{r}
v[v > 2]
```

`v > 2` results in a logical vector the same length as `v`.

```{r}
v > 2
```

`[` then uses this logical vector to subset `v`, resulting in just the elements of `v` greater than 2. 

`v_missing` has `NA`s.

```{r}
v_missing <- c(1.1, NA, 5, 6, NA)
```

We can pass `!is.na(v_missing)` into `[` to extract out just the non-`NA` elements.

```{r}
v_missing[!is.na(v_missing)]
```

#### The `[[` operator {-}

You can extract single elements from a vector with the `[[` operator.

```{r}
v[[2]]
```

You'll get an error if you try to use `[[` to select more than one element.

```{r error=TRUE}
v[[2:3]]
```

For named vectors, the `[` operator creates a subset of the original vector with the specified elements.

```{r}
v_named["guava"]
```

The `[[` operator returns only the extracted element.

```{r}
v_named[["guava"]]
```

As you'll see in the [Lists](#lists) section that the distinction between `[` and `[[` becomes more important with lists and tibbles. 

### Applying functions

Vectors are central to programming in R, and many R functions are designed to work with vectors. 

You already saw how to call `typeof()` to return the type of a vector. 

```{r}
typeof(v)
```

`sum()` sums a vector's elements.

```{r}
sum(v)
```

You can use `sum()` with both numeric (i.e., double and integer) vectors, as well as with logical vectors.

```{r}
logical <- c(TRUE, TRUE, FALSE)

sum(logical)
```

When applied to a logical vector, `sum()` returns the number of `TRUE`s.

`mean()` works similarly.

```{r}
mean(v)

mean(logical)
```

The [Base R Cheat Sheet](https://github.com/dcl-docs/prog/blob/master/data/data-structure-basics/Cheat_sheet_base_R.pdf) has some other basic helpful functions, particularly under the _Vector Functions_ and _Math Functions_ sections. 

### Augmented vectors

Augmented vectors are atomic vectors with additional metadata. There are four important augmented vectors:

* __factors__ `<fct>`, which are used to represent categorical variables can take
  one of a fixed and known set of possible values (called the levels).
  
* __ordered factors__ `<ord>`, which are like factors but where the levels have an
  intrinsic ordering (i.e. it's reasonable to say that one level is "less than"
  or "greater than" another level).
  
* __dates__ `<dt>`, record a date.

* __date-times__ `<dttm>`, which are also known as POSIXct, record a date
  and a time.

For now, you just need to recognize these when you encounter them. You'll learn how to create each type of augmented vector later in the course.

## Lists

Unlike atomic vectors, which can only contain a single type, lists can contain any collection of R objects. 

### Basics

The following reading will introduce you to lists:

* [Recursive vectors (lists)](https://r4ds.had.co.nz/vectors.html#lists)

### Flattening

You can flatten a list into an atomic vector with `unlist()`.

```{r}
y <- list(1, 2, 4)
y
```

```{r}
unlist(y)
```

`unlist()` returns an atomic vector even if the original list contains other lists or vectors.

```{r}
z <- list(c(1, 2), c(3, 4))
z
```

```{r}
unlist(z)
```

## Tibbles

Tibbles are actually lists. 

```{r}
typeof(mpg)
```

Every tibble is a named list of vectors, each of the same length.

```{r echo=FALSE}
knitr::include_graphics(
  "images/data-structure-basics/tibble.png", 
  dpi = image_dpi
)
```

These vectors form the tibble columns.

Take the tibble `mpg`.

```{r}
mpg
```

Each variable in `mpg` (`manufacturer`, `model`, `displ`, etc.) is a vector. `manufacturer` is a character vector, `displ` is a double vector, and so on. Each vector has length `r nrow(ggplot2::mpg)`.

### Creation

There are two ways to create tibbles by hand. First, you can use `tibble()`. 

```{r}
my_tibble <- 
  tibble(
  x = c(1, 9, 5),
  y = c(TRUE, FALSE, FALSE),
  z = c("apple", "pear", "banana")
)

my_tibble
```

`tibble()` takes individual vectors and turns them into a tibble. 

Second, you can use `tribble()`.

```{r}
my_tibble <- 
  tribble(
    ~x,  ~y,    ~z,
     1,  TRUE,  "apple",
     9,  FALSE, "pear",
     5,  FALSE, "banana"
  )
    
my_tibble
```

For small datasets, `tribble()` is likely to be less error prone, since the values for each row are next to each other.

### Subsetting

There are several ways to extract variables out of tibbles. Tibbles are lists, so `[[` and `$` work.

```{r}
my_tibble[["x"]]

my_tibble$x
```

Use `pull()`, the dplyr equivalent, when you want to use a pipe.

```{r}
my_tibble %>% 
  pull(x)
```

Note that `pull()`, like `[[` and `$`, returns just the vector of values for a given variable.

`[` also works with tibbles.

```{r}
my_tibble["x"]
```

For a pipe, use the dplyr function `select()`.

```{r}
my_tibble %>% 
  select(x)
```

Whereas `pull()` returns a vector, `select()` returns a tibble.

### Dimensions

Printing a tibble tells you the column names and overall dimensions.

```{r}
mpg
```

To access the dimensions directly, you have three options:

```{r}
dim(diamonds)

nrow(diamonds)

ncol(diamonds)
```

To get the variable names, use `names()`:

```{r}
names(diamonds)
```