Merge pull request #1070 from ethanwhite/aggregation-cleanup
Aggregation and join lecture cleanup
ethanwhite authored Sep 10, 2024
2 parents 939b8e2 + 48e2e77 commit d540ba5
Showing 2 changed files with 39 additions and 31 deletions.
47 changes: 30 additions & 17 deletions materials/converting-dataframes-vectors.md
@@ -23,7 +23,8 @@ download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-

* These two ways of storing data are related to one another
* A data frame is a set of equal-length vectors that are grouped together
* So, we can extract vectors from data frames and we can also make data frames from vectors
* This is why when using `mutate()` and `summarize()` we can use any function that works on a vector
* As a result we can extract vectors from data frames and make data frames from vectors
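
A minimal sketch of the idea above, using a small made-up data frame (the `shrubs` name and its values are illustrative assumptions, not the lecture's data). It shows that every column is a vector of the same length and that a vector function such as `mean()` works directly on a column.

```r
# Hypothetical example data; names and values are invented for illustration
shrubs <- data.frame(site = c(1, 1, 2), height = c(2.5, 3.1, 1.8))

# A data frame is a list of columns, and each column is a vector
is.list(shrubs)
is.vector(shrubs$height)

# All columns have the same length (one entry per row)
col_lengths <- lengths(shrubs)
col_lengths

# Any function that works on a vector works on a column
avg_height <- mean(shrubs$height)
avg_height
```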

### Extracting vectors from data frames

@@ -35,20 +36,10 @@ download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-
species <- read_csv("species.csv")
```

* One common approach to extracting a column into a vector is to use `$`
* The `$` in R is shorthand for `[[]]` in cases where the piece we want to get has a name
* So, we start with the object we want a part of, our `species` data frame
* Then the `$` with no spaces around it
* and then the name of the `species_id` column (without quotes, just to be confusing)

```r
species$species_id
```

* We can also do this using `[]`
* Remember that `[]` also means "give me a piece of something"
* We do this using `[]`
* Remember that `[]` also means "give me a piece of something" in R
* Let's get the `species_id` column
* `"species_id"` has to be in quotes because we aren't using `dplyr`
* `"species_id"` has to be in quotes because we aren't using the tidyverse

```r
species["species_id"]
@@ -62,6 +53,16 @@ species["species_id"]
species[["species_id"]]
```

* We can also use the `$`
* Shorthand for `[[]]` in cases where the piece of something we want to get has a name
* So, we start with the object we want a part of, our `species` data frame
* Then the `$` with no spaces around it
* and then the name of the `species_id` column (without quotes, just to be confusing)

```r
species$species_id
```

* Finally, `dplyr` has a function called `pull()`

```r
@@ -76,6 +77,8 @@ species |>
pull(species_id)
```
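
As a rough side-by-side of the four approaches covered in this section, the sketch below uses a small stand-in table (the rows are invented for illustration; the real `species.csv` is not reproduced here). It suggests that `[]` keeps a one-column data frame, while `[[]]`, `$`, and `pull()` all return the same vector.

```r
library(dplyr)

# Hypothetical stand-in for the species table; values are illustrative only
species <- data.frame(
  species_id = c("DM", "DO", "PP"),
  genus = c("Dipodomys", "Dipodomys", "Chaetodipus")
)

one_col_df   <- species["species_id"]     # single brackets: still a data frame
vec_brackets <- species[["species_id"]]   # double brackets: a vector
vec_dollar   <- species$species_id        # $ is shorthand for [[ ]] with a name
vec_pull     <- pull(species, species_id) # dplyr equivalent, no quotes needed

is.data.frame(one_col_df)
identical(vec_brackets, vec_dollar)
identical(vec_dollar, vec_pull)
```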

> Do [Extracting vectors from data frames]({{ site.baseurl }}/exercises/extracting-vectors-from-data-frames-R/).

### Combining vectors to make a data frame

* We can also combine vectors to make a data frame
@@ -91,12 +94,23 @@ area <- c(3, 5, 1.9, 2.7)
count_data <- data.frame(states = states, counts = count, regional_area = area)
```

* We can also add columns to the data frame that only include a single value without first creating a vector
* To make a tibble instead of a data.frame use `tibble()`

```r
library(dplyr)

count_data <- tibble(states = states, counts = count, regional_area = area)
```

* `tibble()` is part of the `tibble` package and is re-exported by `dplyr`, so it is available once `dplyr` is loaded
* If you want to use it without loading `dplyr` you can load `tibble` directly
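
A minimal sketch of the second option, loading `tibble` directly instead of `dplyr`. The `area` values come from the diff context above; the `states` and `count` values are placeholders assumed for illustration, since their definitions fall outside the lines shown here.

```r
library(tibble)

# states and count are illustrative placeholders (assumed, not shown in this diff)
states <- c("FL", "FL", "GA", "SC")
count <- c(9, 16, 3, 10)
area <- c(3, 5, 1.9, 2.7)

# Same call as with dplyr loaded, but only the tibble package is attached
count_data <- tibble(states = states, counts = count, regional_area = area)
count_data
```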

* We can also add columns to the data frame that contain only a single value without first creating a vector
* We do this by providing a name for the new column, an equals sign, and the value that we want to occur in every row
* For example, if all of this data was collected in the same year and we wanted to add that year as a column in our data frame we could do it like this

```r
count_data_year <- data.frame(year = 2022, states = states, counts = count, regional_area = area)
count_data_year <- tibble(year = 2022, states = states, counts = count, regional_area = area)
```

* `year =` sets the name of the column in the data frame
@@ -119,5 +133,4 @@ count_data_year_elev <- mutate(count_data_year, elevations = elevation)
* We can combine vectors into data frames using the `data.frame` function, which takes a series of arguments, one vector for each column we want to create in the data frame.


> Do [Extracting vectors from data frames]({{ site.baseurl }}/exercises/extracting-vectors-from-data-frames-R/).
> Do [Building data frames from vectors]({{ site.baseurl }}/exercises/building-data-frames-from-vectors-R/).
23 changes: 9 additions & 14 deletions materials/dplyr-aggregation.md
@@ -33,10 +33,7 @@ surveys <- read_csv("surveys.csv")
group_by(surveys, year)
```

* Different looking kind of `data.frame`
* Called a tibble
* Sometimes produced by `dplyr` functions
* Source, grouping, and data type information
* The tibble produced by this function has grouping information
* Store the data frame in a variable to use in the next step

```r
@@ -45,14 +42,12 @@ surveys_by_year <- group_by(surveys, year)

* After grouping a data frame use `summarize()` to calculate values for each group.
* Count the number of rows for each group (individuals in each year).
* `summarize`
* Arguments
* Table to work on, which needs to be a grouped table
* One additional argument for each calculation we want to do for each group
* New column name to store calculated value
* `=`
* Calculation that we want to perform for each group
* We'll use the function `n()`, which is a special function that counts the rows in each group

* First argument is the table to work on
* Needs to be a grouped table
* One additional argument for each calculation we want to do for each group
* Column name to store calculated value, `=`, calculation to perform for each group
* We'll use the function `n()`, which is a special function that counts the rows in each group

```r
counts_by_year <- summarize(surveys_by_year, abundance = n())
@@ -78,7 +73,7 @@ plot_year_counts <- surveys |>

* We can also do multiple calculations using summarize
* Use any function that returns a single value from a vector.
* Use any function that returns a single value from one or more vectors
* E.g., mean, max, min
* We'll calculate the number of individuals in each plot-year combination and their average weight
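
To make the multiple-calculation pattern concrete without needing `surveys.csv`, here is a sketch on a tiny invented table. The `surveys_toy` name, the `plot_id` column name, and all values are assumptions for illustration only. Each extra argument to `summarize()` produces one new column per group.

```r
library(dplyr)

# Hypothetical toy data; column names and values are assumed for illustration
surveys_toy <- tibble(
  plot_id = c(1, 1, 2, 2, 2),
  year = c(1990, 1990, 1990, 1991, 1991),
  weight = c(40, 44, 12, 30, 34)
)

# One row per plot-year combination, with two calculated columns
size_abundance_toy <- surveys_toy |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight), .groups = "drop")

size_abundance_toy
```

Setting `.groups = "drop"` is a choice made in this sketch to return an ungrouped tibble; the lecture code may handle grouping differently.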

@@ -90,7 +85,7 @@ size_abundance_data <- surveys |>

* *Open table*
* Why did we get `NA`?
* `mean(weight)` returns `NA` when `weight` has missing values (`NA`)
* `mean(weight)` returns `NA` when `weight` has missing values (`NA`)
* Can fix using `drop_na(weight)`
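
The missing-value behavior can be sketched on a small invented table (the `weights_toy` name and its values are assumptions, not the survey data). It shows `mean()` returning `NA` when any value is missing, and `drop_na()` from `tidyr` removing those rows before summarizing.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing weight
weights_toy <- tibble(year = c(1990, 1990, 1991), weight = c(40, NA, 30))

# A single NA makes the mean NA
overall_mean <- mean(weights_toy$weight)
overall_mean

# Dropping rows with a missing weight first gives a usable mean per year
avg_by_year <- weights_toy |>
  drop_na(weight) |>
  group_by(year) |>
  summarize(avg_weight = mean(weight))

avg_by_year
```

An alternative that keeps the rows is `mean(weight, na.rm = TRUE)`, which ignores missing values inside the calculation instead of removing rows from the table.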

```r