Merge pull request #1070 from ethanwhite/aggregation-cleanup

Aggregation and join lecture cleanup
datacarpentry · Sep 10, 2024 · d540ba5
2 parents 939b8e2 + 48e2e77
commit d540ba5

Showing 2 changed files with 39 additions and 31 deletions.
diff --git a/materials/converting-dataframes-vectors.md b/materials/converting-dataframes-vectors.md
@@ -23,7 +23,8 @@ download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-
 
 * These two ways of storing data are related to one another
 * A data frame is a bunch of equal length vectors that are grouped together
-* So, we can extract vectors from data frames and we can also make data frames from vectors
+* This is why when using `mutate()` and `summarize()` we can use any function that works on a vector
+* As a result we can extract vectors from data frames and make data frames from vectors
 
 ### Extracting vectors from data frames
 
@@ -35,20 +36,10 @@ download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-
 species <- read_csv("species.csv")
 ```
 
-* One common approach to extracting a column into a vector is to use `$`
-* The `$` in R is short hand for `[[]]` in cases where the piece we want to get has a name
-* So, we start with the object we want a part of, our `surveys` data frame
-* Then the `$` with no spaces around it
-* and then the name of the `species_id` column (without quotes, just to be confusing)
-
-```r
-species$species_id
-```
-
-* We can also do this using `[]`
-* Remember that `[]` also mean "give me a piece of something"
+* We do this using `[]`
+* Remember that `[]` also means "give me a piece of something" in R
 * Let's get the `species_id` column
-* `"species_id"` has to be in quotes because we we aren't using `dplyr`
+* `"species_id"` has to be in quotes because we aren't using the tidyverse
 
 ```r
 species["species_id"]
@@ -62,6 +53,16 @@ species["species_id"]
 species[["species_id"]]
 ```
 
+* We can also use the `$`
+* Shorthand for `[[]]` in cases where the piece of something we want to get has a name
+* So, we start with the object we want a part of, our `species` data frame
+* Then the `$` with no spaces around it
+* and then the name of the `species_id` column (without quotes, just to be confusing)
+
+```r
+species$species_id
+```
+
 * Finally, `dplyr` has a function called `pull()`
 
 ```r
@@ -76,6 +77,8 @@ species |>
   pull(species_id)
 ```
 
+> Do [Extracting vectors from data frames]({{ site.baseurl }}/exercises/extracting-vectors-from-data-frames-R/).
+
 ### Combining vectors to make a data frame
 
 * We can also combine vectors to make a data frame
@@ -91,12 +94,23 @@ area <- c(3, 5, 1.9, 2.7)
 count_data <- data.frame(states = states, counts = count, regional_area = area)
 ```
 
-* We can also add columns to the data from that only include a single value without first creating a vector
+* To make a tibble instead of a data.frame use `tibble()`
+
+```r
+library(dplyr)
+
+count_data <- tibble(states = states, counts = count, regional_area = area)
+```
+
+* `tibble()` is part of the `tibble` package, which gets loaded by `dplyr`
+* If you want to use it without loading `dplyr` you can load `tibble` directly
+
+* We can also add columns to the data that only include a single value without first creating a vector
 * We do this by providing a name for the new column, an equals sign, and the value that we want to occur in every row
 * For example, if all of this data was collected in the same year and we wanted to add that year as a column in our data frame we could do it like this
 
 ```r
-count_data_year <- data.frame(year = 2022, states = states, counts = count, regional_area = area)
+count_data_year <- tibble(year = 2022, states = states, counts = count, regional_area = area)
 ```
 
 * `year =` sets the name of the column in the data frame
@@ -119,5 +133,4 @@ count_data_year_elev <- mutate(count_data_year, elevations = elevation)
 * We can combine vectors into data frames using the `data.frame` function, which takes a series of arguments, one vector for each column we want to create in the data frame.
 
 
-> Do [Extracting vectors from data frames]({{ site.baseurl }}/exercises/extracting-vectors-from-data-frames-R/).
 > Do [Building data frames from vectors]({{ site.baseurl }}/exercises/building-data-frames-from-vectors-R/).
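
The changes to this file can be read together as one round trip: build a tibble from vectors, then pull a column back out as a vector. The following is a minimal sketch of that round trip, not part of the lesson itself. The `states` and `count` values are assumed for illustration (only `area` appears in the diff), and the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Example vectors; `states` and `count` are assumed values for illustration
states <- c("FL", "FL", "GA", "SC")
count <- c(9, 16, 3, 10)
area <- c(3, 5, 1.9, 2.7)

# The single value given for `year` is repeated in every row
count_data_year <- tibble(year = 2022, states = states, counts = count, regional_area = area)

# Single brackets keep the result as a one-column tibble
one_column <- count_data_year["counts"]

# Double brackets, `$`, and pull() all return the underlying vector
as_vector <- count_data_year[["counts"]]
same_vector <- count_data_year$counts
pulled <- pull(count_data_year, counts)
```

Only the single-bracket form stays a data frame. The other three give the same plain numeric vector, which is why they are interchangeable when passing a column to a function that expects a vector.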
diff --git a/materials/dplyr-aggregation.md b/materials/dplyr-aggregation.md
@@ -33,10 +33,7 @@ surveys <- read_csv("surveys.csv")
 group_by(surveys, year)
 ```
 
-* Different looking kind of `data.frame`
-* Called a tibble
-* Sometimes produced by `dplyr` functions
-* Source, grouping, and data type information
+* The tibble produced by this function has grouping information
 * Store the data frame in a variable to use in the next step
 
 ```r
@@ -45,14 +42,12 @@ surveys_by_year <- group_by(surveys, year)
 
 * After grouping a data frame use `summarize()` to calculate values for each group.
 * Count the number of rows for each group (individuals in each year).
-* `summarize`
-* Arguments
-  * Table to work on, which needs to be a grouped table
-  * One additional argument for each calculation we want to do for each group
-    * New column name to store calculated value
-    * `=`
-    * Calculation that we want to perform for each group
-    * We'll use the function `n` which is a special function that counts the rows in the table
+
+* First argument is the table to work on
+* Needs to be a grouped table
+* One additional argument for each calculation we want to do for each group
+* Column name to store calculated value, `=`, calculation to perform for each group
+* We'll use `n()`, a special function that counts the rows in each group
 
 ```r
 counts_by_year <- summarize(surveys_by_year, abundance = n())
@@ -78,7 +73,7 @@ plot_year_counts <- surveys |>
 
 
 * We can also do multiple calculations using summarize
-* Use any function that returns a single value from a vector.
+* Use any function that returns a single value from one or more vectors
 * E.g., mean, max, min
 * We'll calculate the number of individuals in each plot year combination and their average weight
 
@@ -90,7 +85,7 @@ size_abundance_data <- surveys |>
 
 * *Open table*
 * Why did we get `NA`?
-    * `mean(weight)` returns `NA` when `weight` has missing values (`NA`)
+* `mean(weight)` returns `NA` when `weight` has missing values (`NA`)
 * Can fix using `drop_na(weight)`
 
 ```r