We’ll manipulate our barplots and add more information using factors.
Here’s the dataset we’ll use to investigate how to work with factors in ggplot2.
-
#| edit: false
-pets
+
+
+
+
+
+
3.1.1 Exercise
@@ -292,11 +307,17 @@
#| exercise: ex_1
-##use glimpse here
-glimpse(----)
-
-
Solution.
+
+
+
+
+
+
+
+
+
@@ -309,8 +330,8 @@
-
##use glimpse here
-glimpse(pets)
+
##use glimpse here
+glimpse(pets)
@@ -328,16 +349,16 @@
3.2.1 Exercise
-
#| exercise: ex_2
-##Show a barplot and count by name and fill by animal
-##theme() allows us to angle the text labels so that we can read them
-ggplot(pets, aes(x= -----)) + geom_bar() +
- ##We make the x axis text angled
- ##for better legibility
- theme(axis.text.x = element_text(angle=45))
-
-
-
Solution.
+
+
+
+
+
+
+
+
@@ -350,12 +371,12 @@
-
##show a barplot and count by name and fill by animal
-##theme() allows us to angle the text labels so that we can read them
-ggplot(pets, aes(x=name)) +geom_bar() +
-##we make the x axis text angled
-##for better legibility
-theme(axis.text.x =element_text(angle=45))
+
##show a barplot and count by name and fill by animal
+##theme() allows us to angle the text labels so that we can read them
+ggplot(pets, aes(x=name)) +geom_bar() +
+##we make the x axis text angled
+##for better legibility
+theme(axis.text.x =element_text(angle=45))
@@ -368,12 +389,17 @@
Map shotsCurrent to the fill aesthetic.
3.3.1 Exercise
-
#| exercise: ex_3
-#map the right variable in pets to fill
-ggplot(pets, aes(x=animal, fill= ----)) +
- geom_bar()
-
-
Solution.
+
+
+
+
+
+
+
+
+
@@ -386,9 +412,9 @@
-
#map the right variable in pets to fill
-ggplot(pets, aes(x=animal, fill=shotsCurrent)) +
-geom_bar()
+
#map the right variable in pets to fill
+ggplot(pets, aes(x=animal, fill=shotsCurrent)) +
+geom_bar()
@@ -398,27 +424,27 @@
3.4 Quick Quiz
What does mapping color to "black" in geom_bar() do? For example:
Say you have another factor variable and you want to stratify the plots based on that. You can do that by supplying the name of that variable as a facet. Here, we facet our barplot by shotsCurrent.
-
#| edit: false
-ggplot(data=pets, mapping=aes(x=name)) + geom_bar() +
- ##have to specify facets using "~" notation
- facet_wrap(facets=~shotsCurrent) +
- ##we make the x axis x angled for better legibility
- theme(axis.text.x = element_text(angle=45))
+
+
+
+
+
+
You might notice that each facet has blank spots for categories that do not occur in it. We can remove these by passing the scales = "free_x" argument to facet_wrap().
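As a minimal sketch of the free x-axis idea, here is the same kind of faceted barplot on a tiny made-up data frame. `toy_pets` is a hypothetical stand-in that only borrows the column names `name` and `shotsCurrent`; it is not the real `pets` data.

```r
library(ggplot2)

# Hypothetical stand-in for pets: same column names, made-up rows
toy_pets <- data.frame(
  name = c("Morris", "Morris", "Lassie", "Rex", "Rex"),
  shotsCurrent = c("Yes", "No", "Yes", "No", "No")
)

p <- ggplot(data = toy_pets, mapping = aes(x = name)) +
  geom_bar() +
  # scales = "free_x" lets each facet keep only the names that occur in it
  facet_wrap(facets = ~shotsCurrent, scales = "free_x") +
  # angle the x axis text for better legibility
  theme(axis.text.x = element_text(angle = 45))

# print the plot
p
```

Without `scales = "free_x"`, every facet would share one x axis and show an empty slot for each name that has no rows in that facet.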
3.7.1 Exercise
Add "free_x" to the scales argument. How many animals named “Morris” did not receive shots?
Is the proportion of animals receiving shots the same across each age category?
Think about what to map to x, and what to map to fill, and what position argument you need for geom_bar(). Finally, think about how to facet the variable.
So far, we’ve been looking at datasets that fit the ggplot2 paradigm nicely. However, most data we encounter is messy (for example, it has missing values) or arrives in a completely different format.
In this chapter, we’ll look at one of the most powerful tools in the tidyverse: dplyr, which lets you manipulate data frames.
@@ -280,12 +297,25 @@
4 dplyr cheat sheet
Also, remember: if you need to know the variables in a data.frame called biopics you can always use
-
#| edit: false
-colnames(biopics)
+
+
+
+
+
+
If you want more information on a function such as mutate(), you can always ask for help:
-
?mutate
+
+
+
+
+
+
Move on to the next exercise!
@@ -293,9 +323,9 @@
sumOfTwoNumbers <-1+2
+
sumOfTwoNumbers <-1+2
Once we have something assigned to a variable, we can use it in other expressions:
-
sumOfThreeNumbers <- sumOfTwoNumbers +3
+
sumOfThreeNumbers <- sumOfTwoNumbers +3
This is the bare basics of assignment. We’ll use it in the next exercises to evaluate the output of our dplyr cleaning.
##assign newValue
-newValue <-10
-## use newValue to calculate multValue
-multValue <- newValue *5
-##show multValue
-multValue
+
##assign newValue
+newValue <-10
+## use newValue to calculate multValue
+multValue <- newValue *5
+##show multValue
+multValue
@@ -350,14 +380,16 @@
Use the levels() function to count the categories.
-
#| exercise: ex_2
-##run summary() here on biopics
-summary(-----)
-##show length of country categories here
-length(levels(biopics$------))
-
-
-
Solution.
+
+
+
+
+
+
+
+
@@ -370,10 +402,10 @@
-
##run summary here
-summary(biopics)
-##show length of country categories here
-length(levels(biopics$country))
+
##run summary here
+summary(biopics)
+##show length of country categories here
+length(levels(biopics$country))
@@ -385,12 +417,14 @@
filter() is a very useful dplyr command. It allows you to subset a data.frame based on variable criteria.
For example, if we wanted to subset biopics to those movies that were made in the UK we’d use the following statement:
-
#| edit: false
-#| echo: true
-#subset the data using filter
-biopicsUK <- filter(biopics, country=="UK")
-#confirm that we have subsetted correctly
-biopicsUK
+
+
+
+
+
+
Three things to note here:
@@ -405,15 +439,16 @@
Show how many rows are left using nrow(crimeMovies).
-
#| exercise: ex_3
-#add your filter statement here
-crimeMovies <- filter(------)
+
+
-#show number of crime movies
-nrow(------)
-
-
Solution.
+
+
+
+
@@ -426,10 +461,10 @@
-
#add your filter statement here
-crimeMovies <-filter(biopics, type_of_subject =="Criminal")
-#show number of crime movies
-nrow(crimeMovies)
+
#add your filter statement here
+crimeMovies <-filter(biopics, type_of_subject =="Criminal")
+#show number of crime movies
+nrow(crimeMovies)
Show how many rows are left from your filter() statement.
-
#| exercise: ex_4
-#add your comparison to the end of this filter statement
-crimeFilms <- filter(biopics, year_release > 1980 &
- type_of_subject == "Criminal" &
- ------ == ------
- )
-
-#show number of rows in crimeFilms
-nrow(------)
+
+
+
+
+
-
-
Solution.
+
+
@@ -482,14 +520,14 @@
-
#add your comparison to the end of this filter statement
-crimeFilms <- filter(biopics, year_release > 1980 &
- type_of_subject == "Criminal" &
- person_of_color == FALSE
- )
+
+
-#show number of rows in crimeFilms
-nrow(crimeFilms)
+
+
+
@@ -501,25 +539,28 @@
4.5 Quick Quiz about Chaining Comparisons
Which statement should return the larger subset? Try them out in the console if you’re not sure.
What if you wanted to select for multiple values? You can use the %in% operator. Here we put the values into a vector with the c() function, which combines the values into a form that R can manipulate. Note that these values have to match exactly, including case (that is, “UK”, not “Uk” or “uk”), for the matching to work.
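A minimal sketch of %in% inside filter(), using a small made-up data frame. `toy_biopics` and its rows are hypothetical; only the column name `country` is borrowed from the tutorial.

```r
library(dplyr)

# Hypothetical stand-in for biopics; note the lower-case "uk" in the last row
toy_biopics <- data.frame(
  title = c("A", "B", "C", "D"),
  country = c("UK", "US", "Canada", "uk"),
  stringsAsFactors = FALSE
)

# keep rows whose country is exactly "UK" or exactly "US"
uk_us <- filter(toy_biopics, country %in% c("UK", "US"))

# the "uk" row is dropped because the case does not match
uk_us
```

Here `country %in% c("UK", "US")` behaves like `country == "UK" | country == "US"`, but it stays readable as the list of values grows.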
One trick you can use filter() for is to remove missing values. Usually missing values are coded as NA in data. You can remove rows that contain NAs by using is.na(). For example:
Note the ! in front of is.na(box_office). This ! is known as the NOT operator. Basically, it switches the values in our is.na statement, making everything that was TRUE into FALSE, and everything FALSE into TRUE. We want to keep everything that is not NA, so that’s why we use the !.
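To see what the ! does, here is a small sketch on a made-up data frame. `toy_biopics` is hypothetical; only the column name `box_office` comes from the tutorial.

```r
library(dplyr)

# Hypothetical stand-in for biopics with two missing box_office values
toy_biopics <- data.frame(
  title = c("A", "B", "C", "D"),
  box_office = c(1.5e6, NA, 3.0e7, NA)
)

# TRUE wherever box_office is missing
missing <- is.na(toy_biopics$box_office)

# the NOT operator flips every TRUE to FALSE and every FALSE to TRUE
not_missing <- !missing

# keep only the rows where box_office is not NA
kept <- filter(toy_biopics, !is.na(box_office))

# number of rows that were removed
nrow(toy_biopics) - nrow(kept)
```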
@@ -586,15 +640,16 @@
How many missing values did we remove?
-
#| exercise: ex_6
-filteredBiopics <- filter(--------, -------)
-#show number of rows in biopics
-nrow(biopics)
-#show number of rows in filteredBiopics
-nrow(filteredBiopics)
-
-
-
Solution.
+
+
+
+
+
+
+
+
@@ -608,11 +663,14 @@
-
filteredBiopics <- filter(biopics, !is.na(box_office))
-#show number of rows in biopics
-nrow(biopics)
-#show number of rows in filteredBiopics
-nrow(filteredBiopics)
+
+
+
+
+
+
@@ -624,10 +682,14 @@
4.8 dplyr::mutate()
mutate() is one of the most useful dplyr commands. You can use it to transform data (variables in your data.frame) and add the result as a new variable in the data.frame. For example, let’s divide box_office by number_of_subjects and store the result as normalized_box_office, so that films with different numbers of subjects can be compared:
What did we do here? First, we used the mutate() function to add a new column into our data.frame called normalized_box_office. This new variable is calculated per row by dividing box_office by number_of_subjects.
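The calculation described above can be sketched on a made-up data frame. `toy_biopics` and its values are hypothetical; the column names `box_office`, `number_of_subjects`, and `normalized_box_office` are the ones named in the text.

```r
library(dplyr)

# Hypothetical stand-in for biopics
toy_biopics <- data.frame(
  title = c("A", "B", "C"),
  box_office = c(100, 300, 90),
  number_of_subjects = c(1, 2, 3)
)

# add a new column, calculated row by row
normalized <- mutate(toy_biopics,
                     normalized_box_office = box_office / number_of_subjects)

# the original columns are kept; the new column is appended at the end
normalized
```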
@@ -638,15 +700,16 @@
Remember, you can use the paste() function to paste two strings together.
-
#| exercise: ex_7
-#assign new variable race_and_gender here using mutate()
-biopics2 <- mutate()
+
+
-#show first rows of biopics2 using head()
-head(biopics2)
-
-
Solution.
+
+
+
+
@@ -659,10 +722,10 @@
-
#assign new variable race_and_gender here using mutate()
-biopics2 <-mutate(biopics, race_and_gender =paste(subject_race, subject_sex))
-#show first rows of biopics2 using head()
-head(biopics2)
+
#assign new variable race_and_gender here using mutate()
+biopics2 <-mutate(biopics, race_and_gender =paste(subject_race, subject_sex))
+#show first rows of biopics2 using head()
+head(biopics2)
@@ -673,10 +736,14 @@
4.9 You can use mutated variables right away!
The nifty thing about mutate() is that once you define the variables in the statement, you can use them right away, in the same mutate statement. For example, look at this code:
Notice that we first defined box_office_year in the first part of the mutate() statement, and then used it right away to define a new variable, box_office_subject.
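A hedged sketch of that pattern follows. The text does not show how box_office_year is calculated, so the formula used here (box_office divided by year_release) is only an assumption for illustration, and `toy_biopics` is a made-up data frame.

```r
library(dplyr)

# Hypothetical stand-in for biopics
toy_biopics <- data.frame(
  title = c("A", "B"),
  box_office = c(4000, 6000),
  year_release = c(2000, 2000),
  number_of_subjects = c(2, 1)
)

chained <- mutate(toy_biopics,
  # assumed definition, for illustration only
  box_office_year = box_office / year_release,
  # box_office_year was defined one line up and is already usable here
  box_office_subject = box_office_year / number_of_subjects
)

chained
```

The order matters: swapping the two definitions would fail, because box_office_year would not exist yet when box_office_subject is calculated.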
@@ -687,16 +754,16 @@
Hint: Add box_office_y_s_num=box_office_year/number_of_subjects to the statement below.
checkdown::check_question("We are defining a brand-new variable with the same name in our dataset and keeping the old variable as well", options =c(
-"We are defining a brand-new variable with the same name in our dataset and keeping the old variable as well", "We are processing the variable `subject` and saving it in place"
-))
We’re going to introduce another bit of dplyr syntax, the %>% operator. %>% is called a pipe operator.
You can think of it as being similar to the + in a ggplot2 statement.
%>% takes the output of one statement and makes it the input of the next statement. When I’m describing it, I think of it as a “THEN”. For example, I read the following expression
as: I took the biopics data, THEN I filtered it down with the race_known == "Known" criteria, and THEN I defined a new variable called poc_code with mutate().
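The expression being read aloud is not shown above, so here is a sketch of what such a pipeline might look like. `toy_biopics` is a made-up data frame, and the definition of poc_code (a numeric version of person_of_color) is an assumption for illustration.

```r
library(dplyr)

# Hypothetical stand-in for biopics
toy_biopics <- data.frame(
  title = c("A", "B", "C"),
  race_known = c("Known", "Unknown", "Known"),
  person_of_color = c(TRUE, FALSE, FALSE)
)

# take the data, THEN filter, THEN mutate
piped <- toy_biopics %>%
  filter(race_known == "Known") %>%
  mutate(poc_code = as.numeric(person_of_color))

# the same steps without the pipe have to be read from the inside out
nested <- mutate(
  filter(toy_biopics, race_known == "Known"),
  poc_code = as.numeric(person_of_color)
)

piped
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.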
Note that filter() doesn’t have a data argument, because the data is piped into filter(). Same thing for mutate().
%>% allows you to chain multiple verbs in the tidyverse. It’s one of the most powerful things about the tidyverse.
@@ -292,14 +305,26 @@
-
biopics %>%
- filter(-----)
+
+
+
+
+
+
-
biopics %>%
- filter(country == "US")
+
+
+
+
+
+
@@ -310,12 +335,14 @@
5.2 group_by()/summarize()
group_by() doesn’t change the data by itself; it only marks which rows belong to which group. When combined with summarize(), it lets you calculate metrics (such as mean, max, min, or sd, the standard deviation) for each group. For example:
Here we want to calculate the mean box_office by country. However, in order to do that, we first need to remove any rows that have NA values in box_office that may confound our calculation.
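A minimal sketch of that calculation on a made-up data frame. `toy_biopics` and its values are hypothetical, as is the output column name `mean_box_office`; `country` and `box_office` are the columns named in the text.

```r
library(dplyr)

# Hypothetical stand-in for biopics with one missing box_office value
toy_biopics <- data.frame(
  country = c("UK", "UK", "US", "US", "US"),
  box_office = c(10, 30, 5, NA, 15)
)

mean_by_country <- toy_biopics %>%
  # remove rows with missing box_office first
  filter(!is.na(box_office)) %>%
  # THEN mark the groups
  group_by(country) %>%
  # THEN collapse each group to a single row
  summarize(mean_box_office = mean(box_office))

mean_by_country
```

If the filter() step were skipped, mean() would return NA for any country that has a missing box_office value.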
Let’s ask a tough question. Is there a difference in mean box_office between the two subject_sex categories?
@@ -327,24 +354,26 @@
-
gender_box_office <- biopics %>%
- filter() %>%
- group_by() %>%
- summarize(mean_bo_by_gender= mean(--------))
-
-##show head of gender_box_office
-head(gender_box_office)
checkdown::check_question("counts each `type_of_subject` and puts it in another table",
- options=c("just shows the regular `biopics` `data.frame`",
- "counts each `type_of_subject` and puts it in another table"
- ))
-
+
+
+
+
+
+
5.4 arrange()
arrange() lets you sort by a variable. If you provide multiple variables, the rows are sorted by the first variable, and ties are broken by each following variable in turn. For example:
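A small sketch of sorting by two variables, using a made-up data frame. `toy_biopics` and its rows are hypothetical; `country` and `year_release` are column names used elsewhere in this tutorial.

```r
library(dplyr)

# Hypothetical stand-in for biopics
toy_biopics <- data.frame(
  title = c("A", "B", "C", "D"),
  country = c("US", "UK", "US", "UK"),
  year_release = c(1990, 2005, 1975, 1960)
)

# sort by country first, then by year_release within each country
sorted <- arrange(toy_biopics, country, year_release)

# desc() reverses the order for one variable: newest film first per country
sorted_desc <- arrange(toy_biopics, country, desc(year_release))

sorted
```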
Note that we use %>% to pipe our statement into the ggplot() function. The tricky thing to remember is that everything after the ggplot() is connected with +, and not %>%.
Also note: we don’t assign a data variable in the ggplot() statement. We are piping in the data.
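A hedged sketch of piping into ggplot(), on a made-up data frame. `toy_biopics` is hypothetical, and the choice of a barplot of `country` is only for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for biopics
toy_biopics <- data.frame(
  country = c("UK", "UK", "US", "US", "US"),
  box_office = c(10, 30, 5, NA, 15)
)

p <- toy_biopics %>%
  # dplyr steps are chained with %>%
  filter(!is.na(box_office)) %>%
  # no data argument here: the filtered data is piped in
  ggplot(aes(x = country)) +
  # everything after ggplot() is added with +
  geom_bar()

# print the plot
p
```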