forked from katiejolly/met-council-R
-
Notifications
You must be signed in to change notification settings - Fork 0
/
07-wrangling.Rmd
294 lines (178 loc) · 9.34 KB
/
07-wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
# Data Wrangling
We will mostly be using the `tidyr` and `dplyr` packages for wrangling. With those packages, most wrangling is done with 6 "data verbs." These are `select()`, `mutate()`, `filter()`, `arrange()`, `group_by()`, and `summarise()`. One quick note: summarise can be spelled the British or American way (as can any word with dual spellings). Much of the software was developed by people in New Zealand, hence the British spellings. I learned with British spellings so that's what I use, but it works either way.
```{r echo = FALSE}
knitr::include_graphics("https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_wrangling.png")
```
We can divide our verbs into 3 categories: row verbs, column verbs, and aggregate verbs.
All of the verbs work in a similar way. The general structure is `verb(data, action)`.
## Row verbs
The two row verbs are `filter()` and `arrange()`. Filter works by taking out rows that we don't want in our data. Arrange works by ordering the rows in some way. We will talk about filter first.
### Filter
Filter takes in boolean statements and returns only the `TRUE` ones. For example, `>`, `<`, and `==` are boolean statements. The double equal is R's way of saying something "is equivalent to."
Let's start by creating a dataframe of only Saturday trips.
```{r}
# create a new dataframe called weekend_trips
saturday_trips <- filter(nice_ride_2018, start_day == "Sat")
dim(saturday_trips) # check that you have these dimensions
```
We can also make a dataframe of trips that ended after 12 noon.
```{r}
after_12 <- filter(nice_ride_2018, end_hour > 12)
dim(after_12)
```
Let's say we want some combination of filters. Maybe we want only the Saturday trips that ended after noon? We will use the `&` operator in between filter actions.
```{r}
saturday_after_12 <- filter(nice_ride_2018, end_hour > 12 & start_day == "Sat")
dim(saturday_after_12)
```
Now let's make a dataframe of all the trips on a weekend day. We will use the `|` (or) operator in between filter actions.
```{r}
weekend_trips <- filter(nice_ride_2018, start_day == "Sat" | start_day == "Sun")
dim(weekend_trips)
```
Let's add Friday to that list.
```{r}
weekendFri_trips <- filter(nice_ride_2018, start_day == "Sat" | start_day == "Sun" | start_day == "Fri")
dim(weekendFri_trips)
```
It gets a little long to write that all out, and this is a pretty simply query. Luckily, R has an `%in%` operator to filter "in bulk."
We would get the same results as above by writing:
```{r}
weekendFri_trips <- filter(nice_ride_2018, start_day %in% c("Sat", "Sun", "Fri"))
dim(weekendFri_trips)
```
### Arrange
Arrange is a way to order your dataset by specific variables.
You can order by just one variable, like duration.
## Practice
1. How many trips were more than 30 minutes long? This is equivalent to 1800 seconds.
2 (a). How many casual trips were there on Wednesdays, Thursdays, and Fridays?
2 (b) . How many casual trips were there on Wednesdays, Thursdays, and Fridays that were 15 minutes or less?
3. How many dockless trips taken by subscribed users were there total?
## Column verbs
Now that we've talked about wrangling the rows, we can think about wrangling the columns. `mutate` is used to create new columns, while `select` is used to narrow down the columns you keep in your table.
### Mutate
```{r echo = FALSE, caption = "Artwork by Allison Horst"}
knitr::include_graphics("images/dplyr_mutate.png")
```
Let's say we want to create a new duration variable that is in minutes instead of seconds. We can divide our old column by 60 to create the new column.
```{r}
nice_ride_2018 <- mutate(nice_ride_2018, tripduration_min = tripduration / 60)
head(nice_ride_2018$tripduration_min)
```
## Practice
The average trip duration in minutes is 21.257 (see code below).
```{r}
summary(nice_ride_2018$tripduration_min)
```
Create a new variation called duration_deviation that calculates how far each trip was from the mean. Your resulting column should look something like this...
```{r echo = FALSE}
nice_ride_2018 %>%
mutate(duration_deviation = tripduration_min - 21.257) %>%
select(duration_deviation) %>%
head()
```
### Select
```{r echo = FALSE}
knitr::include_graphics("https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig1.png")
```
## Aggregate verbs
### `%>%`
group_by and summarise are the two main verbs for aggregate statistics, they're similar to Excel pivot tables. In this section we will also talk about the pipe, `%>%`, operator. When doing data wrangling tasks there are often many intermediate tables. Maybe you filter, then create a new column, then calculate some summary statistic. Instead of doing all of these as discrete steps, we can link them to push the data through a workflow instead of stopping along the way. This makes life easier for you because you won't clutter your environment or have to remember what `data3` has that `data4` doesn't.
The way it works is fairly straightforward. You use the `%>%` between data verbs and don't input a new dataset. The operator is telling your verb to use the previously created data.
For example: if you wanted to filter out just Saturday trips and then convert the duration to minutes, you could do:
```{r}
saturday <- filter(nice_ride_2018, start_day == "Sat") %>%
mutate(duration_minutes = tripduration / 60)
```
Which gives you the exact same output as:
```{r}
saturday <- filter(nice_ride_2018, start_day == "Sat")
saturday <- mutate(saturday, duration_minutes = tripduration / 60)
```
With a lot less effort.
### group_by & summarise
`group_by` is how we specify the base unit for aggregate summaries and `summarise` tells us what to calculate.
We can count the number of trips at each start station. Here we will group by the start station because we want counts **per** start station. Then to get a count summary we will use the `n()` function, which counts the number of observations or rows per group.
```{r}
trips_per_station <- nice_ride_2018 %>%
group_by(start_station_name) %>%
summarise(trips = n())
head(trips_per_station)
```
We can also use functions like `sum()`, `mean()`, `min()`, `median()`, etc. For example, we can calculate the mean trip duration per start station.
```{r}
mean_duration <- nice_ride_2018 %>%
group_by(start_station_name) %>%
summarise(mean_duration = mean(tripduration))
head(mean_duration)
```
Then we could also chain a mutate to convert these times to minutes.
```{r}
mean_duration <- nice_ride_2018 %>%
group_by(start_station_name) %>%
summarise(mean_duration = mean(tripduration)) %>%
mutate(mean_duration_min = mean_duration / 60)
head(mean_duration)
```
## Practice
1. Calculate the number of trips for each user type that occurred at each start station. The structure should look like the example below:
Hint: you can use multiple grouping variables
```{r echo = FALSE}
nice_ride_2018 %>%
group_by(start_station_name, usertype) %>%
summarise(trips = n()) %>%
head()
```
## Reshaping verbs
### Spread
`spread()` takes two columns (key & value), and spreads into multiple columns: it makes “long” data wider.
```{r echo = FALSE}
knitr::include_graphics("https://garrettgman.github.io/images/tidy-8.png")
```
We can spread the data so that we have a column for the number of users of different types at each station. First we can group the data to calculate that number though (your answer from above).
```{r}
user_types <- nice_ride_2018 %>%
group_by(start_station_name, usertype) %>%
summarise(trips = n()) %>%
spread(key = "usertype", value = "trips")
head(user_types)
```
### Gather
`gather()` takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
```{r echo = FALSE}
knitr::include_graphics("https://garrettgman.github.io/images/tidy-9.png")
```
Let's say we got the data in the above form (a variable for each user type) and instead want it in a long format. We can use `gather()` to go the other way.
```{r}
user_types_long <- user_types %>%
gather(key = "usertype", value = "trips", 2:3) # specify which columns you want, this reads "1 to 2"
user_types_long
```
## Practice
* Calculate the number of trips by gender and then spread that dataframe so that you have a column for each of the inputs
* Then gather that back to its original format
## Joins
`dplyr` has two different families of joins. First, there's "joining" joins, which is how we typically think of them
```{r echo = FALSE}
knitr::include_graphics("images/joins1.png")
```
Then there's also filtering joins.
```{r echo = FALSE}
knitr::include_graphics("images/join.png")
```
These are useful when you want to filter the rows of one dataset based on another, but you don't care about the attributes of that second dataset.
The general structure for joins is `join(x, y, c("common variable in x" = "common variable in y"))`. We can join our trips data to the station locations which includes the total number of docks for each station.
First, load the station data.
```{r}
data("station_locations")
head(station_locations)
```
So then if we want to join these by the `start_station_name` variable in our trips dataset, we would type:
```{r paged.print=TRUE}
joined <- left_join(nice_ride_2018, station_locations, by = c("start_station_name" = "name"))
glimpse(joined)
```
We can see at the end that there's now a total_docks variable.
## Practice
* Join the trips and stations on the end_station_name variable