# Reading in Data
I am a strong believer in the [cake-first](https://twitter.com/minebocek/status/1072222447473168389) approach to teaching/learning R. It emphasizes real-world examples, interesting data, and visual feedback. Because of that, I like to use ready-made data packages like `fivethirtyeight` and talk about visualization *before* data cleaning.
Reading in data is still an important skill, so we will cover it briefly at the end of today without spending too much time on it. For now, let's eat the cake instead of going out to get the ingredients. That's more fun anyway.
## Data packages
A number of R packages exist specifically to make data sharing easier. A few examples are:

* `fivethirtyeight` to share the data used in FiveThirtyEight articles
* `bikedata` to share data about certain bikeshare systems
* `ecoengine` to share data from the Berkeley natural history museum

We will use data from the MN Nice Ride bikeshare system. I put the data in a package called [metcouncilR](https://www.katiejolly.io/metcouncilR/) so it's easy to use. You can install it by typing `install_github("katiejolly/metcouncilR")` in your console. The `install_github()` function comes from the `remotes` package (it is also available through `devtools`), so load one of those first.
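A minimal sketch of the install step, assuming you use `remotes` rather than `devtools` to provide `install_github()`. The `requireNamespace()` checks are an addition so the chunk is safe to re-run; they are not part of the original instructions.

```{r eval = FALSE}
# Install the helper package from CRAN if it is missing
# (assumption: remotes is used here; devtools would work too)
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install metcouncilR from GitHub only if it is not already installed
if (!requireNamespace("metcouncilR", quietly = TRUE)) {
  remotes::install_github("katiejolly/metcouncilR")
}
```

Calling the function as `remotes::install_github()` avoids having to run `library(remotes)` first.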
### Try it out
In your R Markdown file, fill in the blank below to load the package:
```{r include = FALSE}
library(metcouncilR)
```
```{r eval = FALSE}
___(metcouncilR)
```
To pull a particular dataset from this package, we can use the `data()` function.
```{r}
data("nice_ride_2018") # niceride dataset
```
You should now see it in your global environment (the Environment pane in RStudio).
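If you are not sure which datasets a package ships, calling `data()` with the `package` argument lists them. The sketch below uses the built-in `datasets` package so it runs without `metcouncilR` installed; swapping in `"metcouncilR"` is an assumption about how you would adapt it.

```{r}
# List the datasets that ship with a package (here: base R's "datasets")
available <- data(package = "datasets")

# The result stores one row per dataset; the "Item" column holds the names
dataset_names <- available$results[, "Item"]
head(dataset_names)

# Load one of them by name, just like data("nice_ride_2018")
data("mtcars")
exists("mtcars")
```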
I've written documentation for this data that you can see in the help pane in RStudio.
```{r eval = FALSE}
help("nice_ride_2018")
```
There are also a few different ways to get quick summaries of the data.
First, you can check the dimensions to get the number of rows and columns.
```{r}
dim(nice_ride_2018)
```
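`dim()` returns a vector with the number of rows first and the number of columns second, so you can index into it. A small sketch using the built-in `mtcars` data (a stand-in here, not the Nice Ride data):

```{r}
# dim() gives c(rows, columns)
dims <- dim(mtcars)
dims

# The first element is the number of rows
dims[1]
```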
What function would you use to get just the number of columns? (Google it.)
```{r eval = FALSE}
___(nice_ride_2018)
```
We can also print the first 6 rows of the data with the `head()` function.
```{r}
head(nice_ride_2018)
```
How can we modify this code to print the first **10** rows instead? (Hint: run `help(head)` to see the documentation.)
```{r eval = FALSE}
head(nice_ride_2018, __ = ___)
```
We can also just get a summary of each variable.
```{r}
summary(nice_ride_2018)
```
You'll notice that these summaries aren't very meaningful for the character variables, where `summary()` reports only the length, class, and mode. Another way to extract information about a variable is the `$` operator. To pull out just one variable from a dataset, write `data$variable`. We can use this syntax to make a table of the user types.
```{r}
table(nice_ride_2018$usertype)
```
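To see the difference between `summary()` and `table()` on a character variable, here is a small sketch with a made-up data frame. The `trips` data and its values are hypothetical and only mirror the shape of the real dataset.

```{r}
# A tiny, made-up data frame (hypothetical values, not from nice_ride_2018)
trips <- data.frame(
  usertype = c("Member", "Casual", "Member", "Member"),
  tripduration = c(300, 1250, 640, 410),
  stringsAsFactors = FALSE
)

# summary() on a character column reports only length, class, and mode
summary(trips$usertype)

# table() counts how often each value appears
user_counts <- table(trips$usertype)
user_counts

# $ also works with numeric columns, e.g. to average one variable
mean(trips$tripduration)
```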
### Practice
1. How many of the users were female?
2. What was the longest trip duration?
3. Looking at the documentation, why might an end station name be empty?
4. Looking at the documentation, what is the unit of the `tripduration` variable?
5. What kinds of trips are excluded from this data?