-
Notifications
You must be signed in to change notification settings - Fork 0
/
my-vignette.Rmd
157 lines (111 loc) · 4.29 KB
/
my-vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "CitiBike"
author: "Wencong (Priscilla) Li (liwencong1995@gmail.com) & Weijia (Vega) Zhang (wzhang23@smith.edu)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{How to use NYC CitiBike Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include=FALSE, message=FALSE}
library(knitr)
opts_chunk$set(fig.width = 6, fig.height = 4)
```
## ETL
`citibike` is an R package that facilitates Extract - Transform - Load operations for [NYC CitiBike Data](https://www.citibikenyc.com/system-data). `citibike` inherits functions from [etl ](https://github.com/beanumber/etl/blob/master/README.Rmd), which is the parent package. Similarly to `etl`, `citibike` allows the user to pass the date arguments and get a populated SQL database back.
## ETL citiBike
`citibike` is dependent on `etl`. To get started, both packages need to be loaded.
```{r, eval=FALSE, message=FALSE}
install.packages("devtools")
devtools::install_github("beanumber/citibike")
```
```{r, message=FALSE}
library(etl)
library(citibike)
```
Instantiate an `etl` object using a string **citibike** that determines the class of the object. Specifiy the local directory for data storage.
```{r}
bikes <- etl("citibike", dir = "local_path")
class(bikes)
```
## Connect to a local or remote database
`etl` works with a local or remote database, so as `citibike`. If no SQL source has been specified, a local `RSQLite` database will be created for the user. However, the user could also specify the path to an existing database using `dplyr::src_sql`.
## Extract
Extract takes `years` and `months` parameters and allows user to fetch the data of specific date of interest. Note that the data is updated monthly. The default date is the starting month, which is ** July 2013**. If the user enters any date before that, the user will get an error message notification. The user could check the raw folder in the local directory that's created before.
```{r, eval= FALSE}
# default
etl_extract(bikes)
# a specific month
etl_extract(bikes, years = 2014, months = 9)
# duration
etl_extract(bikes, years = 2015, moths = 3:6)
# invalid date will bring up error message
etl_extract(bikes, months = 3:5)
```
## Transform
By default, `etl_transform` takes July 2013 data file and transform the raw data into CSV files. Similar to `etl_extract`, the user could specify the dates. The user could check the load folder of the local directory that has been created.
```{r, eval = FALSE}
bikes %>%
etl_transform()
```
## Load
Import the CSV files into SQL and populate the SQL database with the transformed data.
```{r, eval = FALSE}
bikes %>%
etl_load()
```
## About the dataset
All of the tables contain the following columns:
* Trip Duration (seconds)
* Start Time and Date
* Stop Time and Date
* Start Station Name
* End Station Name
* Station ID
* Station Lat/Long
* Bike ID
* User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
* Gender (0=unknown; 1=male; 2=female) and Year of Birth.
The user could choose which ones to look at for further analysis.
## Examples
```{r, message=FALSE}
library(lubridate)
library(dplyr)
```
The following code loads citibike data from 2015 January into a SQLite database.
```{r, eval = FALSE}
bikes <- etl("citibike")
bikes %>%
etl_create(years = 2015, months = 2)
trips <- tbl(bikes,"trips")
head(trips, 4)
trips <- trips %>%
collect()
citibike_time <- trips %>%
select(starttime, stoptime) %>%
mutate(starttime = dmy_hm(starttime)) %>%
mutate(stoptime = dmy_hm(stoptime)) %>%
mutate(duration_time = as.numeric(stoptime-starttime))
citibike_time <- citibike_time %>%
filter(duration_time >= 0)
mean(citibike_time$duration_time)
citibike_start <- trips %>%
select(start.station.name, start.station.longitude, start.station.latitude) %>%
rename(longitude = start.station.longitude) %>%
rename(latitude = start.station.latitude)
if (require(leaflet)) {
leaflet(data = citibike_start) %>%
addTiles() %>%
addCircles()
}
citibike_end <- trips %>%
select(end.station.name, end.station.longitude, end.station.latitude) %>%
rename(longitude = end.station.longitude) %>%
rename(latitude = end.station.latitude)
if (require(leaflet)) {
leaflet(data = citibike_end) %>%
addTiles() %>%
addCircles()
}
```