forked from beanumber/etl
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
131 lines (91 loc) · 4.93 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
output: github_document
---
# etl <img src="inst/png/etl_hex.png" align="right" height=140/>
[![Travis-CI Build Status](https://travis-ci.org/beanumber/etl.svg?branch=master)](https://travis-ci.org/beanumber/etl)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/etl)](https://cran.r-project.org/package=etl)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/etl)](http://www.r-pkg.org/pkg/etl)
`etl` is an R package to facilitate [Extract - Transform - Load (ETL)](https://en.wikipedia.org/wiki/Extract,_transform,_load) operations for **medium data**. The end result is generally a populated SQL database, but the user interaction takes place solely within R.
`etl` is on CRAN, so you can install it in the usual way, then load it.
```{r, eval=FALSE}
install.packages("etl")
```
```{r, message=FALSE}
library(etl)
```
Instantiate an `etl` object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial `mtcars` database is built into `etl`.
```{r, warning=FALSE}
cars <- etl("mtcars")
class(cars)
```
## Connect to a local or remote database
`etl` works with a local or remote database to store your data. Every `etl` object extends a `dplyr::src_dbi` object. If, as in the example above, you do not specify a SQL source, a local `RSQLite` database will be created for you. However, you can also specify any source that inherits from `dplyr::src_dbi`.
> Note: If you want to use a database other than a local RSQLite, you must create the `mtcars` database and have permission to write to it first!
```{r, eval=FALSE}
# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)
```
At the heart of `etl` are three functions: `etl_extract()`, `etl_transform()`, and `etl_load()`.
## Extract
The first step is to acquire data from an online source.
```{r}
cars %>%
etl_extract()
```
This creates a local store of raw data.
## Transform
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
```{r}
cars %>%
etl_transform()
```
## Load
Populate the SQL database with the transformed data.
```{r}
cars %>%
etl_load()
```
## Do it all at once
To populate the whole database from scratch, use `etl_create`.
```{r}
cars %>%
etl_create()
```
You can also update an existing database without re-initializing, but watch out for primary key collisions.
```{r, eval=FALSE}
cars %>%
etl_update()
```
## Do Your Analysis
Now that your database is populated, you can work with it as a `src` data table just like any other `dplyr` source.
```{r}
cars %>%
tbl("mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
```
## Create your own ETL packages
Suppose you want to create your own ETL package called `pkgname`. All you have to do is write a package that requires `etl`, and then you have to write **two S3 methods**:
```{r, eval=FALSE}
etl_extract.etl_pkgname()
etl_load.etl_pkgname()
```
Please see the "[Extending etl](https://github.com/beanumber/etl/blob/master/vignettes/extending_etl.Rmd)" vignette for more information.
## Use other ETL packages
- [macleish](https://github.com/beanumber/etl) [![Travis-CI Build Status](https://travis-ci.org/beanumber/macleish.svg?branch=master)](https://travis-ci.org/beanumber/macleish) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/macleish)](https://cran.r-project.org/package=macleish) :
Weather and spatial data from the MacLeish Field Station in Whately, MA.
- [airlines](https://github.com/beanumber/airlines) [![Travis-CI Build Status](https://travis-ci.org/beanumber/airlines.svg?branch=master)](https://travis-ci.org/beanumber/airlines) : On-time flight arrival data from the Bureau of Transportation Statistics
- [citibike](https://github.com/beanumber/citibike) [![Travis-CI Build Status](https://travis-ci.org/beanumber/citibike.svg?branch=master)](https://travis-ci.org/beanumber/citibike) : Municipal bike-sharing system in New York City
- [nyc311](https://github.com/beanumber/nyc311) [![Travis-CI Build Status](https://travis-ci.org/beanumber/nyc311.svg?branch=master)](https://travis-ci.org/beanumber/nyc311) : Phone calls to New York City's feedback hotline
- [fec](https://github.com/beanumber/fec) [![Travis-CI Build Status](https://travis-ci.org/beanumber/fec.svg?branch=master)](https://travis-ci.org/beanumber/fec) : Campaign contribution data from the Federal Election Commission
- [imdb](https://github.com/beanumber/imdb) [![Travis-CI Build Status](https://travis-ci.org/beanumber/imdb.svg?branch=master)](https://travis-ci.org/beanumber/imdb) : Mirror of the Internet Movie Database
## Cite
Please see [the full manuscript](https://arxiv.org/abs/1708.07073) for additional details.
```{r}
citation("etl")
```