```{r include=FALSE}
source("r/render.R")
knitr::opts_chunk$set(eval = FALSE)
```
# Contributing {#contributing}
> Hold the door, hold the door.
>
> --- Hodor
In [Chapter 12](#streaming), we equipped you with the tools to tackle large-scale and real-time data processing in Spark using R. In this final chapter, we focus less on learning and more on giving back to the Spark and R communities and to colleagues in your professional career. It really takes an entire community to keep this going, so we are counting on you!
There are many ways to contribute, from helping community members and opening GitHub issues to providing new functionality for yourself, colleagues, or the R and Spark community. However, we'll focus here on writing and sharing code that extends Spark, so that others can use the new functionality you provide as an author of Spark extensions written in R. Specifically, you'll learn what an extension is, the different types of extensions you can build, what building tools are available, and when and how to build an extension from scratch.
You will also learn how to make use of the hundreds of extensions available in Spark and the millions of components available in Java that can easily be used from R. We'll also cover how to write native Scala code that makes use of Spark. As you might know, R is a great language for interfacing with other languages, such as C++, SQL, Python, and many others. It's no surprise, then, that working with Scala from R follows the same practices that make R ideal for providing easy-to-use interfaces that keep data processing productive and that are loved by many of us.
## Overview {#contributing-overview}
When<!--((("contributing", "overview of")))--> you think about giving back to your larger coding community, the most important question you can ask about any piece of code you write is: would this code be useful to someone else?
Let's start by considering one of the first and simplest lines of code presented in this book. This code was used to load a simple CSV file:
```{r contributing-read, eval=FALSE}
spark_read_csv(sc, "cars.csv")
```
Code this basic is probably not useful to someone else. However, you could tailor that same example to something that generates more interest, perhaps the following:
```{r contributing-read-useful, eval=FALSE}
spark_read_csv(sc, "/path/that/is/hard/to/remember/data.csv")
```
This code is quite similar to the first. But if you work with people who use this dataset, the answer to the usefulness question is yes: this code would very likely be useful to someone else!
This is surprising since it means that not all useful code needs to be advanced or complicated. However, for it to be useful to others, it does need to be packaged, presented, and shared in a format that is easy to consume.
A first attempt would be to save this into a _teamdata.R_ file and write a function wrapping it:
```{r contributing-useful-function, eval=FALSE}
load_team_data <- function() {
  spark_read_csv(sc, "/path/that/is/hard/to/remember/data.csv")
}
```
This is an improvement, but it would require users to manually share this file again and again. Fortunately, this<!--((("packages")))((("R Packages")))--> problem is easily solved in R, through _R packages_.
An<!--((("commands", "install.packages()")))--> R package contains R code in a format that is installable using the function `install.packages()`. One example is `sparklyr`. There are many other R packages available; you can also create your own. For those of you new to creating them, we encourage you to read Hadley Wickham’s book, [_R Packages_](http://shop.oreilly.com/product/0636920034421.do) (O'Reilly). Creating an R package allows you to easily share your functions with others by sharing the package file in your organization.
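As a minimal sketch of getting started, using the `usethis` package (one option among several, and not something the text or `sparklyr` requires):
```{r eval=FALSE}
install.packages("usethis")
usethis::create_package("teamdata")    # scaffold a new package
usethis::use_r("load_team_data")       # creates R/load_team_data.R to hold the function above
```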
Once you create a package, there are many ways of sharing it with colleagues or the world. For instance, for packages meant to be private, consider using [Drat](http://bit.ly/2N8J1f7) or products like [RStudio Package Manager](http://bit.ly/2H5k807). R packages meant for public consumption are made available to the R community in [CRAN](https://cran.r-project.org/) (the Comprehensive R Archive Network).
These repositories of R packages allow users to install packages through `install.packages("teamdata")` without having to worry about where to download the package from. They also allow other packages to reuse your package.
In addition to building on R packages like `sparklyr`, `dplyr`, and `broom` when you create new R packages that extend Spark, you can also use all the functionality available in the Spark API, in Spark extensions, or in custom Scala code.
For instance, suppose that there is a new file format similar to a CSV but not quite the same. We might want to write a function named `spark_read_file()` that would take a path to this new file type and read it in Spark. One approach would be to use `dplyr` to process each line of text or any other R library using `spark_apply()`. Another would be to use the Spark API to access methods provided by Spark. A third approach would be to check whether someone in the Spark community has already provided a Spark extension that supports this new file format. Last but not least, you could write your own custom Scala code that makes use of any Java library, including Spark and its extensions. This is shown in Figure \@ref(fig:contributing-types-of-extensions).
```{r contributing-types-of-extensions, eval=TRUE, echo=FALSE, fig.cap='Extending Spark using the Spark API or Spark extensions, or writing Scala code', fig.align='center', out.height = '340pt', out.width = 'auto'}
render_nomnoml('
#direction: right
[R|
[<note> spark_read_file("path")]
]->[R Package|
[<note> spark_read_file <- function(path) {
invoke_static("FileReader", "read", path)
}]]
[R Package]->[Spark API|
[<note> package org.apache.spark
class FileReader]]
[R Package]->[Spark Extension|
[<note> package spark.extension
class FileReader]]
[R Package]->[Scala Code |
[<note> package scala.extension
class FileReader:
def read = {}]]
', "images/contributing-types-of-extensions.png")
```
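To make the first of these approaches concrete, here is a minimal sketch of a `spark_read_file()` built on `spark_apply()`; the `"|"` delimiter and the assumption that every line has the same number of fields are made up purely for illustration:
```{r eval=FALSE}
spark_read_file <- function(sc, path) {
  spark_read_text(sc, path) %>%
    spark_apply(function(df) {
      # assume "|"-separated fields and the same number of fields per line
      parts <- strsplit(df$line, "|", fixed = TRUE)
      as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
    })
}
```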
We will focus first on extending Spark using the Spark API, since the techniques required to call the Spark API are also applicable while calling Spark extensions or custom Scala code.
## Spark API {#contributing-spark-api}
Before<!--((("contributing", "Spark API")))((("Spark API")))--> we introduce the Spark API, let’s consider a simple and well-known problem. Suppose we want to count the number of lines in a distributed and potentially large text file—say, _cars.csv_, which we initialize as follows:
```{r contributing-prepare-count-lines}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3")
cars <- copy_to(sc, mtcars)
spark_write_csv(cars, "cars.csv", mode = "overwrite")
```
Now, to count how many lines are available in this file, we can run the following:
```{r contributing-count-lines}
spark_read_text(sc, "cars.csv") %>% count()
```
```
# Source: spark<?> [?? x 1]
      n
  <dbl>
1    33
```
Easy enough: we used `spark_read_text()` to read the entire text file, followed by counting lines with `dplyr`'s `count()`. Now, suppose that neither `spark_read_text()`, nor `dplyr`, nor any other Spark functionality, is available to you. How would you ask Spark to count the number of rows in _cars.csv_?
If you do this in Scala, you find in the Spark documentation that by using the Spark API you can count lines in a file as follows:
```{scala contributing-scala-count-lines, eval=FALSE}
val textFile = spark.read.textFile("cars.csv")
textFile.count()
```
So, to use functionality from the Spark API in R, like `spark.read.textFile`, you can use `invoke()`, `invoke_static()`, or `invoke_new()`. (As their names suggest, the first invokes a method on an object, the second invokes a static method on a class, and the third creates a new object.) We can then use these functions to call Spark’s API and execute code equivalent to the Scala snippet above:
```{r contributing-func-count-lines}
spark_context(sc) %>%
invoke("textFile", "cars.csv", 1L) %>%
invoke("count")
```
```
[1] 33
```
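The same pattern applies to static methods through `invoke_static()`; as a minimal sketch, here we call the JVM's built-in `java.lang.Math` class purely for illustration:
```{r eval=FALSE}
# Invoke a static method on a class by name.
invoke_static(sc, "java.lang.Math", "hypot", 10, 20)
```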
While the `invoke()` function was originally designed to call Spark code, it can also call any code available in Java. For instance, we can create a Java `BigInteger` with the following code:
```{r}
invoke_new(sc, "java.math.BigInteger", "1000000000")
```
```
<jobj[225]>
java.math.BigInteger
1000000000
```
As you can see, the object created is not an R object but rather a proper Java object. In R, this Java object is represented as a `spark_jobj`. These objects are meant to be used with the `invoke()` functions or with `spark_dataframe()` and `spark_connection()`. `spark_dataframe()` transforms a `spark_jobj` into a Spark DataFrame when possible, whereas `spark_connection()` retrieves the original Spark connection object, which can be useful to avoid passing the `sc` object across functions.
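As a minimal sketch of these helpers, reusing the `cars` table created earlier in this chapter:
```{r eval=FALSE}
cars_jobj <- spark_dataframe(cars)   # the underlying Spark DataFrame as a spark_jobj
spark_connection(cars_jobj)          # recover the connection from any spark_jobj
cars_jobj %>% invoke("count")        # call DataFrame methods on the spark_jobj directly
```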
While calling the Spark API can be useful in some cases, most of the functionality available in Spark is already supported in `sparklyr`. Therefore, a more interesting way to extend Spark is by using one of its many existing extensions.
```{r echo=FALSE}
spark_disconnect(sc)
```
## Spark Extensions
Before<!--((("contributing", "Spark extensions")))((("extensions", "contributing to")))--> we get started with this section, consider navigating to [spark-packages.org](https://spark-packages.org/), a site that tracks Spark extensions provided by the Spark community. Using the same techniques presented in the previous section, you can use these extensions from R.
For<!--((("Apache Solr")))((("Solr")))--> instance, there is [Apache Solr](http://bit.ly/2MmBfim), a system designed to perform full-text searches over large datasets, something that Apache Spark currently does not support natively. Also, as of this writing, there is no extension for R to support Solr. So, let’s try to solve this by using a Spark extension.
First, if you search spark-packages.org for a Solr extension, you should be able to locate [`spark-solr`](http://bit.ly/2YQnwXw). The extension's "How to" section mentions that `com.lucidworks.spark:spark-solr:2.0.1` should be loaded. We can accomplish this in R using the `sparklyr.shell.packages` configuration option:
```{r eval=FALSE}
config <- spark_config()
config["sparklyr.shell.packages"] <- "com.lucidworks.spark:spark-solr:3.6.3"
config["sparklyr.shell.repositories"] <-
"http://repo.spring.io/plugins-release/,http://central.maven.org/maven2/"
sc <- spark_connect(master = "local", config = config, version = "2.3")
```
While specifying the `sparklyr.shell.packages` parameter is usually enough, for this particular extension, dependencies failed to download from the Spark packages repository. You would need to manually find the failed dependencies in the [Maven repo](https://mvnrepository.com) and add further repositories under the `sparklyr.shell.repositories` parameter.
**Note:** When you use an extension, Spark connects to the Maven package repository to retrieve it. This can take significant time depending on the extension and your download speed. In this case, consider increasing the `sparklyr.connect.timeout` configuration parameter to give Spark enough time to download the required files.
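For instance, a sketch of such a setting; the value here is an arbitrary example:
```{r eval=FALSE}
# Combine this with the packages and repositories settings shown above.
config["sparklyr.connect.timeout"] <- 5 * 60  # example value; adjust to your download speed
```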
From the `spark-solr` documentation, you would find that this extension can be used with the following Scala code:
```{scala eval=FALSE}
val options = Map(
"collection" -> "{solr_collection_name}",
"zkhost" -> "{zk_connect_string}"
)
val df = spark.read.format("solr")
.options(options)
.load()
```
We can translate this to R code:
```{r eval=FALSE}
spark_session(sc) %>%
invoke("read") %>%
invoke("format", "solr") %>%
invoke("option", "collection", "<collection>") %>%
invoke("option", "zkhost", "<host>") %>%
invoke("load")
```
This code will fail, however, since it requires a valid Solr instance, and configuring Solr goes beyond the scope of this book. Still, this example provides insight into how you can build new functionality on top of Spark extensions. It’s also worth mentioning that you can use `spark_read_source()` to read from generic sources and avoid writing custom `invoke()` code.
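For example, here is a hedged sketch of the same read expressed with `spark_read_source()`; the table name, collection, and ZooKeeper host are placeholders:
```{r eval=FALSE}
spark_read_source(
  sc,
  name = "solr_data",    # placeholder name for the resulting table
  source = "solr",
  options = list(collection = "<collection>", zkhost = "<host>")
)
```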
As pointed out in [Overview](#contributing-overview), consider sharing code with others using R packages. While you could require users of your package to specify `sparklyr.shell.packages`, you can avoid this by registering dependencies in your R package. Dependencies are declared under a `spark_dependencies()` function; thus, for the example in this section:
```{r eval=FALSE}
spark_dependencies <- function(spark_version, scala_version, ...) {
spark_dependency(
packages = "com.lucidworks.spark:spark-solr:3.6.3",
repositories = c(
"http://repo.spring.io/plugins-release/",
"http://central.maven.org/maven2/")
)
}
.onLoad <- function(libname, pkgname) {
sparklyr::register_extension(pkgname)
}
```
The `.onLoad()` function is automatically called by R when your package loads. It should call `register_extension()`, which will then call back into `spark_dependencies()` to allow your extension to provide additional dependencies. This example supports Spark 2.4, but you should also support mapping Spark and Scala versions to the correct Spark extension version.
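A sketch of what such a mapping could look like follows; the version thresholds and artifact coordinates are hypothetical and only illustrate the pattern:
```{r eval=FALSE}
spark_dependencies <- function(spark_version, scala_version, ...) {
  # Hypothetical mapping: pick the extension release matching the Spark version
  # the user connects with (the version numbers are for illustration only).
  solr_version <- if (utils::compareVersion(as.character(spark_version), "2.4") >= 0) {
    "3.6.3"  # hypothetical release for newer Spark
  } else {
    "3.5.0"  # hypothetical release for older Spark
  }
  spark_dependency(
    packages = paste0("com.lucidworks.spark:spark-solr:", solr_version)
  )
}
```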
There are about 450 Spark extensions you can use; in addition, you can use any Java library from a [Maven repository](http://bit.ly/2Mp0wrR), where Maven Central alone has over 3 million artifacts. While not all Maven Central libraries might be relevant to Spark, the combination of Spark extensions and Maven repositories certainly opens many interesting possibilities for you to consider!
However, for those cases where no Spark extension is available, the next section will teach you how to use custom Scala code from your own R package.
## Scala Code
Scala<!--((("contributing", "using Scala code")))((("Scala")))--> code enables you to use any method in the Spark API, Spark extensions, or Java library. In addition, writing Scala code when running in Spark can provide performance improvements over R code using `spark_apply()`. In general, the structure of your R package will contain R code and Scala code; however, the Scala code will need to be compiled as JARs (Java ARchive files) and included in your package. Figure \@ref(fig:contributing-scala-code) shows this structure.
```{r contributing-scala-code, eval=TRUE, echo=FALSE, fig.cap='R package structure when using Scala code', fig.align='center', out.height = '200pt', out.width = 'auto'}
render_nomnoml("
#direction: right
[R Package|
[R Code|
[<note>invoke()
invoke_static()
invoke_new()]]-[Scala Code]
[Scala Code|
[<note>package scala.extension
class YourExtension
def yourMethod = {}]]-[JARs]
[JARs|
[<note>extension-spark-1.6.jar
extension-spark-2.0.jar
extension-spark-2.4.jar]]
]", "images/contributing-scala-code.png")
```
As usual, the R code should be placed under a top-level _R_ folder, Scala code under a _java_ folder, and the compiled JARs under an _inst/java_ folder. Though you are certainly welcome to manually compile the Scala code, you can use helper functions to download the required compiler and compile Scala code.
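For reference, such a package could be laid out as follows; the file names are placeholders:
```
mypackage/
├── R/                  # R wrappers built on invoke_static()
├── java/               # Scala sources compiled by compile_package_jars()
└── inst/java/          # compiled JARs shipped with the package
    ├── myextension-2.3.jar
    └── myextension-2.4.jar
```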
To compile Scala code, you'll need to install the Java Development Kit 8 (JDK8 for short). Download the JDK from [Oracle's Java download page](http://bit.ly/2P2UkYM); after installing it, you will need to restart your R session.
You'll also need a [Scala compiler for Scala 2.11 and 2.12](https://www.scala-lang.org/). The Scala compilers can be automatically downloaded and installed using `download_scalac()`:
```{r contributing-download-scalac, eval = FALSE}
download_scalac()
```
Next, you'll need to compile your Scala sources using `compile_package_jars()`. By default, it uses `spark_default_compilation_spec()`, which compiles your sources for the following Spark versions:
```{r contributing-scala-code-spec, eval = FALSE, echo = FALSE}
sapply(sparklyr::spark_default_compilation_spec(), function(e) e$spark_version)
```
```
[1] "1.5.2" "1.6.0" "2.0.0" "2.3.0" "2.4.0"
```
You can also customize this specification by creating custom entries with `spark_compilation_spec()`.
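For instance, a minimal sketch of a custom entry; the Spark version and JAR name below are just examples:
```{r eval=FALSE}
spec <- spark_compilation_spec(
  spark_version = "2.4.0",
  scalac_path = find_scalac("2.11"),       # compiler installed by download_scalac()
  jar_name = "myextension-2.4-2.11.jar"    # example name; match your package
)
compile_package_jars(spec = spec)
```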
While you could create the project structure for the Scala code from scratch, you can also simply call `spark_extension(path)` to create an extension in the given path. This extension will be mostly empty but will contain the appropriate project structure to call the Scala code.
Since `spark_extension()` is registered as a custom project extension in RStudio, you can also create an R package that extends Spark using Scala code from the File menu; click New Project and then select "R Package using Spark" as shown in Figure \@ref(fig:contributing-r-rstudio-project).
```{r contributing-r-rstudio-project, eval=TRUE, fig.align='center', echo=FALSE, fig.cap='Creating a Scala extension package from RStudio'}
render_image("images/contributing-r-rstudio-project.png")
```
Once you are ready to compile your package JARs, you can simply run the following:
```{r eval=FALSE}
compile_package_jars()
```
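The R functions your package exports then typically wrap `invoke_static()` calls into the classes you compiled; a minimal sketch, assuming a hypothetical Scala object `SparklyrExtension` with a `countLines()` method defined under _java/_:
```{r eval=FALSE}
count_lines <- function(sc, path) {
  # "SparklyrExtension" and "countLines" are hypothetical names defined in your Scala sources.
  invoke_static(sc, "SparklyrExtension", "countLines", spark_context(sc), path)
}
```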
Since the JARs are compiled by default into the _inst/_ package path, all the JARs will be included when you build the R package. This means that you can share or publish your R package, and it will be fully functional for R users. For advanced Spark users whose expertise is mostly in Scala, it is compelling to write libraries for R users and the R community in Scala, and then package them as R packages that are easy to consume, use, and share.
If you are interested in developing a Spark extension with R and you get stuck along the way, consider joining the `sparklyr` [Gitter channel](http://bit.ly/33ESccY), where many of us hang out to help this wonderful community to grow. We hope to hear from you soon!
## Recap
This final chapter introduced you to a new set of tools for expanding Spark functionality beyond what R and R packages currently support. This vast new space of libraries includes more than 450 Spark extensions and millions of Java artifacts you can use in Spark from R. Beyond these resources, you also learned how to build Java artifacts from Scala code that can be easily compiled from R and embedded in R packages.
This brings us back to this book's purpose, presented early on; while we know you’ve learned how to perform large-scale computing using Spark in R, we're also confident that you have acquired the knowledge required to help other community members through Spark extensions. We can’t wait to see your new creations, which will surely help grow the Spark and R communities at large.
To close, we hope the first chapters gave you an easy introduction to Spark and R. Following that, we presented analysis and modeling as foundations for using Spark, with the familiarity of R packages you already know and love. You then moved on to learning how to perform large-scale computing in proper Spark clusters. The last third of this book focused on advanced topics: using extensions, distributing R code, processing real-time data, and, finally, giving back by writing Spark extensions using R and Scala code.
We've tried to present the best possible content. However, if you know of any way to improve this book, please open a GitHub issue under the [the-r-in-spark](http://bit.ly/2HdkIZQ) repo, and we'll address your suggestions in upcoming revisions. We hope you enjoyed reading this book, and that you’ve learned as much as we have learned while writing it. We hope it was worthy of your time—it has been an honor having you as our reader.