```{r include=FALSE, eval=TRUE}
knitr::opts_chunk$set(eval = FALSE)
source("r/render.R")
library(ggplot2)
```
# Clusters {#clusters}
> I have a very large army and very large dragons.
>
> --- Daenerys Targaryen
Previous chapters focused on using Spark over a single computing instance, your personal computer. In this chapter, we introduce techniques to run Spark over multiple computing instances, also known as a _computing cluster_. This chapter and subsequent ones will introduce and make use of concepts applicable to computing clusters; however, it’s not required to use a computing cluster to follow along, so you can still use your personal computer. It’s worth mentioning that while previous chapters focused on single computing instances, you can also use all the data analysis and modeling techniques we presented in a computing cluster without changing any code.
If you already have a Spark cluster in your organization, you could consider skipping to [Chapter 7](#connections), which teaches you how to connect to an existing cluster. Otherwise, if you don’t have a cluster or are considering improvements to your existing infrastructure, this chapter introduces the cluster trends, managers, and providers available today.
## Overview {#clusters-overview}
There<!--((("clusters", "overview of", id="Cover06")))--> are three major trends in cluster computing worth discussing: _on-premises_, _cloud_ computing, and _Kubernetes_. Framing these trends over time will help us understand how they came to be, what they are, and what their future might be. To illustrate this, Figure \@ref(fig:clusters-trends) plots these trends over time using data from Google trends.
For on-premises clusters, you or someone in your organization purchased physical computers that were intended to be used for cluster computing. The computers in this cluster are<!--((("off-the-shelf hardware")))((("high-performance hardware")))--> made of _off-the-shelf_ hardware, meaning that someone placed an order to purchase computers usually found on store shelves, or _high-performance_ hardware, meaning that a computing vendor provided highly customized computing hardware, which also comes optimized for high-performance network connectivity, power consumption, and so on.
```{r clusters-trends-code, echo=FALSE}
library(dplyr)
library(ggplot2)

trends_plot <- read.csv("data/clusters-trends.csv", skip = 2) %>%
  mutate(year = as.Date(paste(Month, "-01", sep = ""))) %>%
  mutate(`On-Premise` = `mainframe...Worldwide.`,
         Cloud = `cloud.computing...Worldwide.`,
         Kubernetes = `kubernetes...Worldwide.`) %>%
  tidyr::gather(`On-Premise`, Cloud, Kubernetes,
                key = "trend", value = "popularity") %>%
  ggplot(aes(x = year, y = popularity, group = trend)) +
  geom_line(aes(linetype = trend, color = trend)) +
  scale_x_date(date_breaks = "2 year", date_labels = "%Y") +
  labs(title = "Cluster Computing Trends",
       subtitle = "Search popularity for on-premises (mainframe), cloud computing, and Kubernetes") +
  scale_color_grey(start = 0.6, end = 0.2) +
  geom_hline(yintercept = 0, size = 1, colour = "#333333") +
  theme(axis.title.x = element_blank())

# Save the figure explicitly rather than chaining ggsave() onto the plot
ggsave("images/clusters-trends.png", trends_plot, width = 10, height = 5)
```
```{r clusters-trends, eval=TRUE, echo=FALSE, fig.align='center', out.width='100%', fig.cap='Google trends for on-premises (mainframe), cloud computing, and Kubernetes'}
render_image("images/clusters-trends.png", "Google trends for on-premise (mainframe), cloud computing and Kubernetes")
```
When purchasing hundreds or thousands of computing instances, it doesn’t make sense to keep them in the usual computing cases that we are all familiar with; instead, it makes sense to stack them as efficiently as possible on top of one another to minimize the space they use. This<!--((("racks")))--> group of efficiently stacked computing instances is known as a [_rack_](https://oreil.ly/zKOr-). After a cluster grows to thousands of computers, you will also need to host hundreds of racks of computing devices; at this scale, you would also need significant physical space to host those racks.
A<!--((("datacenters")))--> building that provides racks of computing instances is usually known as a _datacenter_. At the scale of a datacenter, you would also need to find ways to make the building more efficient, especially the cooling system, power supplies, network connectivity, and so on. Since<!--((("Open Compute Project")))--> this is time-consuming, a few organizations have come together to open source their infrastructure under the [Open Compute Project](http://www.opencompute.org/) initiative, which provides a set of datacenter blueprints free for anyone to use.
There is nothing preventing you from building your own datacenter, and, in fact, many organizations have followed this path. For instance, Amazon started as an online bookstore, but over the years it grew to sell much more than just books. Along with its online store growth, its datacenters also grew in size. In 2002, Amazon<!--((("Amazon Web Services (AWS)")))--> considered [renting servers in its datacenters to the public](https://oreil.ly/Nx3BD), and two years later, Amazon Web Services (AWS) launched as a way to let anyone rent servers in the company's datacenters on demand, meaning that you did not need to purchase, configure, maintain, or tear down your own clusters; rather, you could rent them directly from AWS.
This<!--((("cloud computing", "defined")))--> on-demand compute model is what we know today as _cloud computing_. In the cloud, the cluster you use is not owned by you, and it’s not in your physical building; instead it’s a datacenter owned and managed by someone else. Today, there are many<!--((("cloud providers")))--> cloud providers in this space, including AWS, Databricks, Google, Microsoft, Qubole, and many others. Most cloud computing platforms provide a user interface through either a web application or command line to request and manage resources.
While the benefits of processing data in the _cloud_ were obvious for many years, picking a cloud provider had the unintended side effect of locking in organizations with one particular provider, making it hard to switch between providers or back to on-premises clusters. _Kubernetes_, <!--((("Kubernetes")))-->announced by Google in 2014, is an [open source system for managing containerized applications across multiple hosts](https://oreil.ly/u6H5X). In practice, it makes it easier to deploy across multiple cloud providers and on-premises as well.
In summary, we have seen a transition from on-premises to cloud computing and, more recently, Kubernetes. These<!--((("private cloud")))((("public cloud")))((("hybrid cloud")))--> technologies are often loosely described as the _private cloud_, the _public cloud_, and as one of the orchestration services that can enable a _hybrid cloud_, respectively. This chapter walks you through each cluster computing trend in the context of Spark and R.<!--((("", startref="Cover06")))-->
## On-Premise
As<!--((("clusters", "on-premises cluster computing", id="Conprem06")))((("on-premises cluster computing", "overview of")))--> mentioned in the overview section, on-premises clusters represent a set of computing instances procured and managed by staff members from your organization. These clusters can be highly customized and controlled; however, they can also incur higher initial expenses and maintenance costs.
When using on-premises Spark clusters, there are two concepts you should consider:
Cluster manager
: In a similar way to how an operating system (like Windows or macOS) allows you to run multiple applications on the same computer, a cluster manager allows multiple applications to run on the same cluster. You will need to choose one yourself when working with on-premises clusters.
Spark distribution
: While you can install Spark from the Apache Spark site, many organizations partner with companies that can provide support and enhancements to Apache Spark, which we often refer to as Spark _distributions_.
### Managers {#clusters-manager}
To<!--((("cluster managers", "purpose of")))((("on-premises cluster computing", "cluster managers", id="OPCCmanag06")))--> run Spark within a computing cluster, you will need to run software capable of initializing Spark over each physical machine and register all the available computing nodes. This software is known as a [cluster manager](https://oreil.ly/Ye4zH). The available cluster managers in Spark are _Spark Standalone_, _YARN_, _Mesos_, and _Kubernetes_.
**Note:** In<!--((("compute instances")))((("compute nodes")))((("instances")))((("nodes")))--> distributed systems and clusters literature, we often refer to each physical machine as a _compute instance_, _compute node_, _instance_, or _node_.
#### Standalone {#clusters-standalone}
In _Spark Standalone_, Spark<!--((("Standalone clusters", id="standalonec06")))((("cluster managers", "Spark Standalone", id="CMstand06")))((("Spark Standalone", id="spstand06")))--> uses itself as its own cluster manager, which allows you to use Spark without installing additional software in your cluster. This can be useful if you are planning to use your cluster to run only Spark applications; if this cluster is not dedicated to Spark, a generic cluster manager like YARN, Mesos, or Kubernetes would be more suitable. The Spark Standalone [documentation](http://bit.ly/307YtM6) contains detailed information on configuring, launching, monitoring, and enabling high availability, as illustrated in Figure \@ref(fig:clusters-spark-standalone).
However, since Spark Standalone is contained within a Spark installation, by completing [Chapter 2](#starting), you now have a Spark installation available that you can use to initialize a local Spark Standalone cluster on your own machine. In practice, you would want to start the worker nodes on different machines, but for simplicity, we present the code to start a standalone cluster on a single machine.
First, retrieve the `SPARK_HOME` directory by running *`spark_home_dir()`*, and then start the master node and a worker node as follows:
```{r clusters-start-master, message=FALSE}
library(sparklyr)

# Retrieve the Spark installation directory
spark_home <- spark_home_dir()

# Build the path to the spark-class launcher script
spark_path <- file.path(spark_home, "bin", "spark-class")

# Start the cluster manager's master node
system2(spark_path, "org.apache.spark.deploy.master.Master", wait = FALSE)
```
```{r clusters-spark-standalone, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Spark Standalone website'}
render_image("images/clusters-spark-standalone.png", "Spark Standalone Site")
```
The previous command initializes the master node. You can access the master node interface at [_localhost:8080_](http://localhost:8080), as captured in Figure \@ref(fig:clusters-spark-standalone-web). Note that the Spark master URL is specified as _spark://address:port_; you will need this URL to initialize worker nodes.
We can then initialize a single worker using the master URL; you could use a similar approach to initialize multiple workers by running this code multiple times and, potentially, across different machines:
```{r clusters-start-worker}
# Start worker node, find master URL at http://localhost:8080/
system2(spark_path, c("org.apache.spark.deploy.worker.Worker",
"spark://address:port"), wait = FALSE)
```
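Although connecting to clusters is the topic of [Chapter 7](#connections), as a preview, connecting to this standalone cluster would look roughly like the following sketch; replace the placeholder with the master URL shown in the web interface:
```{r clusters-standalone-connect}
# Connect to the local standalone cluster (placeholder master URL)
sc <- spark_connect(master = "spark://address:port",
                    spark_home = spark_home)

# Disconnect when finished
spark_disconnect(sc)
```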
```{r clusters-spark-standalone-web-code, echo=FALSE}
invisible(webshot::webshot(
"http://localhost:8080/",
"images/clusters-spark-standalone-web-ui.png",
cliprect = "viewport",
vheight = 744 * 0.7,
zoom = 2
))
```
```{r clusters-spark-standalone-web, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='The Spark Standalone web interface'}
render_image("images/clusters-spark-standalone-web-ui.png")
```
There is now one worker registered in Spark Standalone. Click the link to this worker node to view details for this particular worker, like available memory and cores, as shown in Figure \@ref(fig:clusters-spark-standalone-webui).
```{r clusters-spark-standalone-webui-code, echo=FALSE}
invisible(webshot::webshot(
"http://localhost:8081/",
"images/clusters-spark-standalone-web-ui-worker.png",
cliprect = "viewport",
vheight = 744 * 0.4,
zoom = 2
))
```
```{r clusters-spark-standalone-webui, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Spark Standalone worker web interface'}
render_image("images/clusters-spark-standalone-web-ui-worker.png")
```
After you are done performing computations in this cluster, you will need to stop the master and worker nodes. You can use the `jps` command to identify the process numbers to terminate. In the following example, `15330` and `15353` are the processes that you can terminate to finalize this cluster. To terminate a process, you can use `system("Taskkill /PID ##### /F")` in Windows, or `system("kill -9 #####")` in macOS and Linux.
```{r}
system("jps")
```
```
15330 Master
15365 Jps
15353 Worker
1689 QuorumPeerMain
```
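For instance, to tear down the example cluster above, you could run something like the following sketch, substituting the process IDs reported on your machine:
```{r clusters-standalone-stop}
# Example PIDs only -- use the ones printed by jps on your machine;
# on Windows, use system("Taskkill /PID 15330 /F") instead
system("kill -9 15330")  # master
system("kill -9 15353")  # worker
```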
You can follow a similar approach to configure a cluster by running the initialization code over each machine in the cluster.
While it’s possible to initialize a simple standalone cluster, configuring a proper Spark Standalone cluster that can recover from computer restarts and failures, and supports multiple users, permissions, and so on, is usually a much longer process that falls beyond the scope of this book. The following sections present several alternatives that can be much easier to manage on-premises or through cloud services. We will start by introducing YARN.<!--((("", startref="spstand06")))((("", startref="CMstand06")))((("", startref="standalonec06")))-->
#### YARN
Hadoop<!--((("cluster managers", "Hadoop YARN")))((("Hadoop YARN")))((("YARN")))--> YARN, or simply YARN, as it is commonly called, is the resource manager of the Hadoop project. It was originally developed in the Hadoop project but was refactored into its own project in Hadoop 2. As we mentioned in [Chapter 1](#intro), Spark was built to speed up computation over Hadoop, and therefore it’s very common to find Spark installed on Hadoop clusters.
One advantage of YARN is that it is likely to be already installed in many existing clusters that support Hadoop; this means that you can easily use Spark with many existing Hadoop clusters without requesting any major changes to the existing cluster infrastructure. It is also very common to find Spark deployed in YARN clusters since many started out as Hadoop clusters and were eventually upgraded to also support Spark.
You can submit YARN applications in two modes: _yarn-client_ and _yarn-cluster_. In yarn-cluster mode, the driver runs remotely on one of the cluster nodes, while in yarn-client mode, the driver runs locally on the machine that submits the application. Both modes are supported, and we explain them further in [Chapter 7](#connections).
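As a rough sketch, and assuming Spark and the Hadoop/YARN client configuration are already available on the machine running R, connecting in each mode looks like this; [Chapter 7](#connections) covers the details:
```{r clusters-yarn-connect}
library(sparklyr)

# Driver runs locally on the machine that submits the application
sc <- spark_connect(master = "yarn-client")

# Alternatively, the driver runs remotely inside the cluster
# sc <- spark_connect(master = "yarn-cluster")
```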
YARN provides a resource management user interface useful to access logs, monitor available resources, terminate applications, and more. After you connect to Spark from R, you will be able to manage the running application in YARN, as shown in Figure \@ref(fig:clusters-hadoop-yarn-site).
Since YARN is the cluster manager from the Hadoop project, you can find YARN’s documentation at [hadoop.apache.org](http://bit.ly/2TDGsCX). You can also reference the "Running Spark on YARN" guide at [spark.apache.org](http://bit.ly/306WsQx).
```{r clusters-hadoop-yarn-site, eval=TRUE, fig.align='center', echo=FALSE, fig.cap="YARN’s Resource Manager running a sparklyr application"}
render_image("images/clusters-yarn-resource-manager.png")
```
#### Mesos
Apache Mesos<!--((("cluster managers", "Apache Mesos")))((("Apache Mesos")))((("Mesos")))--> is an open source project to manage computer clusters. Mesos began as a research project in the UC Berkeley RAD Lab. It<!--((("Cgroups")))((("Linux Cgroups")))--> makes use of Linux [Cgroups](http://bit.ly/2Z9KEeW) to provide isolation for CPU, memory, I/O, and file system access.
Mesos, like YARN, supports executing many cluster frameworks, including Spark. However, one advantage particular to Mesos is that it allows cluster frameworks like Spark to implement custom task schedulers. A scheduler is the component that coordinates which applications in a cluster are allocated execution time and which resources are assigned to them. Spark<!--((("coarse-grained scheduler")))--> uses a [coarse-grained scheduler](https://oreil.ly/9WQvg), which schedules resources for the duration of the application; however, other frameworks<!--((("fine-grained scheduler")))--> might use Mesos’ fine-grained scheduler, which can increase the overall efficiency of the cluster by scheduling tasks in shorter intervals, allowing applications to share resources as they run.
Mesos provides a web interface to manage your running applications, resources, and so on. After connecting to Spark from R, your application will be registered like any other application running in Mesos. Figure \@ref(fig:clusters-mesos-webui) shows a successful connection to Spark from R.
Mesos is an Apache project with its documentation available at [mesos.apache.org](https://mesos.apache.org/). The [_Running Spark on Mesos_](http://bit.ly/31H4LCT) guide is also a great resource if you choose to use Mesos as your cluster manager.<!--((("", startref="OPCCmanag06")))-->
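As with YARN, a connection through Mesos can be sketched as follows; this is a preview only, so replace the host and port with those of your Mesos master and see [Chapter 7](#connections) for details:
```{r clusters-mesos-connect}
library(sparklyr)

# Placeholder Mesos master address
sc <- spark_connect(master = "mesos://host:5050",
                    spark_home = spark_home_dir())
```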
```{r clusters-mesos-webui, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Mesos web interface running Spark and R'}
render_image("images/clusters-mesos-webui.png")
```
### Distributions
You<!--((("Spark distributions", id="spdist06")))((("Cloudera", id="cloudera06")))((("on-premises cluster computing", "Spark distributions", id="OPCCspark06")))--> can use a cluster manager in on-premises clusters, as described in the previous section; however, many organizations—including, but not limited to, Apache Spark—choose to partner with companies providing additional management software, services, and resources to help manage applications in their cluster. Some of the on-premises cluster<!--((("Hortonworks", id="hortonw06")))((("MapR")))--> providers include _Cloudera_, _Hortonworks_, and _MapR_, which we briefly introduce next.
_Cloudera_, Inc., is a US-based software company that provides Apache Hadoop and Apache Spark–based software, support and services, and training to business customers. Cloudera’s hybrid open source Apache Hadoop distribution, Cloudera Distribution Including Apache Hadoop (CDH), targets enterprise-class deployments of that technology. Cloudera donates more than 50% of its engineering output to the various Apache-licensed open source projects (Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Apache Hadoop platform. [Cloudera](http://bit.ly/2KJmcfe) is also a sponsor of the Apache Software Foundation.
Cloudera clusters<!--((("parcels")))--> make use of [_parcels_](http://bit.ly/33LHpxU), which are binary distributions containing program files and metadata. Spark happens to be installed as a parcel in Cloudera. It’s beyond the scope of this book to present how to configure Cloudera clusters, but resources and documentation can be found under [cloudera.com](http://bit.ly/33yUUkp), and ["Introducing sparklyr, an R Interface for Apache Spark"](http://bit.ly/2HbAtjY) on the Cloudera blog.
Cloudera provides the Cloudera Manager web interface to manage resources, services, parcels, diagnostics, and more. Figure \@ref(fig:clusters-cloudera-manager-spark) shows a Spark parcel running in Cloudera Manager, which you can later use to connect from R.
```{r clusters-cloudera-manager-spark, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Cloudera Manager running Spark parcel'}
render_image("images/clusters-cloudera-manager.png")
```
[`sparklyr` is certified with Cloudera](http://bit.ly/2z1yydc), meaning that Cloudera’s support is aware of `sparklyr` and can be effective helping organizations that are using Spark and R. Table \@ref(tab:clusters-cloudera-table) summarizes<!--((("sparklyr package", "versions certified with Cloudera")))--> the versions currently certified.
```{r clusters-cloudera-table, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
`Cloudera Version` = rep("CDH5.9", 3),
`Product` = rep("sparklyr", 3),
Version = c(0.5, 0.6, 0.7),
Components = rep("HDFS, Spark", 3),
Kerberos = rep("Yes", 3)
),
booktabs = TRUE,
caption = 'Versions of sparklyr certified with Cloudera'
)
```
_Hortonworks_ is a big data software company based in Santa Clara, California. The company develops, supports, and provides expertise on an expansive set of entirely open source software designed to manage data and processing for everything from Internet of Things (IoT) to advanced analytics and machine learning. [Hortonworks](http://bit.ly/2KTufpV) believes that it is a data management company bridging the cloud and the datacenter.
[Hortonworks partnered with Microsoft](http://bit.ly/2NbfuBH) to improve support in Microsoft Windows for Hadoop and Spark, which used to be a differentiation point from Cloudera; however, comparing Hortonworks and Cloudera is less relevant today since the [companies merged in January 2019](http://bit.ly/2Mk1UMt). Despite the merger, support for the Cloudera and Hortonworks Spark distributions is still available. Additional resources to configure Spark under Hortonworks are available at [hortonworks.com](http://bit.ly/2Z8M8Kh).
_MapR_ is<!--((("MapR")))--> a business software company headquartered in Santa Clara, California. [MapR](http://bit.ly/33DU8Cs) provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multimodel database management system, and event stream processing, combining analytics in real time with operational applications. Its technology runs on both commodity hardware and public cloud computing services.<!--((("", startref="Conprem06")))((("", startref="OPCCspark06")))((("", startref="spdist06")))((("", startref="cloudera06")))((("", startref="hortonw06")))-->
## Cloud
If<!--((("cloud computing", "overview of")))((("clusters", "cloud cluster computing", id="Ccloud06")))--> you have neither an on-premises cluster nor spare machines to reuse, starting with a cloud cluster can be quite convenient since it will allow you to access a proper cluster in a matter of minutes. This section briefly mentions some of the major cloud infrastructure providers and gives you resources to help you get started if you choose to use a cloud provider.
In cloud services, the compute instances are billed for as long as the Spark cluster runs; your billing starts when the cluster launches, and it stops when the cluster stops. This cost needs to be multiplied by the number of instances reserved for your cluster. So, for instance, if a cloud provider charges \$1.00 per compute instance per hour, and you start a three-node cluster that you use for one hour and 10 minutes, it is likely that you’ll receive a bill for \$1.00 x 2 hours x 3 nodes = \$6.00. Some cloud providers charge by the minute, but at the very least you can count on all of them charging by the compute hour.
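To make the arithmetic concrete, here is a quick back-of-the-envelope estimate in R mirroring the example above; the hourly rate is, of course, hypothetical:
```{r clusters-cost-estimate}
# Hypothetical hourly rate per compute instance, billed by the full hour
rate_per_hour <- 1.00
billed_hours  <- ceiling((60 + 10) / 60)  # 1 hour and 10 minutes bills as 2 hours
nodes         <- 3

rate_per_hour * billed_hours * nodes      # 6, that is, $6.00
```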
Be aware that, while computing costs can be quite low for small clusters, accidentally leaving a cluster running can cause significant billing expenses. Therefore, it's worth taking the extra time to check twice that your cluster is terminated when you no longer need it. It’s also a good practice to monitor costs daily while using clusters to make sure your expectations match the daily bill.
From past experience, you should also plan to request compute resources in advance while dealing with large-scale projects; various cloud providers will not allow you to start a cluster with hundreds of machines before requesting them explicitly through a support request. While this can be cumbersome, it’s also a way to help you control costs in your organization.
Since the cluster size is flexible, it is a good practice to start with small clusters and scale compute resources as needed. Even if you know in advance that a cluster of significant size will be required, starting small provides an opportunity to troubleshoot issues at a lower cost since it’s unlikely that your data analysis will run at scale flawlessly on the first try. As a rule of thumb, grow the instances exponentially; if you need to run a computation over an eight-node cluster, start with one node and an eighth of the entire dataset, then two nodes with a fourth, then four nodes with half of the dataset, and then, finally, eight nodes and the entire dataset. As you become more experienced, you’ll develop a good sense of how to troubleshoot issues, and of the size of the required cluster, and you’ll be able to skip intermediate steps, but for starters, this is a good practice to follow.
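To make this rule of thumb concrete, the following sketch lays out the scale-up plan for a hypothetical target of eight nodes:
```{r clusters-scale-up-plan}
# Doubling plan: grow the cluster and the fraction of the dataset together
data.frame(
  nodes         = 2^(0:3),       # 1, 2, 4, 8 nodes
  data_fraction = 2^(0:3) / 8    # 1/8, 1/4, 1/2, and the entire dataset
)
```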
You can also use a cloud provider to acquire bare computing resources and then install the on-premises distributions presented in the previous section yourself; for instance, you can run the Cloudera distribution on Amazon Elastic Compute Cloud (Amazon EC2). This model avoids procuring colocated hardware while still allowing you to closely manage and customize your cluster. This book presents an overview of only the fully managed Spark services available from cloud providers; however, you can usually find instructions online on how to install on-premises distributions in the cloud.
Some of the major providers of cloud computing infrastructure are Amazon, Databricks, Google, IBM, Microsoft, and Qubole. The subsections that follow briefly introduce each one.
### Amazon {#clusters-amazon-emr}
Amazon<!--((("Amazon Web Services (AWS)", id="aws06")))((("Amazon EMR", id="emr06")))((("cloud computing", "Amazon", id="CCamazon06")))--> provides cloud services through [AWS](https://aws.amazon.com/); more specifically, it provides an on-demand Spark cluster through [Amazon EMR](https://aws.amazon.com/emr/).
Detailed instructions on using R with Amazon EMR were published under Amazon’s Big Data blog in a post called ["Running sparklyr on Amazon EMR"](https://amzn.to/2OYWMQ5). This post introduced the launch of `sparklyr` and instructions to configure Amazon EMR clusters with `sparklyr`. For instance, it suggests you can use the [Amazon Command Line Interface](https://aws.amazon.com/cli/) to launch a cluster with three nodes, as follows:
```{bash eval=FALSE}
aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Hive \
--release-label emr-5.8.0 --service-role EMR_DefaultRole --instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.2xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.2xlarge \
--bootstrap-action Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-\
rstudio-sparklyr/rstudio_sparklyr_emr5.sh,Args=["--user-pw", "<password>", \
"--rstudio", "--arrow"] --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole
```
You can then see the cluster launching and eventually running under the AWS portal, as illustrated in Figure \@ref(fig:clusters-amazon-emr-launching).
You then can navigate to the Master Public DNS and find RStudio under port 8787—for example, `ec2-12-34-567-890.us-west-1.compute.amazonaws.com:8787`—and then log in with user `hadoop` and password `password`.
```{r clusters-amazon-emr-launching, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Launching an Amazon EMR cluster'}
render_image("images/clusters-amazon-emr-launching.png")
```
It is also possible to launch the Amazon EMR cluster using the web interface; the same introductory post contains additional details and walkthroughs specifically designed for Amazon EMR.
Remember to turn off your cluster to avoid unnecessary charges and use appropriate security restrictions when starting Amazon EMR clusters for sensitive data analysis.
Regarding cost, you can find the most up-to-date information at [Amazon EMR Pricing](https://amzn.to/2YRGb5r). Table \@ref(tab:clusters-amazon-pricing) presents some of the instance types available in the `us-west-1` region (as of this writing); this is meant to provide a glimpse of the resources and costs associated with cloud processing. Notice that the "EMR price is in addition to the Amazon EC2 price (the price for the underlying servers)".
```{r clusters-amazon-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
Instance = c( "c1.medium", "m3.2xlarge", "i2.8xlarge"),
CPUs = c( 2, 8, 32),
Memory = c( "1.7GB", "30GB", "244GB"),
Storage = c( "350GB", "160GB", "6400GB"),
`EC2 Cost` = c("$0.148 USD/hr", "$0.616 USD/hr", "$7.502 USD/hr"),
`EMR Cost` = c("$0.030 USD/hr", "$0.140 USD/hr", "$0.270 USD/hr")
),
booktabs = TRUE,
caption = "Amazon EMR pricing information"
)
```
**Note:** We are presenting only a subset of the available compute instances for Amazon and subsequent cloud providers as of 2019; however, note that hardware (CPU speed, hard drive speed, etc.) varies between vendors and locations; therefore, you can’t use these hardware tables as an accurate price comparison. Accurate comparison would require running your particular workloads and considering other aspects beyond compute instance cost.<!--((("", startref="CCamazon06")))((("", startref="emr06")))((("", startref="aws06")))-->
### Databricks
[Databricks](https://databricks.com) is<!--((("Databricks")))((("AMPLab project")))((("cloud computing", "Databricks")))--> a company founded by the creators of Apache Spark, whose aim is to help clients with cloud-based big data processing using Spark. Databricks grew out of the [AMPLab](https://oreil.ly/W2Eoe) project at the University of California, Berkeley.
Databricks provides enterprise-level cluster computing plans as well as a free/community tier to explore functionality and become familiar with their environment.
After<!--((("RStudio", "Databricks and")))--> a cluster is launched, you can use R and `sparklyr` from Databricks notebooks following the steps provided in [Chapter 2](#starting) or by installing [RStudio on Databricks](http://bit.ly/2KCDax6). Figure \@ref(fig:clusters-databricks-notebook) shows a Databricks notebook using Spark through `sparkylyr`.
```{r clusters-databricks-notebook, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Databricks community notebook running sparklyr'}
render_image("images/clusters-databricks-sparklyr.png")
```
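As a quick reference, a connection from code running inside Databricks can be sketched as follows; the `method = "databricks"` argument is documented in the resources linked below:
```{r clusters-databricks-connect}
library(sparklyr)

# Minimal sketch: attach to the Spark session managed by Databricks
sc <- spark_connect(method = "databricks")
```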
Additional resources are available under the Databricks Engineering Blog post ["Using sparklyr in Databricks"](http://bit.ly/2N59jyR) and the [Databricks documentation for `sparklyr`](http://bit.ly/2MkOYWC).
You can find the latest pricing
information at [_databricks.com/product/pricing_](http://bit.ly/305Rnrt). Table \@ref(tab:clusters-databricks-pricing) lists the available plans as of this writing.
```{r clusters-databricks-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
Plan = c( "AWS Standard", "Azure Standard", "Azure Premium"),
Basic = c("$0.07 USD/DBU", "", ""),
`Data Engineering` = c("$0.20 USD/DBU", "$0.20 USD/DBU", "$0.35 USD/DBU"),
`Data Analytics` = c("$0.40 USD/DBU", "$0.40 USD/DBU", "$0.55 USD/DBU")
),
booktabs = TRUE,
caption = "Databricks pricing information"
)
```
Notice<!--((("DBU per hour")))--> that pricing is based on cost of DBU per hour. From Databricks, "a [Databricks Unit](https://oreil.ly/3muQq) (DBU) is a unit of Apache Spark processing capability per hour. For a varied set of instances, DBUs are a more transparent way to view usage instead of the node-hour".
### Google
Google<!--((("cloud computing", "Google")))((("Google Cloud Platform (GCP)")))--> provides Google Cloud Dataproc as a cloud-based managed Spark and Hadoop service offered on Google Cloud Platform (GCP). Dataproc utilizes many GCP technologies, such as Google Compute Engine and Google Cloud Storage, to offer fully managed clusters running popular data processing frameworks such as Apache Hadoop and Apache Spark.
You can easily create a cluster from the Google Cloud console or the Google Cloud command-line interface (CLI) as illustrated in Figure \@ref(fig:clusters-google-dataproc-launch).
```{r clusters-google-dataproc-launch, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Launching a Dataproc cluster'}
render_image("images/clusters-dataproc-launching.png")
```
After you've created your cluster, you can forward ports to access it from your machine; for instance, you can open an SSH tunnel to the master node and then launch Chrome through this proxy to securely connect to the Dataproc cluster. Configuring this connection looks as follows:
```{bash eval=FALSE}
# Open a SOCKS proxy tunnel to the cluster's master node
gcloud compute ssh sparklyr-m --project=<project> --zone=<region> -- -D 1080 -N

# Launch Chrome through the proxy to reach the cluster web interfaces
"<path to chrome>" --proxy-server="socks5://localhost:1080" \
  --user-data-dir="/tmp/sparklyr-m" http://sparklyr-m:8088
```
There are various [tutorials available](http://bit.ly/2OYyo18) (cloud.google.com/dataproc/docs/tutorials), including a comprehensive [tutorial to configure RStudio and `sparklyr`](http://bit.ly/2MhSgKg).
You can find the latest pricing information at [_cloud.google.com/dataproc/pricing_](http://bit.ly/31J0uyC). In Table \@ref(tab:clusters-google-pricing), notice that the cost is split between the Compute Engine cost and a Dataproc premium.
```{r clusters-google-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
Instance = c( "n1-standard-1", "n1-standard-8", "n1-standard-64"),
CPUs = c( 1, 8, 64),
Memory = c( "3.75GB", "30GB", "244GB"),
`Compute Engine` = c("$0.0475 USD/hr", "$0.3800 USD/hr", "$3.0400 USD/hr"),
`Dataproc Premium` = c( "$0.010 USD/hr", "$0.080 USD/hr", "$0.640 USD/hr")
),
booktabs = TRUE,
caption = "Google Cloud Dataproc pricing information"
)
```
### IBM
IBM cloud<!--((("cloud computing", "IBM")))((("IBM Cloud")))--> computing is a set of cloud computing services for business. IBM cloud includes Infrastructure as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS) offered through public, private, and hybrid cloud delivery models, in addition to the components that make up those clouds.
From within IBM Cloud, open Watson Studio and create a Data Science project, add a Spark cluster under the project settings, and then, on the Launch IDE menu, start RStudio. Please note that, as of this writing, the provided version of `sparklyr` was not the latest version available in CRAN, since `sparklyr` was modified to run under the IBM Cloud. In any case, follow IBM's documentation as an authoritative reference to run R and Spark on the IBM Cloud and particularly on how to upgrade `sparklyr` appropriately. Figure \@ref(fig:clusters-ibm-portal) captures IBM’s Cloud portal launching a Spark cluster.
```{r clusters-ibm-portal, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='IBM Watson Studio launching Spark with R support'}
render_image("images/clusters-ibm-sparklyr.png")
```
The most up-to-date pricing information is available at [_ibm.com/cloud/pricing_](https://www.ibm.com/cloud/pricing). In Table \@ref(tab:clusters-ibm-pricing), compute cost was normalized using 31 days from the per-month costs.
```{r clusters-ibm-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
Instance = c( "C1.1x1x25", "C1.4x4x25", "C1.32x32x25"),
CPUs = c( 1, 4, 32),
Memory = c( "1GB", "4GB", "25GB"),
Storage = c( "25GB", "25GB", "25GB"),
Cost = c("$0.033 USD/hr", "$0.133 USD/hr", "$0.962 USD/hr")
),
booktabs = TRUE,
caption = "IBM Cloud pricing information"
)
```
### Microsoft
Microsoft Azure<!--((("Microsoft Azure")))((("Azure")))((("cloud computing", "Microsoft")))--> is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through a global network of Microsoft-managed datacenters. It provides SaaS, PaaS, and IaaS and supports many different programming languages, tools, and frameworks, including both Microsoft-specific and third-party software and systems.
From the Azure portal, the Azure<!--((("Microsoft HDInsight")))((("HDInsight")))--> HDInsight service provides support for on-demand Spark clusters. You can easily create an HDInsight cluster with support for Spark and RStudio by selecting the ML Services cluster type. Note that the provided version of `sparklyr` might not be the latest version available on CRAN, since the default package repository seems to be initialized from a Microsoft R Application Network (MRAN) snapshot rather than directly from CRAN. Figure \@ref(fig:clusters-azure-hdinsight-mlservices) shows the Azure portal launching a Spark cluster with support for R.
```{r clusters-azure-hdinsight-mlservices, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Creating an Azure HDInsight Spark cluster'}
render_image("images/clusters-azure-mlservices.png")
```
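If you need a more recent `sparklyr` than the snapshot provides, one option is to install it explicitly from CRAN; this is only a sketch, so check Microsoft's documentation first:
```{r clusters-azure-sparklyr-cran}
# Install the latest sparklyr from CRAN instead of the default MRAN snapshot
install.packages("sparklyr", repos = "https://cloud.r-project.org")
```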
Up-to-date pricing for HDInsight is available at [_azure.microsoft.com/en-us/pricing/details/hdinsight_](http://bit.ly/2H9Ce0X); Table \@ref(tab:clusters-azure-pricing) lists the pricing as of this writing.
```{r clusters-azure-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
Instance = c( "D1 v2", "D4 v2", "G5"),
CPUs = c( 1, 8, 64),
Memory = c( "3.5 GB", "28 GB", "448 GB"),
`Total Cost` = c("$0.074/hour", "$0.59/hour", "$9.298/hour")
),
booktabs = TRUE,
caption = "Azure HDInsight pricing information"
)
```
### Qubole
[Qubole](https://www.qubole.com) was<!--((("cloud computing", "Qubole")))((("Qubole")))--> founded in 2013 with a mission to close the data accessibility gap. Qubole delivers a self-service platform for big data analytics built on Amazon, Microsoft, Google, and Oracle Clouds. In Qubole, you can launch Spark clusters, which you can then use from [Qubole notebooks](http://bit.ly/33ChKYk) or RStudio Server. Figure \@ref(fig:clusters-qubole-notebook) shows a Qubole cluster initialized with RStudio and `sparklyr`.
```{r clusters-qubole-notebook, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='A Qubole cluster initialized with RStudio and sparklyr'}
render_image("images/clusters-qubole-sparklyr.png")
```
You can find the latest pricing information at [Qubole's pricing page](https://bit.ly/2DKPvhP). Table \@ref(tab:clusters-qubole-pricing) lists the price for Qubole's current plan, as of this writing. Notice that pricing is based on cost of QCU/hr, which stands for “Qubole Compute Unit per hour,” and the Enterprise Edition requires an annual contract.<!--((("", startref="Ccloud06")))-->
```{r clusters-qubole-pricing, eval=TRUE, echo=FALSE}
knitr::kable(
data.frame(
`Test Drive` = "$0 USD",
`Full-Featured Trial` = "$0 USD",
`Enterprise Edition` = "$0.14 USD/QCU"
),
booktabs = TRUE,
caption = "Qubole pricing information"
)
```
## Kubernetes
Kubernetes<!--((("Kubernetes")))((("clusters", "Kubernetes cluster computing")))--> is an open source container orchestration system for automating deployment, scaling, and management of containerized applications that was originally designed by Google<!--((("Cloud Native Computing Foundation (CNCF)")))--> and is now maintained by the [Cloud Native Computing Foundation](https://www.cncf.io/) (CNCF). Kubernetes<!--((("Docker")))--> was originally based on [Docker](https://www.docker.com/), while, like Mesos, it’s also based on<!--((("Linux Cgroups")))((("Cgroups")))--> Linux Cgroups.
Kubernetes can execute many cluster applications and frameworks that you can highly customize by using container images with specific resources and libraries. This allows a single Kubernetes cluster to be used for many different purposes beyond data analysis, which in turn helps organizations manage their compute resources with ease. Using custom images adds some configuration overhead, but in exchange it makes Kubernetes clusters extremely flexible. This flexibility has proven instrumental in easily administering cluster resources in many organizations and, as pointed out in [Overview](#clusters-overview), Kubernetes is becoming a very popular cluster framework.
Kubernetes is supported across all major cloud providers. They all provide extensive documentation as to how to launch, manage, and tear down Kubernetes clusters; Figure \@ref(fig:clusters-kubernetes-google-console) shows the GCP console while creating a Kubernetes cluster. You can deploy Spark over any Kubernetes cluster, and you can use R to connect, analyze, model, and more.
```{r clusters-kubernetes-google-console, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Creating a Kubernetes cluster for Spark and R using Google Cloud'}
render_image("images/clusters-kubernetes-google-console.png", "Creating a Kubernetes cluster for Spark and R using Google Cloud")
```
You can learn more at [kubernetes.io](https://kubernetes.io/), and read the _Running Spark on Kubernetes_ guide from [spark.apache.org](http://bit.ly/2KAZze7).
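To give a sense of what this looks like from R, the following is a rough, untested sketch of connecting `sparklyr` to Spark running on Kubernetes; the API server URL and container image are placeholders you would replace with your own, and additional settings are likely needed in practice:
```{r clusters-kubernetes-connect}
library(sparklyr)

# Placeholder container image that bundles Spark and its dependencies
conf <- spark_config()
conf$spark.kubernetes.container.image <- "<spark-container-image>"

# Placeholder Kubernetes API server address
sc <- spark_connect(
  master = "k8s://https://<kubernetes-api-server>:443",
  config = conf
)
```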
Strictly speaking, Kubernetes is a cluster technology, not a specific cluster architecture. However, Kubernetes represents a larger trend<!--((("hybrid cloud")))--> often referred to as a _hybrid cloud_. A hybrid cloud is a computing environment that makes use of on-premises and public cloud services, with orchestration between the various platforms. It’s still too early to know precisely which technologies will end up defining the hybrid approach to cluster computing; although Kubernetes is currently the leading one, as previously mentioned, more technologies are likely to emerge to complement or even replace existing ones.
## Tools
While<!--((("clusters", "complementary tools used for", id="Ctools06")))--> using only R and Spark can be sufficient for some clusters, it is common to install complementary tools in your cluster to improve monitoring, SQL analysis, workflow coordination, and more, with applications<!--((("Ganglia")))((("Hue")))((("Oozie")))--> like [Ganglia](http://ganglia.info/), [Hue](http://gethue.com/), and [Oozie](https://oozie.apache.org), respectively. This section is not meant to cover all tools; rather, it mentions the ones that are commonly used.
### RStudio
From<!--((("RStudio Server")))--> reading [Chapter 1](#intro), you are aware that RStudio is a well-known and free desktop development environment for R; therefore, it is likely that you are following the examples in this book using RStudio Desktop. However, you might not be aware that you can run RStudio as a web service within a Spark cluster. This version of RStudio is known as _RStudio Server_. You can see RStudio Server running in Figure \@ref(fig:clusters-rstudio-server). In the same way that the Spark UI runs in the cluster, you can install RStudio Server within the cluster. Then you can connect to RStudio Server and use RStudio in exactly the same way you use RStudio Desktop but with the ability to run code against the Spark cluster. As you can see in Figure \@ref(fig:clusters-rstudio-server), RStudio Server looks and feels just like RStudio Desktop, but adds support to run commands efficiently by being located within the cluster.
```{r clusters-rstudio-server, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='RStudio Server Pro running inside Apache Spark'}
render_image("images/clusters-rstudio-server.png", "RStudio Server")
```
If you're familiar with R, you probably know that Shiny Server<!--((("Shiny Server")))--> is a very popular tool for building interactive web applications from R. We recommend that you install Shiny Server directly in your Spark cluster as well.
RStudio Server and Shiny Server are free and open source; however, RStudio also provides professional products like [RStudio Server Pro](http://bit.ly/2KCaxQn), [Shiny Server Pro](http://bit.ly/30aV0fK), and [RStudio Connect](http://bit.ly/306fHcY), which you can install within the cluster to support additional R workflows. While `sparklyr` does not require any of these additional tools, they provide significant productivity gains worth considering. You can learn more about them at [_rstudio.com/products/_](http://bit.ly/2MihHLP).
### Jupyter
Project<!--((("Jupyter")))--> [Jupyter](http://jupyter.org/) exists to develop open source software, open standards, and services for interactive computing across dozens of programming languages. A Jupyter notebook provides support for various programming languages, including R. You can use `sparklyr` with Jupyter notebooks using the R Kernel. Figure \@ref(fig:clusters-jupyter-sparklyr) shows `sparklyr` running within a local Jupyter notebook.
```{r clusters-jupyter-sparklyr, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Jupyter notebook running sparklyr'}
render_image("images/clusters-jupyter.png", "Jupyter notebook running sparklyr")
```
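As an illustration, a first cell in an R-kernel Jupyter notebook could look like the following sketch; the local master here is just a stand-in for whichever cluster you connect to:
```{r clusters-jupyter-cell}
library(sparklyr)

# Connect (a local master is used here purely for illustration)
sc <- spark_connect(master = "local")

# Create a tiny Spark DataFrame to confirm the connection works
sdf_len(sc, 3)

spark_disconnect(sc)
```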
### Livy {#clusters-livy}
[Apache Livy](http://bit.ly/2L2TZAn) is<!--((("Apache Livy")))((("Livy")))--> an incubation project in Apache providing support to use Spark clusters remotely through a web interface. Connecting directly to the Spark cluster is usually preferable; however, there are times when a direct connection is not feasible. When facing those constraints, you can consider installing Livy in the cluster and securing it properly to enable remote use over web protocols. Be aware, though, that there is a significant performance overhead from using Livy with `sparklyr`.
To help test Livy locally, `sparklyr` provides support to list, install, start, and stop a local Livy instance. Start by listing the Livy versions available for installation with `livy_available_versions()`:
```{r clusters-livy-header, echo=FALSE, eval=TRUE}
library(sparklyr)
# List versions of Livy available to install
livy_available_versions()
```
This lists the versions that you can install; we recommend installing the latest version and verifying it as follows:
```{r clusters-livy-install}
# Install default Livy version
livy_install()
# List installed Livy services
livy_installed_versions()
# Start the Livy service
livy_service_start()
```
You then can navigate to the local Livy session at http://localhost:8998. [Chapter 7](#connections) will detail how to connect through Livy. After you're connected, you can navigate to the Livy web application, as shown in Figure \@ref(fig:clusters-livy-local).
```{r clusters-livy-local, eval=TRUE, fig.width=4, fig.align='center', echo=FALSE, fig.cap='Apache Livy running as a local service'}
render_image("images/clusters-livy-local.png", "Apache Livy running as a local service")
```
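As a preview of [Chapter 7](#connections), a connection through this local Livy service can be sketched as follows:
```{r clusters-livy-connect}
library(sparklyr)

# Connect to Spark through the local Livy service started above
sc <- spark_connect(master = "http://localhost:8998", method = "livy")

# Disconnect when finished
spark_disconnect(sc)
```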
Make<!--((("", startref="Ctools06")))--> sure you also stop the Livy service when working with local Livy instances (for proper Livy services running in a cluster, you won’t have to):
```{r clusters-livy-stop}
# Stops the Livy service
livy_service_stop()
```
## Recap
This chapter explained the history and trade-offs of on-premises and cloud computing and presented Kubernetes as a promising framework to provide flexibility across on-premises or multiple cloud providers. It also introduced cluster managers (Spark Standalone, YARN, Mesos, and Kubernetes) as the software needed to run Spark as a cluster application. This chapter briefly mentioned on-premises cluster providers like Cloudera, Hortonworks, and MapR, as well as the major Spark cloud providers: Amazon, Databricks, IBM, Google, Microsoft, and Qubole.
While this chapter provided a solid foundation for understanding current cluster computing trends, tools, and service providers useful for performing data science at scale, it did not provide a comprehensive framework to help you decide which cluster technologies to choose. Instead, use this chapter as an overview and a starting point to seek out additional resources to help you find the cluster stack that best fits your organization's needs.
[Chapter 7](#connections) will focus on understanding how to connect to existing clusters; therefore, it assumes a Spark cluster like those we presented in this chapter is already available to you.