-
Notifications
You must be signed in to change notification settings - Fork 385
Running A Spark Standalone Cluster
Note: Standalone mode is only supported on the dev branch right now.
- Download and compile Spark as described in the README.
- If you wish to use the EC2 Scripts (described below), you must have an Amazon AWS Account and an EC2 key pair.
In order to run a Spark cluster there are two main points of configuration, the conf/spark-env.sh
file, and the conf/slaves
file. the conf/spark-env.sh
file lets you specify settings for the master and slave instances, such as memory, or port numbers to bind on. The file itself is well documented and all configuration variables are explained.
The conf/slaves
file contains a list of all machines where you would like to start a Spark slave (worker) instance when using then scripts below. The master machine must be able to access each of the slave machines via ssh. For testing purposes, you can have a single localhost
entry in the slaves file.
In order to make starting master and slave instances easier, we have provided Hadoop-style shell scripts. The scripts can be found in the bin
directory. A quick overview:
-
bin/start_master
- Starts a master instance on the machine the script is executed on. -
bin/start_slaves
- Starts a slave instance on each machine specified in theconf/slaves
file. -
bin/start_all
- Starts both a master and a number of slaves as described above. -
bin/stop_master
- Stops the master that was started via thebin/start_master
script. -
bin/stop_slaves
- Stops the slave intances that were started via thebin/start_slaves
script. -
bin/stop_all
- Stops both the master and the slaves as described above.
Note that the scripts must be executed on the machine you want to start the Spark master on, not your local machine. By default, both the master and slaves write their log files to the logs
in the Spark home directory.
To run a job against a standalone cluster, pass a URI of the form spark://<SPARK_MASTER_IP>:<SPARK_MASTER_PORT>
as the Master's URI when setting up a Spark Context (or using the interactive shell).
In order to allow for easy debugging of your cluster we have provided a simple web interface that described the current state of the cluster and jobs. By default you can access it via http://<SPARK_MASTER_IP>:8080
, but you can specify a different port number inconf/spark_env.sh
.
To save you from needing to set up a cluster of Spark machines yourself, we provide a set of scripts that launch Amazon EC2 instances with a preinstalled Spark distribution. These scripts are identical to the EC2 Mesos Scripts, except that you need to execute ec2/spark-ec2
with the following additional parameters: --cluster-type standalone -a standalone --user ec2-user
. Note that the Spark version on these machines may not reflect the latest changes, so it may be a good idea to ssh into the machines and merge the latest version from github.