This initialization action installs BigDL on a Dataproc cluster. BigDL is a distributed deep learning library for Apache Spark. See the GitHub BigDL website for more information.
This script also supports Intel Analytics Zoo, which includes BigDL. See the GitHub Analytics Zoo website for more information.
Use this initialization action to create a Dataproc cluster with BigDL's Spark and PySpark libraries installed.
Note: In the following examples, the --initialization-action-timeout 10m flag sets a 10-minute timeout to allow for the time needed to install BigDL on cluster nodes.
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m
By default, this initialization action script downloads BigDL 0.7.2 for Dataproc 1.3 (Spark 2.3.0 and Scala 2.11.8). To download a different BigDL or Analytics Zoo distribution version, or one built for a different version of Spark or Scala, find the download URL on the BigDL releases page or in the Maven repository, then set the bigdl-download-url metadata key to that URL. The URL should end in -dist.zip.
Example for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.0 \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip'
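The download URL in the example above appears to follow a regular Maven path pattern. As a minimal sketch, assuming that pattern also holds for other versions (it is inferred from this one URL, not documented here), a small helper can assemble the URL from the Spark, Scala, and BigDL versions. The function name bigdl_dist_url is hypothetical, and not every version combination is necessarily published, so confirm that the artifact exists in the Maven repository before using the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a BigDL dist URL following the pattern of the
# example URL above. The pattern is an assumption inferred from that URL.
bigdl_dist_url() {
  local spark_version="$1" scala_version="$2" bigdl_version="$3"
  local artifact="dist-spark-${spark_version}-scala-${scala_version}-all"
  local base="https://repo1.maven.org/maven2/com/intel/analytics/bigdl"
  echo "${base}/${artifact}/${bigdl_version}/${artifact}-${bigdl_version}-dist.zip"
}

# Rebuilds the URL used in the Dataproc 1.0 example.
BIGDL_URL="$(bigdl_dist_url 1.6.2 2.10.5 0.7.2)"
echo "${BIGDL_URL}"
```

The resulting value can then be passed to the cluster as --metadata "bigdl-download-url=${BIGDL_URL}".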
Example for Dataproc 1.3 (Spark 2.3) and Analytics Zoo 0.4.0 with BigDL v0.7.2:
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.3 \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip'
- You cannot use preemptible VMs with this initialization action, and you cannot scale (add or remove workers from) the cluster. BigDL expects a fixed number of Spark executors and cores per executor to make optimizations for Intel's MKL library (shipped with BigDL). This initialization action statically sets spark.executor.instances based on the original size of the cluster and disables dynamic allocation (spark.dynamicAllocation.enabled=false).
- This initialization action sets spark.executor.instances so that a single application uses all cluster resources. To run multiple applications simultaneously, override spark.executor.instances on each job by adding the --properties flag to gcloud dataproc jobs submit [spark|pyspark|spark-sql] or the --conf flag to spark-shell/spark-submit. Note that each application schedules an app master in addition to the executors.
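To illustrate the per-job override described above, here is a minimal sketch that splits a fixed executor budget evenly across concurrent applications and prints the property to pass on each job. TOTAL_EXECUTORS, NUM_APPS, and the executors_per_app helper are illustrative assumptions, not values read from the cluster. An even split is only one possible policy, and because each application also schedules an app master, you may need to leave some headroom.

```shell
#!/usr/bin/env bash
# Hypothetical helper: evenly divides an executor budget among applications.
executors_per_app() {
  local total="$1" apps="$2"
  # Integer division; any remainder is left unused.
  local share=$(( total / apps ))
  # Give each application at least one executor.
  if (( share < 1 )); then
    share=1
  fi
  echo "${share}"
}

# Illustrative values only; substitute the numbers for your own cluster.
TOTAL_EXECUTORS=8
NUM_APPS=2
PER_APP="$(executors_per_app "${TOTAL_EXECUTORS}" "${NUM_APPS}")"

# Print the property to pass on each job submission.
echo "spark.executor.instances=${PER_APP}"
```

The printed value can then be supplied per job, for example with --properties spark.executor.instances=4 on gcloud dataproc jobs submit spark, or with --conf spark.executor.instances=4 on spark-submit.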