Cluster Spinup Tool
Use the Cluster Spinup Tool to bring up a Sparkline BI Accelerator cluster on AWS. The cluster is set up with EMR (Hadoop + Spark), Druid 0.9.0, and the Sparkline Accelerator. The cluster starts with Druid and the Sparkline Thriftserver running, optionally serving the Druid DataSource specified in the configuration files.
The tool is driven by a ConfigFile; use it to specify the following information:
- The Cluster Name
- The location of configuration files for Druid, Spark, and Sparkline.
- Your EC2 keypair
- An AWS SecurityGroupID for the machines spun up.
- The InstanceTypes for the master and slave nodes. For example, “m3.xlarge”.
- The size of the cluster.
- Bid price for the machines.
- The availability zone to spin up the machines in.
Here is an example ConfigFile to spin up a cluster for the TPCH-1 demo. The ConfigFile provides step-by-step documentation on each setting.
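As a rough illustration of what such a file covers, a ConfigFile for the settings listed above might look like the sketch below. The field names here are hypothetical, not the tool's actual schema; the shipped example ConfigFile is the authority on the real format.

```yaml
# Hypothetical ConfigFile sketch -- field names are illustrative only.
cluster_name: tpch-demo               # the cluster name
config_folder: configuration/         # Druid, Spark, and Sparkline config files
ec2_key_pair: my-keypair              # your EC2 keypair
security_group_id: sg-0123456789      # AWS SecurityGroupID for the machines
master_instance_type: m3.xlarge       # InstanceType for the master node
slave_instance_type: m3.xlarge        # InstanceType for the slave nodes
num_slaves: 4                         # the size of the cluster
bid_price: "0.10"                     # spot bid price for the machines
availability_zone: us-east-1a         # where to spin up the machines
```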
Detailed configurations live under a configuration folder, structured as follows:
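Based on the sub-folders described below, the layout is roughly:

```
configuration/
  _common/       # shared Druid settings: extensions, ZooKeeper, metadata store, deep storage
  coordinator/   # runtime.properties and JVM settings
  broker/        # runtime.properties and JVM settings
  historical/    # runtime.properties and JVM settings
  overlord/      # runtime.properties and JVM settings
  sparkline/     # sparkline.spark.properties, the Sparkline jar, and the DDL script
```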
Here is a detailed configuration example for the TPCH-1 demo.
The folder contains the configurations for all the Druid daemons. See the Druid documentation for details on the different configuration options. You can set up all the configuration scripts here; they are then deployed to the cluster.
See the Druid documentation for details. These settings are picked up from the _common sub-folder. The Druid cluster is set up to run with EMR Hadoop and also has the s3, hdfs, and mysql extensions installed, so you can enable these in the _common settings. You can point to an existing MySQL instance (the spinup tool doesn't install MySQL).
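For illustration, a minimal _common configuration wired to ZooKeeper on the master node, an existing MySQL metadata store, and S3 deep storage might look like this. These are standard Druid 0.9.0 property names, but all hosts, buckets, and credentials are placeholders:

```properties
# _common/common.runtime.properties (illustrative values)
druid.extensions.loadList=["druid-s3-extensions","druid-hdfs-storage","mysql-metadata-storage"]

# ZooKeeper runs on the master node
druid.zk.service.host=<master-node-host>

# Point to an existing MySQL instance (the spinup tool doesn't install MySQL)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://<mysql-host>:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>

# S3 deep storage -- the bucket must be readable by the historicals
druid.storage.type=s3
druid.storage.bucket=<your-bucket>
druid.storage.baseKey=druid/segments
```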
See the Druid documentation for details. The coordinator and broker sub-folders have their runtime.properties and JVM settings.
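As a sketch, minimal runtime.properties for these two daemons might look like the following. The property names are standard Druid; ports and thread counts are illustrative:

```properties
# coordinator/runtime.properties (illustrative)
druid.service=druid/coordinator
druid.port=8081
druid.coordinator.startDelay=PT30S

# broker/runtime.properties (illustrative)
druid.service=druid/broker
druid.port=8082
druid.broker.http.numConnections=20
druid.processing.numThreads=7
```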
See the Druid documentation for details. We set up the slaves with 2 mount points (/mnt, /mnt1) to use as Druid local storage. The processing threads, HTTP threads, and JVM settings should be adjusted based on the class of machine used. The historical sub-folder has the runtime.properties and JVM settings.
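A sketch of a historical runtime.properties using the two mount points might look like this. Sizes and thread counts are illustrative and should be tuned to the machine class:

```properties
# historical/runtime.properties (illustrative)
druid.service=druid/historical
druid.port=8083

# The two mount points used as Druid local segment storage
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":100000000000},{"path":"/mnt1/druid/segment-cache","maxSize":100000000000}]
druid.server.maxSize=200000000000

# Tune these to the machine class
druid.processing.numThreads=7
druid.server.http.numThreads=25
```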
See the Druid documentation for details. Druid is set up to run indexing using Hadoop, and the Hadoop client is set up to work with EMR. The overlord sub-folder has the runtime.properties and JVM settings.
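For illustration, an overlord runtime.properties along these lines (standard Druid 0.9.0 property names; the runner type and paths are assumptions, since the actual shipped configuration is the authority):

```properties
# overlord/runtime.properties (illustrative)
druid.service=druid/overlord
druid.port=8090

# Track tasks in the metadata store; run index tasks in the overlord process
druid.indexer.storage.type=metadata
druid.indexer.runner.type=local
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
```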
The cluster is spun up with the coordinator and broker running on the master node, and each slave running a historical daemon. ZooKeeper is set up on the master node. The machines have aliases for starting/stopping the Druid daemons. For indexing, the overlord is started on the master node.
Use the sparkline sub-folder to configure Spark and Sparkline. The sparkline.spark.properties file contains the parameters to configure the Spark cluster. The default EMR spark-defaults.conf is appended to the end of this file, so you don't need to specify log locations, EMR libraries, executors, memory, etc. This folder also contains the Sparkline jar that should be deployed.
On spin-up, the Sparkline-enhanced Spark Thriftserver is started on the master node using the provided sparkline.spark.properties.
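As a sketch, sparkline.spark.properties might contain only a handful of overrides, since EMR's spark-defaults.conf supplies the executor, memory, and log settings. These are standard Spark properties; the values are illustrative:

```properties
# sparkline/sparkline.spark.properties (illustrative)
# EMR's spark-defaults.conf is appended to this file, so executors,
# memory, and log locations don't need to be set here.
spark.master                     yarn
spark.sql.shuffle.partitions     8
spark.kryoserializer.buffer.max  512m
```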
If you point to a Druid metastore that already has segments defined, then on startup the historicals will start pulling these segments to local storage and begin serving the indexes defined. The deep storage must be accessible by the historicals, so make sure the S3 bucket has the right permissions. To start running SQL against these indexes, you need to define the raw table and the Druid DataSource in Spark. Use the DDL script to specify these; they are set up using the spark-sql CLI before starting the Thriftserver. Here is the DDL script for the TPCH-1 demo.
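As a sketch of the shape such a script takes — a raw table over the flattened dataset, then a Druid DataSource defined on top of it — something like the following, assuming the spark-csv package for the raw table. Paths, columns, and option values are illustrative; the actual TPCH-1 script is the authority:

```sql
-- Raw table over the flattened TPCH dataset (columns abridged)
CREATE TABLE orderLineItemPartSupplierBase (
  o_orderkey integer, l_quantity double, l_shipdate string
  -- ... remaining TPCH columns
)
USING com.databricks.spark.csv
OPTIONS (path "s3://<your-bucket>/tpch/datascale1/", header "false", delimiter "|");

-- Druid DataSource defined on the raw table
CREATE TABLE orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "<master-node-host>",
  zkQualifyDiscoveryNames "true"
);
```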