Cluster Spinup Tool
Use the Cluster Spinup Tool to bring up a Sparkline BI Accelerator cluster on AWS. The cluster is set up with EMR (Hadoop + Spark), Druid 0.9.0, and the Sparkline Accelerator. The cluster starts with Druid and the Sparkline Thriftserver running, optionally serving the Druid DataSource specified in the configuration files.
The tool is driven by a ConfigFile; use it to specify the following information:
- The Cluster Name
- The location of configuration files for Druid, Spark, and Sparkline.
- Your EC2 keypair
- An AWS SecurityGroupID for the machines spun up.
- The InstanceTypes for the master and slave nodes. For example, “m3.xlarge”.
- The size of the cluster.
- Bid price for the machines.
- The availability zone to spin up the machines in.
Here is an example ConfigFile to spin up a cluster for the TPCH-1 demo. The ConfigFile provides step-by-step documentation on each setting.
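As a rough illustration of what such a file covers, a ConfigFile for the settings listed above might look like the sketch below. The field names here are hypothetical, not the tool's actual schema; the shipped example ConfigFile is the authority on the real format.

```yaml
# Hypothetical ConfigFile sketch -- field names are illustrative only.
cluster_name: tpch-demo               # the cluster name
config_folder: configuration/         # Druid, Spark, and Sparkline config files
ec2_key_pair: my-keypair              # your EC2 keypair
security_group_id: sg-0123456789      # AWS SecurityGroupID for the machines
master_instance_type: m3.xlarge       # InstanceType for the master node
slave_instance_type: m3.xlarge        # InstanceType for the slave nodes
num_slaves: 4                         # the size of the cluster
bid_price: "0.10"                     # spot bid price for the machines
availability_zone: us-east-1a         # where to spin up the machines
```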
Detailed configurations live under a configuration folder, structured as follows:
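Based on the sub-folders described below, the layout is roughly:

```
configuration/
  _common/       # shared Druid settings: extensions, ZooKeeper, metadata store, deep storage
  coordinator/   # runtime.properties and JVM settings
  broker/        # runtime.properties and JVM settings
  historical/    # runtime.properties and JVM settings
  overlord/      # runtime.properties and JVM settings
  sparkline/     # sparkline.spark.properties, the Sparkline jar, and the DDL script
```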
Here is a detailed configuration example for the TPCH-1 demo.
The folder contains the configurations for all the Druid daemons. See the Druid documentation for details on the different configuration options. You can set up all the configuration scripts here; they are then deployed to the cluster.
See the Druid documentation for details. These settings are picked up from the _common sub-folder. The Druid cluster is set up to run with EMR Hadoop and also has the s3, hdfs, and mysql extensions installed, so you can enable these in the _common settings. You can point to an existing MySQL instance (the spinup tool doesn't install MySQL).
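For illustration, a minimal _common configuration wired to ZooKeeper on the master node, an existing MySQL metadata store, and S3 deep storage might look like this. These are standard Druid 0.9.0 property names, but all hosts, buckets, and credentials are placeholders:

```properties
# _common/common.runtime.properties (illustrative values)
druid.extensions.loadList=["druid-s3-extensions","druid-hdfs-storage","mysql-metadata-storage"]

# ZooKeeper runs on the master node
druid.zk.service.host=<master-node-host>

# Point to an existing MySQL instance (the spinup tool doesn't install MySQL)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://<mysql-host>:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>

# S3 deep storage -- the bucket must be readable by the historicals
druid.storage.type=s3
druid.storage.bucket=<your-bucket>
druid.storage.baseKey=druid/segments
```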
See the Druid documentation for details. The coordinator and broker sub-folders have their runtime.properties and JVM settings.
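As a sketch, minimal runtime.properties for these two daemons might look like the following. The property names are standard Druid; ports and thread counts are illustrative:

```properties
# coordinator/runtime.properties (illustrative)
druid.service=druid/coordinator
druid.port=8081
druid.coordinator.startDelay=PT30S

# broker/runtime.properties (illustrative)
druid.service=druid/broker
druid.port=8082
druid.broker.http.numConnections=20
druid.processing.numThreads=7
```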
See the Druid documentation for details. We set up the slaves with 2 mount points (/mnt, /mnt1) to use as Druid local storage. The processing threads, HTTP threads, and JVM settings should be adjusted based on the class of machine used. The historical sub-folder has the runtime.properties and JVM settings.
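A sketch of a historical runtime.properties using the two mount points might look like this. Sizes and thread counts are illustrative and should be tuned to the machine class:

```properties
# historical/runtime.properties (illustrative)
druid.service=druid/historical
druid.port=8083

# The two mount points used as Druid local segment storage
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":100000000000},{"path":"/mnt1/druid/segment-cache","maxSize":100000000000}]
druid.server.maxSize=200000000000

# Tune these to the machine class
druid.processing.numThreads=7
druid.server.http.numThreads=25
```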
See the Druid documentation for details. Druid is set up to run indexing using Hadoop, and the Hadoop client is set up to work with EMR. The overlord sub-folder has the runtime.properties and JVM settings.
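For illustration, an overlord runtime.properties along these lines (standard Druid 0.9.0 property names; the runner type and paths are assumptions, since the actual shipped configuration is the authority):

```properties
# overlord/runtime.properties (illustrative)
druid.service=druid/overlord
druid.port=8090

# Track tasks in the metadata store; run index tasks in the overlord process
druid.indexer.storage.type=metadata
druid.indexer.runner.type=local
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
```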
The cluster is spun up with the coordinator and broker running on the master node, and each slave running a historical daemon. ZooKeeper is set up on the master node. The machines have aliases for starting/stopping the Druid daemons. For indexing, the overlord is started on the master node.
Use the sparkline sub-folder to configure Spark and Sparkline. The sparkline.spark.properties file contains the parameters to configure the Spark cluster. The default EMR spark-defaults.conf is appended to the end of this file, so you don't need to specify log locations, EMR libraries, executors, memory, etc. This folder also contains the Sparkline jar that should be deployed.
On spin-up, the Sparkline-enhanced Spark Thriftserver is started on the master node using the provided sparkline.spark.properties.
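As a sketch, sparkline.spark.properties might contain only a handful of overrides, since EMR's spark-defaults.conf supplies the executor, memory, and log settings. These are standard Spark properties; the values are illustrative:

```properties
# sparkline/sparkline.spark.properties (illustrative)
# EMR's spark-defaults.conf is appended to this file, so executors,
# memory, and log locations don't need to be set here.
spark.master                     yarn
spark.sql.shuffle.partitions     8
spark.kryoserializer.buffer.max  512m
```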
If you point to a Druid metastore that already has segments defined, then on startup the historicals will start pulling these segments to local storage and begin serving the indexes defined. The deep storage must be accessible by the historicals, so make sure the S3 bucket has the right permissions. To start running SQL against these indexes, you need to define the raw table and the Druid DataSource in Spark. Use the DDL script to specify these; they are set up using the spark-sql CLI before starting the Thriftserver. Here is the DDL script for the TPCH-1 demo.
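As a sketch of the shape such a script takes — a raw table over the flattened dataset, then a Druid DataSource defined on top of it — something like the following, assuming the spark-csv package for the raw table. Paths, columns, and option values are illustrative; the actual TPCH-1 script is the authority:

```sql
-- Raw table over the flattened TPCH dataset (columns abridged)
CREATE TABLE orderLineItemPartSupplierBase (
  o_orderkey integer, l_quantity double, l_shipdate string
  -- ... remaining TPCH columns
)
USING com.databricks.spark.csv
OPTIONS (path "s3://<your-bucket>/tpch/datascale1/", header "false", delimiter "|");

-- Druid DataSource defined on the raw table
CREATE TABLE orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "<master-node-host>",
  zkQualifyDiscoveryNames "true"
);
```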