Generating Denormalized TPCH Dataset
These instructions are for creating the flattened dataset when running locally (in a developer environment).

- Use the TPCH DBGen tool to generate the TPCH dataset for a given data scale. Data scale 1 should be more than enough for a dev environment (a sketch of this step appears after the steps below).
- Clone and build the tpch utils package (you need sbt installed for this). To build, issue the commands:
  `cd tpchData; sbt clean compile package`
- Download a Spark version. As of this writing, we have tested with spark-1.5.2 (a download sketch also appears after the steps below).
- Issue the following to create the flattened dataset:
bin/spark-submit \
--packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2,SparklineData:spark-druid-olap:0.0.2 \
--class org.sparklinedata.tpch.TpchGenMain \
/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar \
--baseDir /Users/hbutani/tpch/ --scale 1
where:

- "/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar" is the location of the tpch-utils jar.
- "/Users/hbutani/tpch/" is the location of the TPCH data. Under this folder there are one or more data scale folders whose names are of the form "datascale%n" (e.g. "datascale1").
- The flattened dataset is written to a subfolder named "orderLineItemPartSupplierCustomer" under the data scale folder. A quick way to verify the output is sketched below.
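The following is a minimal sketch of the DBGen step (step 1 above). The `-s` scale-factor flag is standard DBGen usage, but the directory layout — placing the raw `.tbl` files in a `datascale1` folder under the base directory — is an assumption based on the "datascale%n" convention described above; adjust paths to your environment.

```sh
# Build DBGen from the TPC-H tools distribution (requires a makefile configured for your platform).
cd tpch-dbgen
make

# Generate the dataset at scale factor 1. DBGen writes one .tbl file per TPC-H table
# (customer, lineitem, nation, orders, part, partsupp, region, supplier).
./dbgen -s 1

# Place the raw files under the base directory passed as --baseDir, following the
# 'datascale%n' folder convention described above (here, scale 1 -> datascale1).
mkdir -p /Users/hbutani/tpch/datascale1
mv *.tbl /Users/hbutani/tpch/datascale1/
```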
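For the Spark download (step 3), one common choice is a pre-built binary distribution from the Apache archive. The specific download URL and Hadoop build below are assumptions — any spark-1.5.2 binary distribution should work.

```sh
# Download and unpack a Spark 1.5.2 binary distribution.
wget https://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -xzf spark-1.5.2-bin-hadoop2.6.tgz

# The spark-submit command above is issued from the unpacked distribution directory,
# i.e. bin/spark-submit resolves to spark-1.5.2-bin-hadoop2.6/bin/spark-submit.
cd spark-1.5.2-bin-hadoop2.6
```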
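Once the spark-submit job completes, a quick sanity check is to inspect the output folder. The part-file naming below is Spark's default output convention, not something this page specifies, so treat it as an assumption.

```sh
# Confirm the flattened dataset was written under the data scale folder.
ls /Users/hbutani/tpch/datascale1/orderLineItemPartSupplierCustomer/

# Peek at the first few rows of the output (part-* is Spark's default output file naming).
head -5 /Users/hbutani/tpch/datascale1/orderLineItemPartSupplierCustomer/part-*
```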