GitHub - Stratio/tpcds: TPC-DS benchmarks including data generation with Spark and queries with Spark

forked from JonathanMace/tpcds

Usage:

To compile, invoke

	mvn clean package

For convenience, set TPCDS_WORKLOAD_GEN to the directory where this git repository is checked out, e.g.:

    export TPCDS_WORKLOAD_GEN=~/tpcds

To generate data with Spark:

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSDataGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar
	
To run the query workload with Spark:

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSWorkloadGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar
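
The two spark-submit invocations above differ only in the main class, so they can be wrapped in one small helper. The following is a minimal sketch, not part of this repository: the function names, the SPARK_HOME variable, and the DRY_RUN switch are assumptions introduced for illustration, while the class names and jar path are taken from the commands above.

```shell
# Sketch only: SPARK_HOME and DRY_RUN are assumptions, not options of this project.
TPCDS_CLASS_PREFIX="edu.brown.cs.systems.tpcds.spark"

tpcds_jar() {
  # Jar path as shown in the README, relative to the checkout directory.
  echo "${TPCDS_WORKLOAD_GEN:-$HOME/tpcds}/target/spark-workloadgen-5.0-jar-with-dependencies.jar"
}

tpcds_submit() {
  # $1: SparkTPCDSDataGenerator or SparkTPCDSWorkloadGenerator
  jar="$(tpcds_jar)"
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar (run 'mvn clean package' first)" >&2
    return 1
  fi
  # Rebuild the argument list as the full spark-submit command.
  set -- "${SPARK_HOME:-.}/bin/spark-submit" --class "${TPCDS_CLASS_PREFIX}.$1" "$jar"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 tpcds_submit SparkTPCDSDataGenerator` prints the command that would be executed, which is a cheap way to check the paths before submitting a job.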

The TPC-DS data set is controlled by a number of configuration options.  Most of these are inherited from Databricks spark-sql-perf, which we use to generate the TPC-DS data.

The options of interest are as follows:

 - scaleFactor specifies the dataset size.  A scale factor of n generates approximately n GB of data.  Most data formats compress this quite effectively, so the data will appear smaller on disk (e.g., Parquet or ORC can compress by a factor of approximately 4).
 - dataLocation specifies the location of the dataset.  Typically this will be in HDFS, and you can specify HDFS file locations as normal (e.g., hdfs://<hostname>:<port>/<path>).
 - dataFormat specifies the format in which to store the data.  "parquet" and "orc" are good choices with high compression; "text" is also supported.
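
As a rough capacity-planning aid, the scale factor and the approximate compression factor quoted above can be combined into a back-of-the-envelope estimate. This is a sketch under stated assumptions: the function name is hypothetical, the factor of 4 for Parquet and ORC is the approximation given above, and treating "text" as uncompressed is an assumption.

```shell
estimate_disk_gb() {
  # $1: scaleFactor (roughly the raw size in GB), $2: dataFormat
  case "$2" in
    parquet|orc) ratio=4 ;;   # approximate compression factor quoted above
    text)        ratio=1 ;;   # assumption: text is stored uncompressed
    *) echo "unknown format: $2" >&2; return 1 ;;
  esac
  awk -v n="$1" -v r="$ratio" 'BEGIN { printf "%.1f\n", n / r }'
}
```

For example, `estimate_disk_gb 100 parquet` suggests about 25 GB on disk for a scale factor of 100. Actual sizes depend on the data and the codec, so treat the result as an order-of-magnitude guide.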

The full set of configuration options, with their default values, is as follows:

	tpcds {
	    scaleFactor = 1
	    dataLocation = "hdfs://127.0.0.1:9000/tpcds"
	    dataFormat = "parquet"
	    overwrite = false
	    partitionTables = true
	    useDoubleForDecimal = false
	    clusterByPartitionColumns = false
	    filterOutNullPartitionValues = false
	    numPartitions = 1000
	    usePartitionColumns = false
	}
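
The README does not say how to override these defaults. The block syntax resembles a Typesafe Config (HOCON) file, and if the project loads its settings that way, JVM system properties of the form `-Dtpcds.<key>=<value>` would normally take precedence over the defaults. That mechanism is an assumption, not something this repository confirms. Under that assumption, a small hypothetical helper can build the property string:

```shell
# Assumption: settings are read via Typesafe Config, so -Dtpcds.* system
# properties override the defaults. The README does not confirm this.
tpcds_java_opts() {
  # Each argument is key=value, e.g. scaleFactor=10
  opts=""
  for kv in "$@"; do
    opts="${opts:+$opts }-Dtpcds.${kv}"
  done
  echo "$opts"
}
```

The result could then be passed to the driver with spark-submit's `--driver-java-options` flag, for instance `--driver-java-options "$(tpcds_java_opts scaleFactor=10 dataFormat=orc)"`. Verify against the source in `src/main` that the overrides are picked up before relying on this.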
	
We provide a couple of useful command-line utilities, which are generated into the folder `target/appassembler/bin`:

 - list-queries lists the available queries.  Queries are grouped into benchmarks: since multiple people have implemented variants of the original TPC-DS queries, we have included several of these variants here.  The command takes zero or one arguments.  With zero arguments, it lists the available benchmarks; with one argument, it either lists the queries in a benchmark or prints a query.  The impala-tpcds-modified-queries benchmark is a set of 20 selected queries that several prior works have used for benchmarking with Spark.
 - dsdgen is a wrapper around the dsdgen utility that TPC provides.  This package comes with precompiled dsdgen binaries for Linux and Mac, which we use for data generation.
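
The argument convention of list-queries described above can be captured in a thin wrapper. This is only a sketch: the wrapper function and the TPCDS_BIN variable are hypothetical conveniences, and the output format of list-queries itself is not shown here because the README does not document it.

```shell
# Folder where the utilities are generated after 'mvn clean package'.
TPCDS_BIN="${TPCDS_WORKLOAD_GEN:-$HOME/tpcds}/target/appassembler/bin"

list_queries() {
  # No argument: list the available benchmarks.
  # One argument: list a benchmark, or print a query.
  if [ "$#" -gt 1 ]; then
    echo "usage: list_queries [benchmark-or-query]" >&2
    return 2
  fi
  "$TPCDS_BIN/list-queries" "$@"
}
```

For example, `list_queries` lists the benchmarks, and `list_queries impala-tpcds-modified-queries` lists the queries in that benchmark.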