Credits: Most scripts have been referenced from Fivetran DW Benchmark and have been adapted to suit our particular usecase.
- Move
dsdgen
to a GCS bucket to a specific location as mentioned in the bootstrap script - Create a High CPU
VM
eg. 16vCPU - Clone this repository
git clone $REPO_URL
- Give all script files executable permission
chmod +x *.sh
- Run
bootstrap.sh
- This pulls
dsdgen
binary - Installs Google Fuse; this is to mount GCS bucket as a local folder - More info
- This pulls
- Run
data_gen.sh
Usage:./data_gen.sh $CPU $SCALE
- This is responsible for generating data
$CPU
denotes the amount of parallelism must be > 1$SCALE
denotes the scale of data that needs to be generated- This creates and mounts a GCS Bucket and writes data to it
- NOTE: Ensure that
$CPU
is close to number of CPUs in VM for efficient parallel generation
- Run
load_data.sh
Usage:./load_data.sh $SCALE
- This is responsible of loading data in GCP buckets created in step 5 to BigQuery
$SCALE
denotes the scale of data that needs to be loaded to BigQuery- Note: Before running this step ensure that data is generated and present in the appropriate GCS Bucket
- Run
benchmark.sh
Usage:./benchmark.sh $SCALE
- This is responsible for running TPC-DS queries and measuring query execution time
- Generates a
csv
file inresults
folder containing the querystart_time
andend_time
- Saves query statistics in the same directory