Project Themis is an open benchmarking framework for big data solutions.
Themis consists of 5 layers:
Themis spec define necessary component for benchmarking, such as generative dataset, benchmark workflow and reporting. There are implementations of the spec in different languages such as Python and Java.
Artifacts are materialized implementation of Themis spec that are required for benchmark execution, such as a Jar for Spark data generation application or a text file for a set of SQL queries to benchmark against.
A Python CLI is used to orchestrate each step in a benchmark execution.
Apache Airflow is used to define a complete benchmark workflow as a DAG, and automate the orchestration by directly invoking the CLI to execute commands in the DAG.
AWS CDK integration is provided for bootstrapping a Themis benchmark environment on AWS.
Build shadow jar:
./gradlew :themis-spark-iceberg:shadowJar
Upload to S3:
aws s3 cp spark-iceberg/build/libs/themis-spark-iceberg-0.1.0-all.jar s3://yzhaoqin-iceberg-test/themis/themis-spark-iceberg-0.1.0-all.jar
Create EMR Spark cluster, make sure Glue for Spark and Hive catalog is enabled.
Then run EMR step using command-runner.jar
with the following configs input:
spark-submit
--class io.themis.spark.iceberg.IcebergDatagen
s3://yzhaoqin-iceberg-test/themis/themis-spark-iceberg-0.1.0-all.jar
--gen random
--database-name themis_simple
--database-location s3://yzhaoqin-iceberg-test/themis/themis_simple
--drop-database-if-exists
--parallelism 100
--row-count 10000
--replace-tables
You will see 3 tables corresponding to the ones defined in io.themis.datagen.SimpleTables