This is a derived work from https://github.com/hortonworks/hive-testbench
This benchmark requires Apache Kylin 2.0 or later. The goal at this stage is functionality rather than performance. Enabling all TPC-H queries is a new milestone of Apache Kylin's SQL capability.
## Prepare TPCH test data

- Follow the steps to set up the test data in Hive. For example, generate a minimal data set with ScaleFactor=2:

      git clone https://github.com/hortonworks/hive-testbench.git
      cd hive-testbench/
      ./tpch-build.sh
      ./tpch-setup.sh 2

- Verify that the TPCH tables exist in Hive (a row-count sanity check is sketched after this list):

      echo "use tpch_flat_orc_2; show tables;" | hive
## Import Kylin model

- Get this repository:

      git clone https://github.com/Kyligence/kylin-tpch.git

- Make sure there is a running Kylin instance:

      export KYLIN_HOME=...

- Run `setup-kylin-model.sh` with the SAME scale factor that was used to generate the Hive data, for example:

      cd kylin-tpch
      ./setup-kylin-model.sh 2

  The script also creates a few simple views on top of the original TPCH tables so that Kylin can pre-calculate some complex measures (a quick way to verify them is sketched after this list). The resulting E-R model topology is identical to the original TPCH model.
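The check below is a minimal sketch, not part of the original steps: it assumes ScaleFactor=2, that the views are created in the same `tpch_flat_orc_2` database as the base tables, and the `v_*` naming used elsewhere in this README (e.g. `v_lineitem`).

```bash
# List the views the setup script created on top of the TPCH tables.
echo "use tpch_flat_orc_2; show tables like 'v_*';" | hive
```
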
## Build Kylin cube

- In the Kylin web UI, click "System->Reload Metadata" to refresh the newly imported TPCH model.
- Select the project "tpch" and find the 4 cubes below:
  - lineitem_cube
  - partsupp_cube
  - customer_cube
  - customer_vorder_cube

  Together, `lineitem_cube` and `partsupp_cube` cover 20 of the total 22 TPCH queries. Besides, `customer_vorder_cube` is for query 13 and `customer_cube` is for query 22.
- Build all cubes within the year range from 1992 to 1998, either through the web UI or through the REST API (a sketch follows this list).
- Verify by counting `v_lineitem` in Kylin. The row count must match the Hive table `lineitem`:

      select count(*) from v_lineitem
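Triggering the per-year segment builds from the command line can be scripted roughly as below. This is a sketch, not part of the original instructions: it assumes the default ADMIN/KYLIN credentials, a Kylin instance at 127.0.0.1:7070, GNU `date`, and the Kylin 2.x `PUT /kylin/api/cubes/{cube}/rebuild` REST endpoint; adjust all of these to your environment and Kylin version.

```bash
# Submit one build segment per year (1992..1998) for lineitem_cube via the REST API.
for YEAR in 1992 1993 1994 1995 1996 1997 1998; do
  START=$(( $(date -u -d "${YEAR}-01-01" +%s) * 1000 ))        # segment start, epoch millis
  END=$((   $(date -u -d "$((YEAR + 1))-01-01" +%s) * 1000 ))  # segment end, epoch millis
  curl -s -X PUT -u ADMIN:KYLIN -H 'Content-Type: application/json' \
    -d "{\"startTime\": ${START}, \"endTime\": ${END}, \"buildType\": \"BUILD\"}" \
    "http://127.0.0.1:7070/kylin/api/cubes/lineitem_cube/rebuild"
  echo
  # In practice, wait for each build job to finish (or at least be accepted) before
  # submitting the next segment; the other three cubes can be built through the web UI
  # or the same endpoint.
done
```
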
## Run TPCH queries

- Find the TPCH queries under the `queries` folder. Run them via the Kylin web UI, or via the Kylin REST API / JDBC / ODBC (a REST example is sketched after this list). The queries are slightly adapted to leverage the views created in Step 2 and to work around limitations like KYLIN-2341.
- The original queries (from hive-testbench) can be found at `queries/original-queries`. The original and the adapted queries should give the same results in Hive and Kylin respectively.
- There is a tool, `tools/query-tool.py`, that runs all queries one by one. For Kylin, it prints the query duration. For Hive, it prints the query duration and the query results.
  - Install dependencies:

        pip install requests

  - Run the Kylin queries:

        python tools/query-tool.py -s http://127.0.0.1:7070/kylin -u ADMIN -p KYLIN -d queries -o tpch -r 3 -t kylin

  - Run the Hive queries:

        python tools/query-tool.py -t hive -c 10
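A minimal sketch of the REST route mentioned above, assuming the default ADMIN/KYLIN credentials and a Kylin instance at 127.0.0.1:7070; replace the inline SQL with the contents of any adapted query file from the `queries` folder.

```bash
# Submit a single SQL statement to Kylin's query endpoint and print the JSON response.
curl -s -X POST -u ADMIN:KYLIN -H 'Content-Type: application/json' \
  -d '{"sql": "select count(*) from v_lineitem", "project": "tpch"}' \
  http://127.0.0.1:7070/kylin/api/query
```
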
This benchmark was tested in two environments: one is the HDP sandbox, the other is a 4-node CDH cluster. Each query was run multiple times, and the final stable response time was recorded for both Kylin and Hive.

Again, the goal at this stage is SQL capability rather than performance. No special performance tuning was done on either Kylin or Hive.

By the way, KyBot is a great tool if you want to analyze Kylin cube and query performance. Upload a Kylin diagnosis pack to this online service and it can show you why a query runs slowly or why a cube takes long to build. KyBot is free for trial.
| Cube Build Stats | HDP Sandbox (SF=2) | CDH 4-Node Cluster (SF=10) |
|---|---|---|
| lineitem_cube | 20.1 GB, 125 mins * 7 | 85.8 GB, 54.69 mins * 7 |
| partsupp_cube | 1.6 GB, 27.02 mins | 8.3 GB, 18.48 mins |
| customer_vorder_cube | 283 MB, 16.88 mins | 1.39 GB, 13.7 mins |
| customer_cube | 7 MB, 3.95 mins | 36.98 MB, 3.97 mins |
(The lineitem_cube was built incrementally; each increment builds one year's data.)

| Query Response Time | Kylin on HDP (SF=2) | Hive+Tez on HDP (SF=2) | Kylin on CDH (SF=10) | Hive+MR on CDH (SF=10) |
|---|---|---|---|---|
| query01 | 0.43 sec | 13.79 sec | 0.45 sec | 43.45 sec |
| query02 | 8.87 sec | 17.04 sec | 8.77 sec | 3570 sec |
| query03 | 5.2 sec | 16.32 sec | 4.84 sec | 137.4 sec |
| query04 | 0.71 sec | 31.23 sec | 2.69 sec | 90.99 sec |
| query05 | 0.33 sec | 23.68 sec | 0.47 sec | 162.79 sec |
| query06 | 0.34 sec | 8.91 sec | 0.39 sec | 23.29 sec |
| query07 | 0.17 sec | 35.23 sec | 0.24 sec | 245.52 sec |
| query08 | 7.42 sec | 19.4 sec | 8.82 sec | 2248.5 sec |
| query09 | 10.4 sec | 30.75 sec | 34.8 sec | 19869 sec |
| query10 | 21.1 sec | 19.03 sec | 11.91 sec | 120.61 sec |
| query11 | 3.42 sec | 15.87 sec | 2.09 sec | 123.3 sec |
| query12 | 7.66 sec | 12.64 sec | 5.73 sec | 96.34 sec |
| query13 | 20.44 sec | 24.61 sec | 104.77 sec | 135.64 sec |
| query14 | 1.1 sec | 4.54 sec | 1.56 sec | 54.42 sec |
| query15 | 1.96 sec | 13.06 sec | 17.02 sec | 124.29 sec |
| query16 | 3.69 sec | 17.36 sec | 10.16 sec | 144.34 sec |
| query17 | 5.49 sec | 25.61 sec | 10.19 sec | 183.56 sec |
| query18 | 22.43 sec | 65.32 sec | 74.88 sec | 231.9 sec |
| query19 | 21.9 sec | 22.5 sec | 138.81 sec | 186.87 sec |
| query20 | 6.46 sec | 16.21 sec | 42.02 sec | 229.01 sec |
| query21 | 58.6 sec | 84.6 sec | 136.3 sec | 276.59 sec |
| query22 | 23.21 sec | 21.79 sec | 102.81 sec | 154.48 sec |

## HDP Sandbox Configuration

- HDP sandbox (version 2.4.0.0-169) VM on VirtualBox
- 4 vcores, 10 GB memory, ScaleFactor=2
- Because of the limited resources, in-depth tuning is required to run through the benchmark
- More resources for Tez
  - hive.tez.java.opts -Xmx512m
  - Tez container heap 1 GB
- More resources for HBase
  - HBase region server heap 4 GB
- Kylin goes easy on HBase (these settings go into kylin.properties; see the sketch after this list)
  - kylin.storage.hbase.coprocessor-timeout-seconds=3600
  - kylin.storage.hbase.max-hconnection-threads=5
  - kylin.storage.hbase.core-hconnection-threads=5
  - kylin.storage.hbase.coprocessor-mem-gb=30
  - kylin.storage.partition.aggr-spill-enabled=false
  - kylin.storage.hbase.compression-codec=snappy
  - kylin.query.scan-threshold=1000000000
- YARN config
  - 5 GB memory, 3 vcores
- MR config
  - mapreduce.job.reduce.slowstart.completedmaps=1
  - Enable compression: org.apache.hadoop.io.compress.SnappyCodec
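Where these Kylin settings live is not spelled out above; as a minimal sketch, assuming the standard `$KYLIN_HOME/conf/kylin.properties` location and the `kylin.sh` start/stop script that ships with Kylin, they can be applied like this (in practice, edit the file rather than blindly appending, to avoid duplicate keys):

```bash
# Append the Kylin tuning settings used on the HDP sandbox and restart Kylin.
cat >> "$KYLIN_HOME/conf/kylin.properties" <<'EOF'
kylin.storage.hbase.coprocessor-timeout-seconds=3600
kylin.storage.hbase.max-hconnection-threads=5
kylin.storage.hbase.core-hconnection-threads=5
kylin.storage.hbase.coprocessor-mem-gb=30
kylin.storage.partition.aggr-spill-enabled=false
kylin.storage.hbase.compression-codec=snappy
kylin.query.scan-threshold=1000000000
EOF
"$KYLIN_HOME/bin/kylin.sh" stop
"$KYLIN_HOME/bin/kylin.sh" start
```
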
## CDH 4-Node Cluster Configuration

- Cloudera 5.8.3
- 1 master (32 vcores, 192 GB memory), 3 slaves (32 vcores, 96 GB memory)
- ScaleFactor = 10
- YARN config
  - 200 GB memory, 128 vcores
- Kylin config
  - kylin.storage.hbase.compression-codec=snappy
  - kylin.query.scan-threshold=1000000000
  - kylin.storage.hbase.coprocessor-mem-gb=30
  - Kylin JVM settings: -Xms4g -Xmx8g