[FEA] Create NDS-H benchmark for performance analysis #182

mattahrens · 2024-03-06T16:26:12Z

I would like to add another benchmark to the repository to support additional workloads for comparison. Different partners use the TPC-H benchmark for comparison, so we should enable execution of a TPC-H-like workload benchmark. The requirements are similar to what we have for NDS:

Data generation

P0: Support generation of raw data at various scale factors
P0: Support conversion of raw data to Parquet
P1: Support conversion of raw data to ORC
P1: Support conversion of raw data to CSV
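
As a rough illustration of where the conversion step would start, here is a minimal sketch that assumes the raw data is the pipe-delimited `.tbl` output of the TPC-H dbgen tool, where every row ends with a trailing `|`. The helper names `parse_tbl_line` and `tbl_to_csv` are hypothetical and are not part of this repository. Only the CSV target is shown, since Parquet and ORC output need an engine or library beyond the standard library.

```python
import csv
import io

# Columns of the TPC-H `region` table, used here only as a small example.
REGION_COLUMNS = ["r_regionkey", "r_name", "r_comment"]


def parse_tbl_line(line, columns):
    """Split one pipe-delimited row into a dict, dropping the trailing delimiter."""
    fields = line.rstrip("\n").split("|")
    # Rows are assumed to end with '|', which yields one empty trailing field.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


def tbl_to_csv(tbl_text, columns):
    """Convert raw .tbl text to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for line in tbl_text.splitlines():
        if line:  # skip blank lines
            writer.writerow(parse_tbl_line(line, columns))
    return out.getvalue()
```

A real implementation would stream files at large scale factors rather than hold the text in memory. The sketch only shows the row format that any converter has to handle.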

Query generation

P0: Support generation of queries at various scale factors

Power run execution

P0: Support execution of full query set given a specified input path
P1: Support execution of individual query given a specific query and input path
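
The two power-run requirements above could be sketched roughly as follows. This is an assumption-laden outline, not the NDS implementation: `run_power_run`, `select_queries`, and the caller-supplied `run_query` function are hypothetical names, and the input path is assumed to be bound inside `run_query` already (for example, tables registered from that path).

```python
import time


def select_queries(queries, only=None):
    """Return the full query set, or just the named query for the single-query case."""
    if only is None:
        return dict(queries)
    if only not in queries:
        raise KeyError(f"unknown query: {only}")
    return {only: queries[only]}


def run_power_run(queries, run_query):
    """Run each query once, in order, and record its wall-clock time in milliseconds.

    `queries` maps a query name to its SQL text. `run_query` is a
    caller-supplied (hypothetical) function that executes one SQL string.
    """
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        run_query(sql)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return timings
```

For example, `run_power_run(select_queries(queries, "query5"), run_query)` would cover the P1 case of running one specific query, while omitting the name runs the full set.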

We can add additional requirements once the initial NH scripts are set up to more closely match how we execute NDS.

Relevant links of other repos that execute TPC-H workloads:

Disclaimers for TPC-H:

TPC-H is Copyright © 1993-2024 Transaction Processing Performance Council. The full TPC-H specification in PDF format can be found here
TPC, TPC Benchmark, and TPC-H are trademarks of the Transaction Processing Performance Council.

wjxiz1992 · 2024-03-13T06:36:36Z

Hi Matt, after some discussion with @GaryShen2008, there are several things to confirm:

Do we need the latest TPC-H tool version?
If we want to be able to execute the whole TPC-H benchmark as soon as possible, we can leverage https://github.com/databricks/spark-sql-perf?tab=readme-ov-file#tpc-h directly to generate TPC-H data and run TPC-H queries. But note that the TPC-H version it uses is still v2.4.0, while the latest is v3.0.1. I do see a bunch of patches listed in the TPC-H specification PDF, so I think the older version is an issue.
If we want to use the latest TPC-H tool, the effort will be similar to the one for NDS.
Code structure change
There is some NDS-specific code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_gen_data.py#L42-L68 but also a lot of general code like https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_power.py#L125-L135. Right now all of it lives under the NDS folder. If we want clean code, a refactor will be necessary. But if we have a short-term goal, for example being able to run TPC-H ASAP, we can just create an NDH folder and put in existing code like https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/TPC-multi_datagen.scala along with some simple wrapper code to make it work.

These are the current gaps we see according to previous related work.

mattahrens · 2024-03-13T13:56:40Z

Yes, let's use the latest version of the TPC-H tool. I believe the other repo links I provided in the issue description may be using the latest version.
Let's start by just bringing up the NH benchmark, and then we can refactor to have common utilities between NDS and NH.

mattahrens added the "? - Needs Triage" label (Need team to review and classify) Mar 6, 2024

GaryShen2008 assigned yinqingh Mar 26, 2024

mattahrens removed the "? - Needs Triage" label (Need team to review and classify) Apr 10, 2024

mattahrens changed the title ~~[FEA] Create NH benchmark for performance analysis~~ [FEA] Create NDS-H benchmark for performance analysis May 21, 2024

bilalbari assigned bilalbari and unassigned yinqingh May 22, 2024

wjxiz1992 mentioned this issue Jun 4, 2024

Tpc-H feature branch #187

Merged

mattahrens unassigned bilalbari Nov 12, 2024
