Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ClickBench in bench.sh #7005

Merged
merged 2 commits into from
Jul 20, 2023
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 17, 2023

Which issue does this PR close?

Part of #6994

Rationale for this change

See #6994

The ClickBench benchmarks are both widely used in the industry and focus on grouping / aggregation / filtering: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

What changes are included in this PR?

Add support in bench.sh for downloading the ClickBench dataset (both single file as well as partitioned)

You can download it now via:

./bench.sh data clickbench_1 # single parquet file
./bench.sh data clickbench_partitioned # 100 parquet file partitioned dataset

Example

./bench.sh data clickbench_partitioned
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: clickbench_partitioned
DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --profile release-nonlto
***************************
Checking hits_partitioned... downloading with 10 parallel workers.................................................................................................... Done

I also plan to update the runner so it can run the clickbench queries, but will do so as a follow on PR

./bench.sh run clickbench_1 # single parquet file
./bench.sh run clickbench_partitioned # 100 parquet file partitioned dataset

Are these changes tested?

Tested manually

Are there any user-facing changes?

No, this is a development tool

@github-actions github-actions bot added the core Core DataFusion crate label Jul 17, 2023
@github-actions github-actions bot removed the core Core DataFusion crate label Jul 17, 2023
@alamb alamb changed the title Alamb/clickbench data Add support for ClickBench in bench.sh Jul 17, 2023
@alamb alamb added the development-process Related to development process of DataFusion label Jul 17, 2023
@alamb alamb marked this pull request as ready for review July 17, 2023 17:46
Copy link
Contributor

@appletreeisyellow appletreeisyellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for adding the benchmark shell for ClickBench! Looks good to me! I have some questions about the echo message below. They are not blocking, just want to understand them

Comment on lines +352 to +356
if test "${OUTPUT_SIZE}" = "14779976446"; then
echo -n "... found ${OUTPUT_SIZE} bytes ..."
else
URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
echo -n "... downloading ${URL} (14GB) ... "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do you mean the other way? Since OUTPUT_SIZE is constant in the if statement and dynamic in the else statement
Suggested change
if test "${OUTPUT_SIZE}" = "14779976446"; then
echo -n "... found ${OUTPUT_SIZE} bytes ..."
else
URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
echo -n "... downloading ${URL} (14GB) ... "
if test "${OUTPUT_SIZE}" = "14779976446"; then
echo -n "... downloading ${URL} (14GB) ... "
else
URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
echo -n "... found ${OUTPUT_SIZE} bytes ..."
  1. Just curious, why the unit is byte in line 353 and KB in line 376? Should they be the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, they should both by bytes (KB was left over from a previous implementation). Nice eyes

Copy link
Contributor Author

@alamb alamb Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the other way? Since OUTPUT_SIZE is constant in the if statement and dynamic in the else statement

I think this is the right way -- basically this code is testing if a file with the expected size already exists, and if so it doesn't download it again.

This means that you can run bench.sh download clickbench_1 and be sure you have the most recent data but not have to download all 14GB if it already exists

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that you can run bench.sh download clickbench_1 and be sure you have the most recent data but not have to download all 14GB if it already exists

Ah, got it! Thank you for the explanation

benchmarks/bench.sh Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the development-process Related to development process of DataFusion label Jul 20, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 20, 2023

Thank you @alamb for adding the benchmark shell for ClickBench! Looks good to me! I have some questions about the echo message below. They are not blocking, just want to understand them

Thank you very much for the review @appletreeisyellow

@alamb alamb merged commit eb9a702 into apache:main Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants