Add support for ClickBench in bench.sh #7005
Thank you @alamb for adding the benchmark shell for ClickBench! Looks good to me! I have some questions about the echo messages below. They are not blocking; I just want to understand them.
    if test "${OUTPUT_SIZE}" = "14779976446"; then
        echo -n "... found ${OUTPUT_SIZE} bytes ..."
    else
        URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
        echo -n "... downloading ${URL} (14GB) ... "
- Do you mean the other way? Since OUTPUT_SIZE is constant in the if statement and dynamic in the else statement
Current:

    if test "${OUTPUT_SIZE}" = "14779976446"; then
        echo -n "... found ${OUTPUT_SIZE} bytes ..."
    else
        URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
        echo -n "... downloading ${URL} (14GB) ... "

Suggested:

    if test "${OUTPUT_SIZE}" = "14779976446"; then
        echo -n "... downloading ${URL} (14GB) ... "
    else
        URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
        echo -n "... found ${OUTPUT_SIZE} bytes ..."
- Just curious, why is the unit byte in line 353 and KB in line 376? Should they be the same?
Yes, you are right, they should both be bytes (KB was left over from a previous implementation). Nice eyes.
> Do you mean the other way? Since OUTPUT_SIZE is constant in the if statement and dynamic in the else statement
I think this is the right way -- basically this code is testing if a file with the expected size already exists, and if so it doesn't download it again.
This means that you can run bench.sh download clickbench_1 and be sure you have the most recent data, without having to download all 14GB again if it already exists.
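For readers following the thread, the check described above can be sketched roughly as follows. This is a minimal illustration, not the code in bench.sh: the helper names file_size_bytes and fetch_if_needed and the FETCH_CMD override are hypothetical, and curl is assumed as the default downloader.

```shell
# Hypothetical helper: print the size of a file in bytes, or nothing if it is missing.
file_size_bytes() {
    if [ -f "$1" ]; then
        # tr strips the padding some wc implementations add
        wc -c < "$1" | tr -d ' '
    fi
}

# Hypothetical helper: download a URL to a path unless a file of the
# expected size (in bytes) is already there.
#   $1 = destination path, $2 = expected size in bytes, $3 = URL
# FETCH_CMD is an assumption for illustration; it defaults to curl.
fetch_if_needed() {
    fin_path="$1"
    fin_expected="$2"
    fin_url="$3"
    fin_actual="$(file_size_bytes "$fin_path")"
    if [ "$fin_actual" = "$fin_expected" ]; then
        # Same size as expected: treat the existing file as complete
        echo "found ${fin_actual} bytes, skipping download"
    else
        # Missing or wrong size: fetch it (again)
        echo "downloading ${fin_url}"
        ${FETCH_CMD:-curl -fL -o} "$fin_path" "$fin_url"
    fi
}
```

With this shape, the "found" message belongs in the branch where the size matches and the "downloading" message in the other branch, which is the ordering the PR uses. Note that a size comparison only detects missing or truncated files; it would not notice a changed file of the same size.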
> This means that you can run bench.sh download clickbench_1 and be sure you have the most recent data but not have to download all 14GB if it already exists
Ah, got it! Thank you for the explanation
Thank you very much for the review @appletreeisyellow
Which issue does this PR close?
Part of #6994
Rationale for this change
See #6994
What changes are included in this PR?
Add support in bench.sh for downloading the ClickBench dataset (both the single-file and the partitioned versions).
You can download it now via:
Example
I also plan to update the runner so it can run the ClickBench queries, but will do so as a follow-on PR.
Are these changes tested?
Tested manually
Are there any user-facing changes?
No, this is a development tool