[GLUTEN-3922][CH] Fix incorrect shuffle hash id value when executing modulo #3923

Merged
merged 1 commit into from
Dec 7, 2023

Conversation

@zzcclp zzcclp commented Dec 5, 2023

What changes were proposed in this pull request?

Fix the incorrect shuffle hash id produced by the modulo operation. In the CH backend, the shuffle split num is a UInt32 and the hash function returns a UInt64. When the hash value is greater than 2^31 - 1, the result of taking the hash value modulo the shuffle split num differs from the result in vanilla Spark.
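
A minimal sketch of the mismatch described above. It assumes the pre-fix code applied unsigned `%` to the UInt64 hash, and that vanilla Spark narrows the hash to a signed 32-bit value and takes the non-negative remainder. The function names and the sample hash are hypothetical, chosen only for illustration; this is not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed pre-fix behaviour: the UInt64 hash is taken modulo the UInt32
// split count using unsigned arithmetic.
constexpr uint32_t unsignedPartitionId(uint64_t hash, uint32_t parts)
{
    return static_cast<uint32_t>(hash % parts);
}

// Assumed vanilla Spark semantics: the hash is a signed 32-bit value and the
// partition id is the non-negative remainder.
constexpr int32_t sparkPartitionId(uint64_t hash, uint32_t parts)
{
    // Narrow both operands to Int32; hashes above 2^31 - 1 become negative.
    const auto hash_int32 = static_cast<int32_t>(hash);
    const auto parts_int32 = static_cast<int32_t>(parts);
    auto res = hash_int32 % parts_int32;
    if (res < 0)
        res += parts_int32;
    return res;
}

// Hypothetical example: a hash of exactly 2^31 with 3 partitions.
void demoMismatch()
{
    constexpr uint64_t hash = 2147483648ULL;
    assert(unsignedPartitionId(hash, 3) == 2); // plain unsigned remainder
    assert(sparkPartitionId(hash, 3) == 1);    // -2147483648 % 3 == -2, then + 3
}
```

For hashes at or below 2^31 - 1 the two functions agree, which is why the discrepancy only shows up for large hash values.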

Close #3922.

(Fixes: #3922)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@zzcclp zzcclp requested a review from liuneng1994 December 5, 2023 06:16

github-actions bot commented Dec 5, 2023

#3922

github-actions bot commented Dec 5, 2023

Run Gluten Clickhouse CI

auto res = hash_int32 % parts_num_int32;
if (res < 0)
{
    res += parts_num_int32;
}

Use the pmod and cast functions in CH. Do not repeat already implemented logic.
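
To illustrate what a cast followed by pmod computes, here is a standalone sketch. The review asks for ClickHouse's built-in functions; the `pmod` helper and `shufflePartitionId` below are hypothetical stand-ins written in plain C++, not ClickHouse's function API, and they assume a positive divisor.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: non-negative remainder for signed integers (n > 0).
template <typename T>
constexpr T pmod(T a, T n)
{
    static_assert(std::is_signed<T>::value, "pmod expects a signed integer type");
    const T r = a % n; // C++ remainder takes the sign of the dividend
    return r < 0 ? r + n : r;
}

// Cast step, then pmod: narrow the UInt64 hash and the UInt32 split count to
// Int32 before taking the non-negative remainder.
constexpr int32_t shufflePartitionId(uint64_t hash, uint32_t parts_num)
{
    return pmod(static_cast<int32_t>(hash), static_cast<int32_t>(parts_num));
}

void demoPmod()
{
    assert(pmod(-7, 3) == 2);
    assert(shufflePartitionId(0xFFFFFFFFULL, 4) == 3); // Int32 value is -1
}
```

Composing the two steps this way removes the hand-written `if (res < 0)` branch from the call site, which appears to be the point of the review comment.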

zzcclp commented Dec 5, 2023

perf tests:
main + cityHash64: 127, 121 (mean, min)
main + murmur3: 123, 118
pr-3923 + cityHash64: 129, 121
pr-3923 + murmur3: 124, 118

zzcclp commented Dec 6, 2023

Run Gluten Clickhouse CI

liuneng1994 previously approved these changes Dec 7, 2023
@liuneng1994 liuneng1994 left a comment

LGTM

zzcclp commented Dec 7, 2023

perf tests:
main + cityHash64: 134, 120 (mean, min)
main + murmur3: 127, 118
pr-3923 + cityHash64: 121, 118
pr-3923 + murmur3: 122, 122

…modulo

Fix incorrect shuffle hash id value when executing modulo.
In the CH backend, the shuffle split num is a UInt32 and the hash function returns a UInt64. When the hash value is greater than 2^31 - 1, the result of taking the hash value modulo the shuffle split num differs from the result in vanilla Spark.

Close apache#3922.

zzcclp commented Dec 7, 2023

perf tests:
main + cityHash64: 133, 123 (mean, min)
pr-3923 + murmur3: 121, 117

@baibaichen baibaichen left a comment

LGTM

@baibaichen baibaichen merged commit 9c5314f into apache:main Dec 7, 2023
7 checks passed
loneylee pushed a commit to loneylee/gluten that referenced this pull request Dec 7, 2023
…modulo (apache#3923)

Fix incorrect shuffle hash id value when executing modulo.
In the CH backend, the shuffle split num is a UInt32 and the hash function returns a UInt64. When the hash value is greater than 2^31 - 1, the result of taking the hash value modulo the shuffle split num differs from the result in vanilla Spark.

Close apache#3922.
@GlutenPerfBot
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_master_12_07_2023_time.csv | log/native_master_12_06_2023_d6b1e9138_time.csv | difference | percentage |
|-------|------|------|------|------|
| q1 | 33.97 | 33.13 | -0.839 | 97.53% |
| q2 | 24.95 | 25.88 | 0.936 | 103.75% |
| q3 | 37.91 | 36.12 | -1.793 | 95.27% |
| q4 | 37.39 | 38.73 | 1.336 | 103.57% |
| q5 | 71.30 | 72.19 | 0.881 | 101.24% |
| q6 | 5.38 | 7.01 | 1.635 | 130.39% |
| q7 | 85.81 | 84.85 | -0.963 | 98.88% |
| q8 | 85.80 | 87.08 | 1.279 | 101.49% |
| q9 | 123.51 | 127.07 | 3.562 | 102.88% |
| q10 | 45.87 | 44.57 | -1.293 | 97.18% |
| q11 | 20.29 | 20.26 | -0.032 | 99.84% |
| q12 | 26.73 | 21.15 | -5.579 | 79.13% |
| q13 | 45.73 | 48.39 | 2.661 | 105.82% |
| q14 | 16.69 | 18.86 | 2.168 | 112.98% |
| q15 | 28.46 | 27.18 | -1.278 | 95.51% |
| q16 | 15.56 | 15.92 | 0.370 | 102.38% |
| q17 | 103.37 | 102.54 | -0.832 | 99.19% |
| q18 | 149.01 | 152.02 | 3.011 | 102.02% |
| q19 | 12.84 | 13.90 | 1.055 | 108.21% |
| q20 | 27.78 | 27.91 | 0.128 | 100.46% |
| q21 | 223.73 | 223.84 | 0.104 | 100.05% |
| q22 | 13.39 | 13.24 | -0.146 | 98.91% |
| total | 1235.47 | 1241.84 | 6.370 | 100.52% |

loneylee pushed a commit to loneylee/gluten that referenced this pull request Dec 8, 2023
…modulo (apache#3923)

Fix incorrect shuffle hash id value when executing modulo.
In the CH backend, the shuffle split num is a UInt32 and the hash function returns a UInt64. When the hash value is greater than 2^31 - 1, the result of taking the hash value modulo the shuffle split num differs from the result in vanilla Spark.

Close apache#3922.
Successfully merging this pull request may close these issues.

[CH] Fix incorrect shuffle hash id value when executing modulo
5 participants