-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Java integers hashing in javaHash
#41131
Support Java integers hashing in javaHash
#41131
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
javaHash
javaHash
javaHash
@alexey-milovidov Pls review the PR, when you have a changce. |
src/Functions/FunctionsHashing.h
Outdated
static ReturnType apply(int64_t x) | ||
{ | ||
int64_t copy = x; | ||
copy = copy >> 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not confident about the correctness of this code:
For unsigned a and for signed and non-negative a, the value of a >> b is the integer part of a/2b.
For negative a, the value of a >> b is implementation-defined (in most implementations, this performs arithmetic right shift, so that the result remains negative).
https://en.cppreference.com/w/cpp/language/operator_arithmetic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A random page on the internet is saying that
>>>
is an unsigned right shift operator
that is different from your code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- According the doc, Since C++20
>>
is arithmetic right shift
Since C++20
The value of a >> b is a/2b, rounded down (in other words, right shift on signed a is arithmetic right shift).
>>>
is implemented bycopy >> 32
andcopy & 0x00000000FFFFFFFF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
123 | ||
122 | ||
-539222985 | ||
-539222986 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any online playground or a quick code snippet to validate that this is definitely the same result as Java gives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I did not find any online tool. But we can generate Java hash code simplly by Long.hashCode()
.
{ | ||
const size_t size = sizeof(T); | ||
const char * data = reinterpret_cast<const char *>(&x); | ||
return apply(data, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not sure about this code.
Now we apply javaHash for unsigned integers as for strings for bytes in the native byte order.
How unsigned integers are typically represented in Java?
If they are represented as BigInt, let's check what the behavior should be.
If they don't exist, let's throw an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for advice. Java does not have unsigned types.
Ok, and let's update the test, and it is ready... |
The test still has to be updated. |
Still some test failures but may be unrelated. |
Changelog category (leave one):
Usage scenario
We use Spark to import data into CH local table and data is distributed by
id.hashcode() % shard_num
. And now we want to import via CH distributed table.So we need a Java int hash function .
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support Java integers hashing in
javaHash
.