Support Java integers hashing in javaHash #41131

Merged

Conversation

JackyWoo
Contributor

@JackyWoo JackyWoo commented Sep 9, 2022

Changelog category (leave one):

  • New Feature

Usage scenario

We use Spark to import data into ClickHouse local tables, and the data is distributed by id.hashCode() % shard_num. Now we want to import via a ClickHouse Distributed table.

So we need a Java int hash function.

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support Java integers hashing in javaHash.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Sep 9, 2022
@evillique evillique self-assigned this Sep 9, 2022
@evillique evillique added the can be tested Allows running workflows for external contributors label Sep 9, 2022
Member

@alexey-milovidov alexey-milovidov left a comment


.

@alexey-milovidov alexey-milovidov self-assigned this Sep 10, 2022
@JackyWoo JackyWoo changed the title Add function javaIntHash Support unsigned Java integers hashing in javaHash Sep 13, 2022
@JackyWoo JackyWoo changed the title Support unsigned Java integers hashing in javaHash Support Java integers hashing in javaHash Sep 13, 2022
@evillique evillique removed their assignment Sep 14, 2022
@JackyWoo
Contributor Author

@alexey-milovidov Please review the PR when you have a chance.

static ReturnType apply(int64_t x)
{
int64_t copy = x;
copy = copy >> 32;
Member

@alexey-milovidov alexey-milovidov Sep 27, 2022


I'm not confident about the correctness of this code:

For unsigned a and for signed and non-negative a, the value of a >> b is the integer part of a/2^b.

For negative a, the value of a >> b is implementation-defined (in most implementations, this performs arithmetic right shift, so that the result remains negative).

https://en.cppreference.com/w/cpp/language/operator_arithmetic

Member


A random page on the internet is saying that

>>> is an unsigned right shift operator

that is different from your code.

Contributor Author


  1. According to the doc, since C++20 >> on a signed value is an arithmetic right shift:

Since C++20
The value of a >> b is a/2^b, rounded down (in other words, right shift on signed a is arithmetic right shift).

  2. >>> is implemented by copy >> 32 followed by copy & 0x00000000FFFFFFFF
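To make the equivalence concrete, here is a minimal sketch (not the code from this PR) comparing the two ways of emulating Java's `>>> 32` on a 64-bit value: an arithmetic shift followed by a mask, and a shift performed on the unsigned type. The function names `shiftThenMask` and `shiftUnsigned` are made up for illustration. Before C++20 the result of `>>` on a negative signed value is implementation-defined, so the first variant relies on the compiler doing an arithmetic shift.

```cpp
#include <cstdint>

// Arithmetic right shift, then mask away the sign-extended upper bits.
constexpr int64_t shiftThenMask(int64_t x)
{
    return (x >> 32) & 0x00000000FFFFFFFFLL;
}

// Logical right shift, done by going through the unsigned type.
// This form does not depend on how the compiler shifts negative values.
constexpr int64_t shiftUnsigned(int64_t x)
{
    return static_cast<int64_t>(static_cast<uint64_t>(x) >> 32);
}

// Compile-time checks on negative inputs, where the two shifts could differ.
static_assert(shiftThenMask(-1) == 0xFFFFFFFFLL, "upper 32 bits of -1");
static_assert(shiftThenMask(-1) == shiftUnsigned(-1), "variants agree on -1");
static_assert(shiftThenMask(INT64_MIN) == shiftUnsigned(INT64_MIN), "variants agree on INT64_MIN");
```

The unsigned variant may be the safer choice when the code has to build with a pre-C++20 standard.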

Member


Ok.

123
122
-539222985
-539222986
Member


Is there any online playground or a quick code snippet to validate that this is definitely the same result as Java gives?

Contributor Author


Unfortunately, I did not find any online tool. But we can generate the Java hash code simply with Long.hashCode().
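For a quick cross-check without a Java runtime, the following is a small sketch that mirrors the documented definition of `Long.hashCode`, `(int)(value ^ (value >>> 32))`, in C++. The name `javaLongHashCode` is hypothetical and this is not the function added by this PR; the expected values in the checks were worked out by hand from that formula. It needs C++14 or later, and before C++20 the narrowing conversion to `int32_t` is implementation-defined (two's-complement wraparound on mainstream compilers).

```cpp
#include <cstdint>

// Mirrors Java's Long.hashCode: XOR the upper and lower 32-bit halves,
// then keep the low 32 bits as a signed int.
constexpr int32_t javaLongHashCode(int64_t value)
{
    const uint64_t bits = static_cast<uint64_t>(value);
    // Shifting the unsigned value plays the role of Java's >>>.
    return static_cast<int32_t>(static_cast<uint32_t>(bits ^ (bits >> 32)));
}

// Hand-computed cases.
static_assert(javaLongHashCode(0) == 0, "zero");
static_assert(javaLongHashCode(123) == 123, "small positive values hash to themselves");
static_assert(javaLongHashCode(-1) == 0, "all-ones halves cancel out");
static_assert(javaLongHashCode(4294967296LL) == 1, "only bit 32 set");
static_assert(javaLongHashCode(INT64_MIN) == INT32_MIN, "only the sign bit set");
```

The same inputs can be passed to `Long.hashCode()` in any Java environment to confirm the results match.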

{
const size_t size = sizeof(T);
const char * data = reinterpret_cast<const char *>(&x);
return apply(data, size);
Member


Also not sure about this code.

Now we apply javaHash to unsigned integers the same way as to strings: over their bytes in the native byte order.

How are unsigned integers typically represented in Java?
If they are represented as BigInteger, let's check what the behavior should be.
If they don't exist, let's throw an exception.

Contributor Author


Thanks for the advice. Java does not have unsigned types.
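One way the suggestion could look in code is sketched below. This is an illustration under stated assumptions, not the implementation merged in this PR: the name `javaIntegerHash` and the choice of `std::invalid_argument` are hypothetical. It relies on the Java definitions that `Byte.hashCode`, `Short.hashCode` and `Integer.hashCode` return the value widened to `int`, and that `Long.hashCode` returns `(int)(value ^ (value >>> 32))`. It needs C++17 for `if constexpr`.

```cpp
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical helper: hash a signed integer the way the matching Java
// wrapper type would, and reject unsigned types, which Java does not have.
template <typename T>
int32_t javaIntegerHash(T x)
{
    static_assert(std::is_integral_v<T>, "integer types only");

    if constexpr (std::is_unsigned_v<T>)
    {
        // No Java counterpart exists, so there is no "correct" answer.
        throw std::invalid_argument("Java has no unsigned integer types");
    }
    else if constexpr (sizeof(T) <= 4)
    {
        // Byte, Short and Integer hash to their own value.
        return static_cast<int32_t>(x);
    }
    else
    {
        // Long: XOR of the upper and lower 32-bit halves.
        const uint64_t bits = static_cast<uint64_t>(x);
        return static_cast<int32_t>(static_cast<uint32_t>(bits ^ (bits >> 32)));
    }
}
```

For example, `javaIntegerHash<int8_t>(-1)` returns -1, while `javaIntegerHash<uint32_t>(1u)` throws.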

@alexey-milovidov
Member

Ok, and let's update the test, and it is ready...

@alexey-milovidov
Member

The test still has to be updated.

@JackyWoo
Contributor Author

JackyWoo commented Oct 2, 2022

There are still some test failures, but they may be unrelated.

@alexey-milovidov alexey-milovidov merged commit 0d1d177 into ClickHouse:master Oct 2, 2022
@JackyWoo JackyWoo deleted the add_function_java_int_hash branch October 22, 2022 14:49
4 participants