Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hashid support #37013

Merged
merged 21 commits into from
May 21, 2022
Merged

Add hashid support #37013

merged 21 commits into from
May 21, 2022

Conversation

mnutt
Copy link
Contributor

@mnutt mnutt commented May 7, 2022

Changelog category:

  • New Feature

Changelog entry:

Add support for calculating hashids from unsigned integers.

SELECT
    number,
    hashid(number, 's3cr3t', 16)
FROM system.numbers
LIMIT 5

┌─number─┬─hashid(number, 's3cr3t', 16)─┐
│      0 │ lw3nk9P0xPKdzDq0             │
│      1 │ vGVRrLe9Aek83z1A             │
│      2 │ lJVdwEjOKjKBOQNY             │
│      3 │ JM5Q0gPy9P7az4rk             │
│      4 │ EM9gY0po5jmWO5Rn             │
└────────┴──────────────────────────────┘

We use UInt64 ids internally but expose hashids to end users and it would be simpler and faster to have ClickHouse produce them.

This is my first time working in the ClickHouse codebase so I tried to write it based on how other functions are written however I wasn't totally sure if my approach to variadic function argument checking was the best way to to do it, and if there would be any problems with the method it uses to support const/non-const UInt8/UInt16/UInt32/UInt64.

I added some stateless tests, and once everything else looks good I can update the contrib docs.

@CLAassistant
Copy link

CLAassistant commented May 7, 2022

CLA assistant check
All committers have signed the CLA.

@mnutt
Copy link
Contributor Author

mnutt commented May 7, 2022

One open question I had was if the ENABLE_HASHIDSXX flags were necessary. (in which case I'll guard against it in the function registration)

@robot-clickhouse robot-clickhouse added pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR. labels May 7, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 7, 2022
@alexey-milovidov
Copy link
Member

The code looks good overall but it contains unnecessary copy from temporary object of std::string type. It may also lead to unnecessary allocation if string is larger than around 24 bytes.

@alexey-milovidov
Copy link
Member

If alphabet is known in compile time, division can be optimized: https://github.com/schoentoon/hashidsxx/blob/master/hashids.cpp#L114

@alexey-milovidov
Copy link
Member

@mnutt
Copy link
Contributor Author

mnutt commented May 9, 2022

@alexey-milovidov thank you for taking a look! So I understand about the unnecessary allocation, are you referring to the std::string created inside the for loop?

Regarding the alphabet, the default alphabet is known at compile time but the user can pass their own alphabet, is it worth special-casing the case where people use the default? It doesn't look like that string in https://github.com/schoentoon/hashidsxx/blob/master/hashids.cpp#L112 can ever grow past 16 characters so should get SSO right?

I did a little profiling and unsurprisingly the vast majority of time is spent in hashid's _reorder. I'm not sure if there's a whole lot I can do to optimize that, but I'm definitely open to suggestions.

@mnutt
Copy link
Contributor Author

mnutt commented May 14, 2022

Does anyone have ideas about these few build failures? As far as I can tell they seem to come from:

Error response from daemon: Cannot kill container: 067df0f9972a: tried to kill container, but did not receive an exit event
docker: Error response from daemon: Conflict. The container name "/clickhouse_integration_tests" is already in use by container "067df0f9972a103b2065db8c16962dde63ef59540d1ea5ec6f0a1f74128844d1". You have to remove (or rename) that container to be able to reuse that name.

Is this a flakiness issue, or is there an actual problem elsewhere?

@yakov-olkhovskiy yakov-olkhovskiy self-assigned this May 17, 2022
@yakov-olkhovskiy
Copy link
Member

yakov-olkhovskiy commented May 17, 2022

@mnutt I don't think ENABLE_HASHIDSXX is really necessary here.
Also errors in CI - I don't see anything related to this PR so far.

src/Functions/FunctionHashID.h Outdated Show resolved Hide resolved
src/Functions/FunctionHashID.h Show resolved Hide resolved
src/Functions/FunctionHashID.h Outdated Show resolved Hide resolved
contrib/hashidsxx-cmake/CMakeLists.txt Outdated Show resolved Hide resolved
@yakov-olkhovskiy
Copy link
Member

@mnutt and we have additional objective for this issue - we want to add setting allow_experimental_hash_functions, and put this function under this setting

@mnutt mnutt force-pushed the hashid branch 2 times, most recently from 623cdd9 to f6156c3 Compare May 19, 2022 00:04
newline at end of file added
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants