Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat][io] Upgarde ClickHouse driver to support loadbalance policy #18774

Merged
merged 2 commits into from
Dec 21, 2022

Conversation

tisonkun
Copy link
Member

@tisonkun tisonkun commented Dec 7, 2022

This closes #12998.

Documentation

May or may not we add some docs on this feature. I'm not sure.

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun requested review from dlg99 and nlu90 December 7, 2022 00:10
@tisonkun tisonkun self-assigned this Dec 7, 2022
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 7, 2022
@codecov-commenter
Copy link

codecov-commenter commented Dec 7, 2022

Codecov Report

Merging #18774 (909e870) into master (b36e012) will decrease coverage by 1.30%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18774      +/-   ##
============================================
- Coverage     47.74%   46.43%   -1.31%     
+ Complexity    10623    10425     -198     
============================================
  Files           703      703              
  Lines         68828    68851      +23     
  Branches       7381     7382       +1     
============================================
- Hits          32861    31974     -887     
- Misses        32307    33276     +969     
+ Partials       3660     3601      -59     
Flag Coverage Δ
unittests 46.43% <ø> (-1.31%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...er/metadata/v2/TransactionBufferSnapshotIndex.java 0.00% <0.00%> (-77.78%) ⬇️
...pache/pulsar/broker/tools/GenerateDocsCommand.java 0.00% <0.00%> (-77.28%) ⬇️
.../apache/pulsar/utils/CmdGenerateDocumentation.java 0.00% <0.00%> (-76.93%) ⬇️
.../metadata/v2/TransactionBufferSnapshotSegment.java 0.00% <0.00%> (-75.00%) ⬇️
...ourcegroup/ResourceUsageTopicTransportManager.java 0.00% <0.00%> (-74.79%) ⬇️
...ava/org/apache/pulsar/broker/tools/BrokerTool.java 0.00% <0.00%> (-71.43%) ⬇️
.../metadata/v2/TransactionBufferSnapshotIndexes.java 0.00% <0.00%> (-71.43%) ⬇️
...n/java/org/apache/pulsar/PulsarVersionStarter.java 0.00% <0.00%> (-42.31%) ⬇️
...che/pulsar/broker/resourcegroup/ResourceGroup.java 28.76% <0.00%> (-40.07%) ⬇️
...sar/broker/resourcegroup/ResourceGroupService.java 37.77% <0.00%> (-35.87%) ⬇️
... and 79 more

@@ -38,7 +38,7 @@
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>ru.yandex.clickhouse</groupId>
<groupId>com.clickhouse</groupId>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's better to use shaded jar?

com.clickhouse clickhouse-jdbc 0.3.2-patch11 all * *

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review @zhicwu!

What's the most common dependency conflict that can be avoided by using the shaded jar?

I can see using the basic JAR, the final connector NAR is in size 9.7M, and after switching to the shaded jar, it grows to 46M.

If no user reports the current solution's issue, I tend to keep it as is.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the most common dependency conflict that can be avoided by using the shaded jar?

LZ4 and Gson. Put native library aside, if there's custom classloader for isolation, it should not be a problem.

I can see using the basic JAR, the final connector NAR is in size 9.7M, and after switching to the shaded jar, it grows to 46M.

Did you exclude all dependencies when specifying classifier? I'm not familiar with the Nifi maven plugin, but the JDBC driver with only http implementation is ~1MB. -all contains more like gRPC etc., so it's 20+ MB.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you exclude all dependencies when specifying classifier?

No. I'm unsure what dependencies should be excluded XD.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of them because you only need the shaded jar :)

<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch11</version>
    <classifier>all</classifier>
    <exclusions>
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhicwu That's it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it reduces to 25M

@zhicwu
Copy link

zhicwu commented Dec 8, 2022

Hi @tisonkun, com.clickhouse/clickhouse-jdbc:0.3.2-patch11 is a complete rewrite, so not surprisingly it's not fully backward-compatible. It seems ClickHouseJdbcAutoSchemaSink is the only usage without any test. As a result, we cannot tell if anything broken from CI. So I'm listing key changes here to help you understand and estimate the impact of upgrading the driver:

  1. Apache Http Client 4.x was replaced by Http(s)URLConnection, meaning http properties and proxies are now considered which may cause problem within a complex runtime
  2. Data format between client and server was changed from TabSeparated to RowBinary, meaning we no longer can pass null to a non-nullable column
  3. Type mapping and timezone handling were changed as well, meaning same query may give us different results but I guess it should be mainly about timestamp with timezone
  4. Introduced non-standard table types in meta data: DICTIONARY, LOG TABLE, MEMORY TABLE, REMOTE TABLE, SYSTEM TABLE, TEMPORARY TABLE, so you may find some tables disappeared

Having said that, since v0.3.2 is packed with both legacy and new JDBC drivers(ru.yandex.clickhouse.ClickHouseDriver vs. com.clickhouse.jdbc.ClickHouseDriver), you may add an option to let user to choose during the transition period. Starting from v0.3.3, there'll be no legacy driver anymore.

Lastly, some more comments for you to consider regarding ClickHouse sink:

  • support nested data types like Array, Tuple, Map, JSON
  • ClickHouse supports multiple data formats and Avro is one of them, so maybe we don't have to convert Avro data type to JDBC type and then ClickHouse data type

@tisonkun
Copy link
Member Author

tisonkun commented Dec 8, 2022

@zhicwu Thanks for your explanation! I think most of the usages of ClickHouse JDBC Driver are managed within the Pulsar IO framework, so the impact should also be managed.

Still, it doesn't tell why we should use a shaded dependency here - what's the benefit? The NAR file size is a downsize, BTW.

@tisonkun tisonkun requested a review from nicoloboschi December 8, 2022 09:24
Copy link
Member

@nlu90 nlu90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some test to capture potential broken changes.

@tisonkun
Copy link
Member Author

@nlu90 Agree. Test coverage is lack for this connector. I'm trying to add some.

@tisonkun
Copy link
Member Author

Hi @zhicwu! Let me ask some questions in details:

Data format between client and server was changed from TabSeparated to RowBinary, meaning we no longer can pass null to a non-nullable column

What's the behavior if we pass null to a non-nullable column previously?

Type mapping and timezone handling were changed as well, meaning same query may give us different results but I guess it should be mainly about timestamp with timezone

Is there a CHANGELOG for the details how the mapping and handling change?

@zhicwu
Copy link

zhicwu commented Dec 16, 2022

What's the behavior if we pass null to a non-nullable column previously?

Before v0.3.2, it depends on table structure(whether the column has default value) and server setting(whether input_format_null_as_default is 1, which is the default). Starting from v0.3.2, you'll end up with a SQLException complaining you should not insert null into non-nullable column, unless client option nullAsDefault is set to 2.

Is there a CHANGELOG for the details how the mapping and handling change?

Unfortunately the answer is no. We use to have a CHANGELOG but it did not cover details like that. As I mentioned above, the impact to JDBC driver should be mainly about timestamp with timezone.

@tisonkun & @nlu90, with no doubt that adding tests is needed. However, I'd suggest to start with acceptance tests from pulsar's point of view, instead of being distracted by compatibility issue of specific driver ;)

@tisonkun
Copy link
Member Author

@zhicwu Thanks for your explanation! I think it's good to merge now.

Merging...

@codelipenghui @eolivelli @Technoboy- FYI release note required on the next release.

@tisonkun tisonkun merged commit 5ea6790 into apache:master Dec 21, 2022
@tisonkun tisonkun deleted the clickhouse-loadbalance branch December 21, 2022 04:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pulsar IO support Clickhouse Load Balance
4 participants