
Support multiple zookeeper quorum to store cluster-management-configu… #196

Closed
wants to merge 1 commit

Conversation

rdhabalia
Contributor

…ration and ledger-metadata separately

Motivation

Pulsar uses a single ZooKeeper ensemble to store both cluster-management and ledger information. When the number of topics hosted by Pulsar becomes significantly high (> 1 million), the ZooKeeper data footprint and the request load on ZooKeeper increase, especially during a cold restart, and this sometimes affects broker availability.
Since the data needed to keep the cluster up and running (clusterManagementConfiguration) is relatively small compared to the ledger information, using a separate ZooKeeper ensemble to store this information will increase the availability of the cluster.

Modifications

Added an optional dataZookeeperServers setting to the service configuration, introducing a separate data ZooKeeper ensemble (dataZk) that stores ledger metadata independently of the cluster-management data.
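For illustration only, here is a minimal sketch (not the PR's actual wiring; the class name is made up) of a broker opening two independent ZooKeeper sessions, one to the cluster-management quorum and one to the optional data quorum, falling back to a single ensemble when dataZookeeperServers is left empty:

```java
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch; names and timeouts mirror the PR's settings, but this is not its code.
public class DualZkClientsSketch {
    public static void main(String[] args) throws Exception {
        String zookeeperServers = "zk1:2181,zk2:2181,zk3:2181";        // cluster-management quorum
        String dataZookeeperServers = "dzk1:2181,dzk2:2181,dzk3:2181"; // ledger-metadata quorum (optional)

        // Session for cluster-management configuration (ownership, load data, ...)
        ZooKeeper clusterZk = new ZooKeeper(zookeeperServers, 30_000, event -> {});

        // Fall back to the same ensemble when no separate data quorum is configured,
        // so existing single-ensemble deployments keep working unchanged.
        String dataQuorum = dataZookeeperServers.isEmpty() ? zookeeperServers : dataZookeeperServers;
        ZooKeeper dataZk = new ZooKeeper(dataQuorum, 60_000, event -> {});

        // ... clusterZk would serve cluster-management reads/writes, dataZk the ledger metadata ...
        dataZk.close();
        clusterZk.close();
    }
}
```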

Result

  • No impact on existing Pulsar clusters that use only one ensemble
  • The broker can store cluster-management configuration and ledger metadata in two separately provided ZooKeeper ensembles
  • The Pulsar service should come up even if the data ZK is not available, and it should continue to serve requests if the data ZK fails while the service is up.

@rdhabalia rdhabalia added the type/enhancement label Feb 8, 2017
@rdhabalia rdhabalia added this to the 1.17 milestone Feb 8, 2017
@rdhabalia rdhabalia self-assigned this Feb 8, 2017
@@ -906,6 +924,14 @@ public void setZooKeeperSessionTimeoutMillis(long zooKeeperSessionTimeoutMillis)
this.zooKeeperSessionTimeoutMillis = zooKeeperSessionTimeoutMillis;
}

public long getDataZooKeeperSessionTimeoutMillis() {
return dataZooKeeperSessionTimeoutMillis;
}
Contributor

Shouldn't we return zooKeeperSessionTimeout if dataZookeeperServers is empty?

Contributor Author

Actually, we initialize zooKeeperSessionTimeoutMillis=30000 and dataZooKeeperSessionTimeoutMillis=60000 as default values.
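For reference, the fallback the reviewer suggests could look roughly like the sketch below; this is a hypothetical standalone illustration, not what the PR currently does (the PR keeps the two defaults independent):

```java
// Hypothetical illustration of the suggested fallback; field names follow the PR's configuration.
public class SessionTimeoutFallbackSketch {
    private String dataZookeeperServers = "";                 // empty => no separate data quorum
    private long zooKeeperSessionTimeoutMillis = 30_000;      // default in the PR
    private long dataZooKeeperSessionTimeoutMillis = 60_000;  // default in the PR

    public long getDataZooKeeperSessionTimeoutMillis() {
        if (dataZookeeperServers == null || dataZookeeperServers.isEmpty()) {
            // No dataZk configured: reuse the main quorum's session timeout
            return zooKeeperSessionTimeoutMillis;
        }
        return dataZooKeeperSessionTimeoutMillis;
    }
}
```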

@@ -53,7 +54,11 @@

@Parameter(names = { "-zk",
"--zookeeper" }, description = "Local ZooKeeper quorum connection string", required = true)
private String zookeeper;
private String localZookeeper;
Contributor

Let us not use zookeeper and localZookeeper interchangeably. ServiceConfiguration refers to zookeeper. Let us stick to one naming convention.

Contributor Author

Sure. Updated it to zookeeper.

@saandrews
Contributor

👍 Can you check the build?

@merlimat
Contributor

@rdhabalia I'll get to review this one soon...

@merlimat
Contributor

merlimat commented Mar 5, 2017

The bigger question I have is how to take advantage of this, meaning which changes would be needed to "survive" a quorum loss on the data ZK and how to make sure that everything is back in order in every condition. :)

Anyway, is this still a big concern, given that apache/zookeeper#157 has been merged and will be available in zk-3.4.10?

@rdhabalia
Contributor Author

Here, the problem we are trying to fix is preventing a cold restart of the broker due to dataZK's large snapshot size:

  • Because dataZk has a larger snapshot size, the zk session at the broker frequently times out due to zk GC pauses or leader election. That triggers a cold restart of brokers, and all clients try to reconnect at the same time, which forces all brokers to load all topics simultaneously; this again creates significant back-pressure on ZooKeeper, which may again cause quorum loss in zk.

Therefore, if we have a separate dataZK and clusterManagementZK, then dataZk will not cause broker restarts, and the pulsar-broker can still survive without dataZk.

how to make sure that everything is back in order in every condition.

While dataZk is down, all clients fail when publishing/consuming and when creating producers/consumers. So there is no data loss; however, when dataZk comes back, that's the only signal we have that everything is back in order, and monitoring such as SLA-monitoring can help confirm the broker's stability.

Anyway, is this still a big concern, given that apache/zookeeper#157 has been merged and will be available in zk-3.4.10?

I think this ZK patch will help reduce leader-election time, and it will prevent zk-session timeouts at the broker and also prevent broker restarts. This will definitely help with this problem statement. But I think it's always good to protect the broker against zk-session timeouts.

@merlimat
Contributor

merlimat commented Mar 6, 2017

Therefore, if we have a separate dataZK and clusterManagementZK, then dataZk will not cause broker restarts, and the pulsar-broker can still survive without dataZk.

True, but that's not going to change much: you still cannot survive a quorum loss on clusterManagementZK, plus there would need to be significant changes in order to continue operating when losing the dataZk quorum.

While dataZk is down, all clients fail when publishing/consuming and when creating producers/consumers. So there is no data loss; however, when dataZk comes back, that's the only signal we have that everything is back in order, and monitoring such as SLA-monitoring can help confirm the broker's stability.

The tricky part is how to come back and have everything in sync between broker and ZK quorum. Eg: version numbers for transactions that were caught in-flight during the quorum loss.

I think this ZK patch will help reduce leader-election time, and it will prevent zk-session timeouts at the broker and also prevent broker restarts. This will definitely help with this problem statement. But I think it's always good to protect the broker against zk-session timeouts.

The point there is that, by making the leader election not dependent on the snapshot size, the chances of losing one or the other ZK quorum are not that different.

The only difference at that point would be the memory footprint of the quorum, and (apart from configuring the JVM heap accordingly) the easiest way to improve conditions there would be to use binary protobuf for the z-node content.

Even just doing that for ML and cursors (and, at least initially, not for BK ledgers) would considerably reduce the ZK data set size. I would expect the same info to take 1/10th of the size in binary vs text format.
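As a rough, self-contained illustration of the text-vs-binary gap (plain Java with an ad-hoc encoding, since protobuf-generated classes aren't shown here; the exact ratio depends on the field values):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Rough illustration only; the real metadata format would be protobuf, not this ad-hoc encoding.
public class TextVsBinarySizeSketch {
    public static void main(String[] args) {
        long ledgerId = 1234567890123L;
        long entryId = 987654321L;

        byte[] text = ("ledgerId: " + ledgerId + "\nentryId: " + entryId + "\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] binary = ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(ledgerId).putLong(entryId).array();

        System.out.println("text bytes:   " + text.length);   // several times larger
        System.out.println("binary bytes: " + binary.length); // 16 bytes; varints would shrink it further
    }
}
```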

@rdhabalia
Contributor Author

Ok, a few more points on it:

True, but that's not going to change much: you still cannot survive a quorum loss on clusterManagementZK

I think our main concern is that the larger footprint causes instability (longer GCs) and quorum loss in zk, which leads to cold restarts of the broker and has a recursive negative impact on both zk and the broker. So, clusterManagementZK will just store the ephemeral nodes such as bundle-ownership and load-balancer information, which keeps the snapshot size small, and that is the main way it prevents zk quorum loss.
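To make the ephemeral-node point concrete, here is a minimal sketch using the plain ZooKeeper API (the znode paths and payload are made up, not Pulsar's real layout): an ownership-style znode tied to the broker's session is removed automatically when that session ends, so it never accumulates in the snapshot the way ledger metadata does.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only; paths and data are hypothetical.
public class EphemeralOwnershipSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});

        // Parent znodes must exist; create them as persistent nodes for this sketch.
        if (zk.exists("/example", false) == null) {
            zk.create("/example", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists("/example/ownership", false) == null) {
            zk.create("/example/ownership", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral: ZooKeeper deletes this node when the broker's session ends.
        zk.create("/example/ownership/my-bundle",
                "broker-1.example.com:8080".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL);

        zk.close();
    }
}
```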

The tricky part is how to come back and have everything in sync between broker and ZK quorum. Eg: version numbers for transactions that were caught in-flight during the quorum loss.

This should be fixed by reading the latest zk-data and refreshing the version whenever the broker sees a bad-version error after the system comes back.
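A minimal sketch of that recovery path with the raw ZooKeeper API (the path and the recompute step are placeholders for the broker's real metadata update): on a bad-version error, re-read the znode, rebuild the payload on top of the current data, and retry with the fresh version.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only; the path and recompute() stand in for the broker's actual update logic.
public class BadVersionRetrySketch {
    static byte[] recompute(byte[] currentData) {
        return currentData; // placeholder: rebuild the intended update on top of the latest data
    }

    static void updateWithRetry(ZooKeeper zk, String path, byte[] newData, int expectedVersion)
            throws KeeperException, InterruptedException {
        try {
            zk.setData(path, newData, expectedVersion);
        } catch (KeeperException.BadVersionException e) {
            // Our cached version is stale (e.g. after the data quorum came back):
            // refresh from ZooKeeper and retry once with the current version.
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            zk.setData(path, recompute(current), stat.getVersion());
        }
    }
}
```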

@msb-at-yahoo
Contributor

Has anyone estimated the impact on snapshot size for workloads people care about as proposed in #281 (comment)?

@merlimat
Contributor

merlimat commented Mar 7, 2017

Also, it would be good to record worst-case GC pauses in ZK servers.

My suspicion is that, on a correctly configured machine (no swap, heap large enough for the data set), the GC pauses would be an order of magnitude shorter than what is needed to cause a zk server to be dropped from the quorum.

@merlimat
Contributor

merlimat commented Mar 7, 2017

Also, the risk of #281 is way lower compared to the risk involved in transitioning to two zk quorums.

@saandrews
Contributor

@merlimat, can we revisit this? We could still use this feature when we set up a new cluster.

@merlimat merlimat modified the milestones: 1.17, 1.18 Mar 31, 2017
@merlimat merlimat modified the milestones: 1.18, 1.19 Jun 14, 2017
sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
…ache#196)

* adding check for failures subroutine and fixing worker delete bug

* adding license header
@sijie
Member

sijie commented Dec 27, 2018

@rdhabalia are you still working on this pull request? or can it be closed?

hrsakai pushed a commit to hrsakai/pulsar that referenced this pull request Dec 10, 2020
* Fix consumer not found

Signed-off-by: xiaolong.ran <rxl@apache.org>

* fix ci error

Signed-off-by: xiaolong.ran <rxl@apache.org>
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
Master issue: apache#138

This PR adds tests for images of apache#195.

In addition, it fixes the local test error when running tests on macOS, because the containers cannot connect to the host's KoP or Kafka service listening on 127.0.0.1.



* Fix local tests error by listening to site local address

* Add tests for Kafka Java clients

* Remove the comments
@dave2wave
Member

This PR is evidently stale or abandoned. Reopen if this is not so.

@dave2wave dave2wave closed this Dec 16, 2021
@github-actions

@rdhabalia: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about docs, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks.)
