
Support multiple zookeeper quorum to store cluster-management-configu… #196

Closed
wants to merge 1 commit

Conversation

rdhabalia
Contributor

…ration and ledger-metadata separately

Motivation

Pulsar uses a single ZooKeeper ensemble to store both cluster-management and ledger information. When the number of topics hosted by Pulsar becomes significantly high (> 1 million), the ZooKeeper data footprint and the request load on ZooKeeper increase, especially during a cold restart, and this sometimes affects broker availability.
Since the data needed to keep the cluster up and running (clusterManagementConfiguration) is relatively small compared to the ledger information, using a separate ZooKeeper ensemble to store this information will increase the availability of the cluster.

Modifications

Added an optional dataZookeeperServers setting to the service configuration, introducing a separate data ZooKeeper ensemble (dataZk) that stores ledger metadata independently of the cluster-management data.
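For illustration only, here is a minimal sketch (not the PR's actual wiring; the class name is made up) of a broker opening two independent ZooKeeper sessions, one to the cluster-management quorum and one to the optional data quorum, falling back to a single ensemble when dataZookeeperServers is left empty:

```java
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch; names and timeouts mirror the PR's settings, but this is not its code.
public class DualZkClientsSketch {
    public static void main(String[] args) throws Exception {
        String zookeeperServers = "zk1:2181,zk2:2181,zk3:2181";        // cluster-management quorum
        String dataZookeeperServers = "dzk1:2181,dzk2:2181,dzk3:2181"; // ledger-metadata quorum (optional)

        // Session for cluster-management configuration (ownership, load data, ...)
        ZooKeeper clusterZk = new ZooKeeper(zookeeperServers, 30_000, event -> {});

        // Fall back to the same ensemble when no separate data quorum is configured,
        // so existing single-ensemble deployments keep working unchanged.
        String dataQuorum = dataZookeeperServers.isEmpty() ? zookeeperServers : dataZookeeperServers;
        ZooKeeper dataZk = new ZooKeeper(dataQuorum, 60_000, event -> {});

        // ... clusterZk would serve cluster-management reads/writes, dataZk the ledger metadata ...
        dataZk.close();
        clusterZk.close();
    }
}
```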

Result

  • No impact on existing Pulsar clusters that use only one ensemble
  • The broker can store cluster-management configuration and ledger metadata in two separately provided ZooKeeper ensembles
  • The Pulsar service should come up even if the data ZK is not available, and it should continue to serve requests if the data ZK fails while the service is up.

@rdhabalia rdhabalia added the type/enhancement label Feb 8, 2017
@rdhabalia rdhabalia added this to the 1.17 milestone Feb 8, 2017
@rdhabalia rdhabalia self-assigned this Feb 8, 2017
@@ -906,6 +924,14 @@ public void setZooKeeperSessionTimeoutMillis(long zooKeeperSessionTimeoutMillis)
this.zooKeeperSessionTimeoutMillis = zooKeeperSessionTimeoutMillis;
}

public long getDataZooKeeperSessionTimeoutMillis() {
return dataZooKeeperSessionTimeoutMillis;
}
Contributor

Shouldn't we return zooKeeperSessionTimeout if dataZookeeperServers is empty?

Contributor Author

Actually, we initialize zooKeeperSessionTimeoutMillis=30000 and dataZooKeeperSessionTimeoutMillis=60000 as default values.
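For reference, the fallback the reviewer suggests could look roughly like the sketch below; this is a hypothetical standalone illustration, not what the PR currently does (the PR keeps the two defaults independent):

```java
// Hypothetical illustration of the suggested fallback; field names follow the PR's configuration.
public class SessionTimeoutFallbackSketch {
    private String dataZookeeperServers = "";                 // empty => no separate data quorum
    private long zooKeeperSessionTimeoutMillis = 30_000;      // default in the PR
    private long dataZooKeeperSessionTimeoutMillis = 60_000;  // default in the PR

    public long getDataZooKeeperSessionTimeoutMillis() {
        if (dataZookeeperServers == null || dataZookeeperServers.isEmpty()) {
            // No dataZk configured: reuse the main quorum's session timeout
            return zooKeeperSessionTimeoutMillis;
        }
        return dataZooKeeperSessionTimeoutMillis;
    }
}
```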

@@ -53,7 +54,11 @@

@Parameter(names = { "-zk",
"--zookeeper" }, description = "Local ZooKeeper quorum connection string", required = true)
private String zookeeper;
private String localZookeeper;
Contributor

Let us not use zookeeper and localZookeeper interchangeably. ServiceConfiguration refers to zookeeper. Let us stick to one naming convention.

Contributor Author

Sure. Updated it to zookeeper.

@saandrews
Contributor

👍 Can you check the build?

@merlimat
Contributor

@rdhabalia I'll get to review this one soon...

@merlimat
Contributor

merlimat commented Mar 5, 2017

The bigger question I have is how to take advantage of this, meaning which changes would be needed to "survive" a quorum loss on the data ZK and how to make sure that everything is back in order in every condition. :)

Anyway, is this still a big concern, given that apache/zookeeper#157 has been merged and will be available in zk-3.4.10?

@rdhabalia
Contributor Author

Here, the problem we are trying to fix is preventing a cold restart of the broker due to dataZK's large snapshot size:

  • Because dataZk has a larger snapshot size, the zk session at the broker frequently times out due to zk GC pauses or leader election. That triggers a cold restart of brokers, and all clients try to reconnect at the same time, which forces all brokers to load all topics simultaneously; this again creates significant back-pressure on ZooKeeper, which may again cause quorum loss in zk.

Therefore, if we have a separate dataZK and clusterManagementZK, then dataZk will not cause broker restarts, and the pulsar-broker can still survive without dataZk.

how to make sure that everything is back in order in every condition.

While dataZk is down, all clients fail when publishing/consuming and when creating producers/consumers. So there is no data loss; however, when dataZk comes back, that's the only signal we have that everything is back in order, and monitoring such as SLA-monitoring can help confirm the broker's stability.

Anyway, is this still a big concern, given that apache/zookeeper#157 has been merged and will be available in zk-3.4.10?

I think this ZK patch will help reduce leader-election time, and it will prevent zk-session timeouts at the broker and also prevent broker restarts. This will definitely help with this problem statement. But I think it's always good to protect the broker against zk-session timeouts.

@merlimat
Contributor

merlimat commented Mar 6, 2017

Therefore, if we have a separate dataZK and clusterManagementZK, then dataZk will not cause broker restarts, and the pulsar-broker can still survive without dataZk.

True, but that's not going to change much: you still cannot survive a quorum loss on clusterManagementZK, plus there would need to be significant changes in order to continue operating when losing the dataZk quorum.

While dataZk is down, all clients fail when publishing/consuming and when creating producers/consumers. So there is no data loss; however, when dataZk comes back, that's the only signal we have that everything is back in order, and monitoring such as SLA-monitoring can help confirm the broker's stability.

The tricky part is how to come back and have everything in sync between broker and ZK quorum. Eg: version numbers for transactions that were caught in-flight during the quorum loss.

I think this ZK patch will help reduce leader-election time, and it will prevent zk-session timeouts at the broker and also prevent broker restarts. This will definitely help with this problem statement. But I think it's always good to protect the broker against zk-session timeouts.

The point there is that, by making the leader election not dependent on the snapshot size, the chances of losing one or the other ZK quorum are not that different.

The only difference at that point would be the memory footprint of the quorum, and (apart from configuring the JVM heap accordingly) the easiest way to improve conditions there would be to use binary protobuf for the z-node content.

Even just doing that for ML and cursors (and, at least initially, not for BK ledgers) would considerably reduce the ZK data set size. I would expect the same info to take 1/10th of the size in binary vs text format.
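As a rough, self-contained illustration of the text-vs-binary gap (plain Java with an ad-hoc encoding, since protobuf-generated classes aren't shown here; the exact ratio depends on the field values):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Rough illustration only; the real metadata format would be protobuf, not this ad-hoc encoding.
public class TextVsBinarySizeSketch {
    public static void main(String[] args) {
        long ledgerId = 1234567890123L;
        long entryId = 987654321L;

        byte[] text = ("ledgerId: " + ledgerId + "\nentryId: " + entryId + "\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] binary = ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(ledgerId).putLong(entryId).array();

        System.out.println("text bytes:   " + text.length);   // several times larger
        System.out.println("binary bytes: " + binary.length); // 16 bytes; varints would shrink it further
    }
}
```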

@rdhabalia
Contributor Author

Ok, a few more points on it:

True, but that's not going to change much: you still cannot survive a quorum loss on clusterManagementZK

I think our main concern is that the larger footprint causes instability (longer GCs) and quorum loss in zk, which leads to cold restarts of the broker and has a recursive negative impact on both zk and the broker. So, clusterManagementZK will just store the ephemeral nodes such as bundle-ownership and load-balancer information, which keeps the snapshot size small, and that is the main way it prevents zk quorum loss.
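To make the ephemeral-node point concrete, here is a minimal sketch using the plain ZooKeeper API (the znode paths and payload are made up, not Pulsar's real layout): an ownership-style znode tied to the broker's session is removed automatically when that session ends, so it never accumulates in the snapshot the way ledger metadata does.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only; paths and data are hypothetical.
public class EphemeralOwnershipSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});

        // Parent znodes must exist; create them as persistent nodes for this sketch.
        if (zk.exists("/example", false) == null) {
            zk.create("/example", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists("/example/ownership", false) == null) {
            zk.create("/example/ownership", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral: ZooKeeper deletes this node when the broker's session ends.
        zk.create("/example/ownership/my-bundle",
                "broker-1.example.com:8080".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL);

        zk.close();
    }
}
```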

The tricky part is how to come back and have everything in sync between broker and ZK quorum. Eg: version numbers for transactions that were caught in-flight during the quorum loss.

This should be fixed by reading the latest zk-data and refreshing the version whenever the broker sees a bad-version error after the system comes back.
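A minimal sketch of that recovery path with the raw ZooKeeper API (the path and the recompute step are placeholders for the broker's real metadata update): on a bad-version error, re-read the znode, rebuild the payload on top of the current data, and retry with the fresh version.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only; the path and recompute() stand in for the broker's actual update logic.
public class BadVersionRetrySketch {
    static byte[] recompute(byte[] currentData) {
        return currentData; // placeholder: rebuild the intended update on top of the latest data
    }

    static void updateWithRetry(ZooKeeper zk, String path, byte[] newData, int expectedVersion)
            throws KeeperException, InterruptedException {
        try {
            zk.setData(path, newData, expectedVersion);
        } catch (KeeperException.BadVersionException e) {
            // Our cached version is stale (e.g. after the data quorum came back):
            // refresh from ZooKeeper and retry once with the current version.
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            zk.setData(path, recompute(current), stat.getVersion());
        }
    }
}
```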

@msb-at-yahoo
Contributor

Has anyone estimated the impact on snapshot size for workloads people care about as proposed in #281 (comment)?

@merlimat
Contributor

merlimat commented Mar 7, 2017

Also, it would be good to record worst-case GC pauses in ZK servers.

My suspicion is that, on a correctly configured machine (no swap, heap large enough for the data set), the GC pauses would be an order of magnitude shorter than what is needed to cause a zk server to be dropped from the quorum.

@merlimat
Contributor

merlimat commented Mar 7, 2017

Also, the risk of #281 is way lower compared to the risk involved in transitioning to two zk quorums.

@saandrews
Contributor

@merlimat, can we revisit this? We could still use this feature when we set up a new cluster.

@merlimat merlimat modified the milestones: 1.17, 1.18 Mar 31, 2017
@merlimat merlimat modified the milestones: 1.18, 1.19 Jun 14, 2017
sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
…ache#196)

* adding check for failures subroutine and fixing worker delete bug

* adding license header
@sijie
Member

sijie commented Dec 27, 2018

@rdhabalia are you still working on this pull request? or can it be closed?

hrsakai pushed a commit to hrsakai/pulsar that referenced this pull request Dec 10, 2020
* Fix consumer not found

Signed-off-by: xiaolong.ran <rxl@apache.org>

* fix ci error

Signed-off-by: xiaolong.ran <rxl@apache.org>
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
Master issue: apache#138

This PR adds tests for images of apache#195.

In addition, it fixes the local test error when running tests on macOS, because the containers cannot connect to the host's KoP or Kafka service listening on 127.0.0.1.



* Fix local tests error by listening to site local address

* Add tests for Kafka Java clients

* Remove the comments
@dave2wave
Member

This PR is evidently stale or abandoned. Reopen if this is not so.

@dave2wave dave2wave closed this Dec 16, 2021
@github-actions

@rdhabalia: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about docs, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks.)
