
Problems when ZK Leader shuts down #308

Closed
sschepens opened this issue Mar 22, 2017 · 4 comments
Labels
deprecated/question Questions should happened in GitHub Discussions

Comments

@sschepens
Contributor

@merlimat we had to replace our current local ZK leader in a cluster and this seems to cause a LOT of issues in the cluster.
Brokers seem to have shut down all at the same time, leaving the cluster unable to handle traffic until restarted.
Also, a lot of consumers seem to have been reset to a previous moment in time, generating a huge amount of backlog.

We see these logs before the broker apparently shut down:

March 22nd 2017, 15:31:43.156	2017-03-22 18:31:43,155 - INFO  - [main-SendThread(ip-10-64-102-223.ec2.internal:2181):ClientCnxn$SendThread@1158] - Unable to read additional data from server sessionid 0x35a384d476e093b, likely server has closed socket, closing socket connection and attempting reconnect
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:1 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - WARN  - [main-EventThread:LeaderElectionService$1@111] - Got something wrong on watch: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.258	2017-03-22 18:31:43,258 - WARN  - [main-EventThread:LeaderElectionService$1@92] - Type of the event is [None] and path is [null]
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,258 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:1 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperDataCache@131] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperSessionWatcher@87] - Received zookeeper notification, eventType=None, eventState=Disconnected
March 22nd 2017, 15:31:43.259	2017-03-22 18:31:43,259 - INFO  - [main-EventThread:ZooKeeperCache@346] - [State:CONNECTED Timeout:30000 sessionid:0x35a384d476e093b local:null remoteserver:null lastZxid:17254516425 xid:1622313 sent:1622313 recv:1791041 queuedpkts:2 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:Disconnected type:None path:null
March 22nd 2017, 15:31:43.587	2017-03-22 18:31:43,586 - INFO  - [main-SendThread(ip-10-64-102-117.ec2.internal:2181):ClientCnxn$SendThread@1032] - Opening socket connection to server ip-10-64-102-117.ec2.internal/10.64.102.117:2181. Will not attempt to authenticate using SASL (unknown error)

A couple of questions:
1 - Why do brokers shut down on ZK leader disconnection?
2 - Why could this affect the backlog of consumers? We're running with a branch that persists individualDeletedMessages.

@saandrews
Contributor

saandrews commented Mar 23, 2017 via email

@merlimat
Contributor

> we had to replace our current local ZK leader in a cluster and this seems to cause a LOT of issues in the cluster.

Restarting the leader should not cause a ZK quorum loss, provided the other 2 (or 4) servers are fine. Can you check the ZK server logs from the time the leader was taken out? There should be some lines that tell how long it took to elect a new leader.

If the leader election takes (overall) more than the session timeout configured in the brokers, the brokers will restart themselves.
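The rule above can be sketched in a few lines. This is an illustrative simulation (the function name and logic are mine, not Pulsar's actual code): a broker only survives a leader election if it can re-establish its ZK session before the session timeout elapses.

```python
# Hypothetical sketch of the survive-or-restart decision described above.
def broker_survives(election_ms: int, session_timeout_ms: int) -> bool:
    """True if the broker can reconnect before its ZK session expires."""
    return election_ms < session_timeout_ms

# With a 30 s session timeout, a 10 s election is survivable,
# but an election stalled past 30 s forces a broker restart.
assert broker_survives(10_000, 30_000)
assert not broker_survives(45_000, 30_000)
```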

The question here is why the leader election should take so much time. Can you share the hardware specs and ZK config?

Another important factor is the snapshot size, since leader election depends on it. If the metadata kept in ZK grows to hundreds of MB, it can start slowing down the leader election. Internally, we had a patch to fix that; it has been merged upstream and will be released in ZK 3.4.10 (https://issues.apache.org/jira/browse/ZOOKEEPER-2678).

If leader election timing is the cause in this case, increasing the session timeout in the brokers can let them "survive" and reconnect to the quorum before the session expires, avoiding the restart.
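The session-timeout knob mentioned above lives in the broker configuration. A hedged example follows; the exact key name has varied across Pulsar versions (newer releases moved to a metadata-store setting), so verify against your own broker.conf:

```properties
# Hypothetical broker.conf excerpt -- verify the key name for your version.
# Older brokers expose the ZooKeeper session timeout roughly like this:
zooKeeperSessionTimeoutMs=30000
# Raising it (e.g. to 60000) gives brokers more headroom to survive a slow
# ZK leader election, at the cost of slower detection of real failures.
```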

> 1 - Why do brokers shut down on ZK leader disconnection?

The main reason is to preserve the ownership of the topics. If I cannot ensure I have a lock on a particular set of topics, then some other broker will take them over. The second reason is the difficulty of reconciling the metadata state between what the broker has in memory and what is in ZK; the two can diverge during a session loss because of timeouts and other problems.
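The ownership argument can be illustrated with a small simulation. Assumptions are mine (the class and method names are illustrative, not Pulsar's actual code): topic ownership is held via ephemeral ZooKeeper znodes tied to a broker's session, so when the session expires, ZK deletes the locks and another broker may legitimately claim the topics.

```python
# Illustrative model of ephemeral-znode-based topic ownership.
class OwnershipTable:
    def __init__(self):
        self._locks = {}  # topic -> owning session id

    def acquire(self, topic: str, session_id: int) -> bool:
        # Mimics creating an ephemeral znode: fails if already owned.
        if topic in self._locks:
            return False
        self._locks[topic] = session_id
        return True

    def expire_session(self, session_id: int) -> None:
        # Mimics ZK deleting all ephemeral nodes of an expired session.
        self._locks = {t: s for t, s in self._locks.items() if s != session_id}

table = OwnershipTable()
assert table.acquire("persistent://tenant/ns/topic-1", session_id=1)
# A second broker cannot grab an owned topic.
assert not table.acquire("persistent://tenant/ns/topic-1", session_id=2)
table.expire_session(1)
# After expiry the lock is gone, so another broker takes over; this is
# why a broker that lost its session cannot keep serving the topic.
assert table.acquire("persistent://tenant/ns/topic-1", session_id=2)
```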

We have already been discussing improving this behavior by adding a "degraded" mode in which brokers with no ZK session can continue sending and delivering messages without attempting any metadata operations (no new topics, no ledger rollovers, etc.). The design is still up in the air, but it's an area we will surely improve.

> 2 - Why could this affect the backlog of consumers? We're running with a branch that persists individualDeletedMessages.

Not sure about this one. One possibility is that after the restart, the cursor recovery failed, which triggers a rollback. The logic there is that if we cannot recover a cursor due to a permanent error, and thus don't know its position, we roll it back to the oldest available message to avoid data loss.
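The rollback fallback described above can be sketched as follows. This is a hedged illustration (names and values are mine, not Pulsar's actual cursor-recovery code): when the stored position is unrecoverable, resetting to the earliest retained message trades a re-delivered backlog for guaranteed no data loss.

```python
# Illustrative sketch of the cursor-recovery fallback described above.
def recover_cursor(stored_position: int, recovery_ok: bool,
                   oldest_available: int) -> int:
    if recovery_ok:
        return stored_position
    # Position is unknown after a permanent error: roll back to the
    # oldest retained message rather than risk skipping data.
    return oldest_available

# Normal recovery keeps the cursor where it was.
assert recover_cursor(1500, True, 100) == 1500
# A failed recovery resets to message 100, producing a large backlog --
# consistent with what the reporter observed after the restart.
assert recover_cursor(1500, False, 100) == 100
```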

Can you share the logs in the topic loading phase after the restart?

@sijie sijie added deprecated/question Questions should happened in GitHub Discussions triage/week-34 labels Aug 22, 2018
@sijie
Member

sijie commented May 19, 2019

Closing this issue due to inactivity. Please reopen it if there is more we should look into.

@snowcrumble
Contributor

> The question here is why the leader election should take so much time. Can you share the hardware specs and ZK config?

My ZK election time is about 10 seconds; is that reasonable?

zk-0 is the new follower, zk-1 is the old leader (which was on the crashed node), and zk-2 is the new leader.

Below are the logs:
zk-0.log

04:51:47.726 [QuorumPeer[myid=1]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.QuorumPeer - FOLLOWING
04:51:47.727 [QuorumPeer[myid=1]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir data/zookeeper/version-2 snapdir data/zookeeper/version-2
04:51:47.727 [QuorumPeer[myid=1]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.Learner - FOLLOWING - LEADER ELECTION TOOK - 6405
04:51:47.728 [QuorumPeer[myid=1]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.QuorumPeer - Resolved hostname: zk-2.zookeeper.pulsar.svc.cluster.local to address: zk-2.zookeeper.pulsar.svc.cluster.local/10.30.2.2
04:51:47.740 [QuorumPeer[myid=1]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.Learner - Getting a diff from the leader 0x100024c1c

zk-2.log

04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.QuorumPeer - LEADING
04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir data/zookeeper/version-2 snapdir data/zookeeper/version-2
04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.Leader - LEADING - LEADER ELECTION TOOK - 10421
04:51:47.734 [LearnerHandler-/10.30.2.141:43982] INFO  org.apache.zookeeper.server.quorum.LearnerHandler - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@6fe50a68
