Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bookkeeper shutdown when zookeeper connection just timeout #2760

Closed
suiyuzeng opened this issue Aug 9, 2021 · 1 comment
Closed

Bookkeeper shutdown when zookeeper connection just timeout #2760

suiyuzeng opened this issue Aug 9, 2021 · 1 comment
Labels

Comments

@suiyuzeng
Copy link

suiyuzeng commented Aug 9, 2021

Describe the bug
Bookkeeper begin to shutdown after a zookeeper connection timeout. Session is not expired and it was attempting reconnect.
In the log, there is a ConnectionLossException when a node was deleted in org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger. It can not be processed in org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and shutdown was called.
image
It shoul be retried as syncCallWithRetries was called. One cause is that the maxRetries of the retry policy was setted to be 0. The zk client is initialized in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the retrying policy is created in org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
image
The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE?

To Reproduce

Steps to reproduce the behavior:

  1. Run bookkeeper
  2. release ledger lock when zk timeout

LOG:
06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x37b0a37639d001d, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]

06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x47b0a3221890013, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.859 [PendingReplicationTimer] ERROR org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting underreplicated ledger lock
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
06:18:45.867 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
... 4 more
06:18:45.867 [PendingReplicationTimer] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments of ledger 826574
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
... 3 more
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
06:18:45.867 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
06:18:45.874 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
06:18:46.528 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.Auditor - Shutting down auditor
06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker

zymap pushed a commit that referenced this issue Jan 10, 2022
### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: #2760
@StevenLuMT
Copy link
Member

bugfix in pr #2958

Shawyeok pushed a commit to Shawyeok/bookkeeper that referenced this issue Oct 21, 2022
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760

(cherry picked from commit 84ddc91)
dlg99 pushed a commit to dlg99/bookkeeper that referenced this issue Oct 24, 2022
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760

(cherry picked from commit 84ddc91)
Shawyeok pushed a commit to Shawyeok/bookkeeper that referenced this issue Oct 25, 2022
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760

(cherry picked from commit 84ddc91)
hangc0276 pushed a commit to hangc0276/bookkeeper that referenced this issue Nov 7, 2022
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760

(cherry picked from commit 84ddc91)
hangc0276 pushed a commit to hangc0276/bookkeeper that referenced this issue Nov 7, 2022
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760

(cherry picked from commit 84ddc91)
Ghatage pushed a commit to sijie/bookkeeper that referenced this issue Jul 12, 2024
…#2958)

### Motivation

1. bug fix for error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver,if set MaxRetries zero, zk client will throw ConnectionLossException when the zk has some changing,for example: zk leader node changed.

2. In Bookie's ZKClient, different BoundExponentialBackoffRetryPolicy set different MaxRetries,so change zkRetryBackoffMaxRetries to config in Bookie's AbstractConfiguration

### Changes

1. update a error config for BoundExponentialBackoffRetryPolicy in class ZKMetadataClientDriver
2. change zkRetryBackoffMaxRetries to config

Master Issue: apache#2760
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants