Bookkeeper shuts down when the ZooKeeper connection just times out #2760
This was referenced Dec 23, 2021
zymap pushed a commit that referenced this issue on Jan 10, 2022:
### Motivation
1. Fix an incorrect configuration of BoundExponentialBackoffRetryPolicy in ZKMetadataClientDriver: with MaxRetries set to zero, the ZooKeeper client throws ConnectionLossException whenever the ZooKeeper ensemble changes, for example when the ZooKeeper leader node changes.
2. In the bookie's ZooKeeper client, different BoundExponentialBackoffRetryPolicy instances need different MaxRetries values, so make zkRetryBackoffMaxRetries configurable in the bookie's AbstractConfiguration.
### Changes
1. Fix the incorrect BoundExponentialBackoffRetryPolicy configuration in ZKMetadataClientDriver.
2. Make zkRetryBackoffMaxRetries a configuration option.
Master Issue: #2760
Bugfix in PR #2958.
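A minimal sketch of the shape of this fix, assuming the BoundExponentialBackoffRetryPolicy(baseBackoffMs, maxBackoffMs, maxRetries) constructor; the configuration getter names, in particular getZkRetryBackoffMaxRetries() for the zkRetryBackoffMaxRetries setting mentioned above, are assumptions and not verified against the merged patch:

```java
import org.apache.bookkeeper.conf.AbstractConfiguration;
import org.apache.bookkeeper.zookeeper.BoundExponentialBackoffRetryPolicy;
import org.apache.bookkeeper.zookeeper.RetryPolicy;

public class RetryPolicySketch {
    // Build the operation retry policy from configuration instead of hard-coding
    // maxRetries = 0, which is what made a transient ZooKeeper disconnect surface
    // as a ConnectionLossException with no retry.
    static RetryPolicy operationRetryPolicy(AbstractConfiguration<?> conf) {
        return new BoundExponentialBackoffRetryPolicy(
                conf.getZkRetryBackoffStartMs(),      // base backoff between retries
                conf.getZkRetryBackoffMaxMs(),        // upper bound on backoff
                conf.getZkRetryBackoffMaxRetries());  // assumed getter for zkRetryBackoffMaxRetries
    }
}
```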
The same commit (apache#2958, cherry-picked from commit 84ddc91) was subsequently pushed to the Shawyeok/bookkeeper, dlg99/bookkeeper, hangc0276/bookkeeper, and sijie/bookkeeper forks between Oct 2022 and Jul 2024.
Describe the bug
BookKeeper begins to shut down after a ZooKeeper connection timeout, even though the session has not expired and the client is attempting to reconnect.
In the log there is a ConnectionLossException when a node is deleted in org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger. The exception cannot be handled in org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease, so shutdown is called.
The call should be retried, since it goes through syncCallWithRetries. One cause is that the maxRetries of the retry policy is set to 0. The ZooKeeper client is initialized in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize, and the retry policy is created in org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
With maxRetries set to 0, the operation is never retried. Can we set it to Integer.MAX_VALUE?
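To make the failure mode concrete, here is a small, illustrative sketch (not code from the BookKeeper tree) of why maxRetries = 0 disables retries in ZooWorker.syncCallWithRetries, assuming the BoundExponentialBackoffRetryPolicy(baseBackoffMs, maxBackoffMs, maxRetries) constructor and the RetryPolicy#allowRetry(retryCount, elapsedMs) contract; the backoff values are arbitrary:

```java
import org.apache.bookkeeper.zookeeper.BoundExponentialBackoffRetryPolicy;
import org.apache.bookkeeper.zookeeper.RetryPolicy;

public class MaxRetriesDemo {
    public static void main(String[] args) {
        // Policy with maxRetries = 0, as reportedly created via ZKMetadataClientDriver#initialize.
        RetryPolicy noRetries = new BoundExponentialBackoffRetryPolicy(100, 10_000, 0);
        // Policy with a large retry budget, as proposed in the description above.
        RetryPolicy manyRetries = new BoundExponentialBackoffRetryPolicy(100, 10_000, Integer.MAX_VALUE);

        // retryCount = 1 stands for the first retry after the initial attempt failed
        // with ConnectionLossException.
        System.out.println(noRetries.allowRetry(1, 0));   // false -> exception surfaces, worker shuts down
        System.out.println(manyRetries.allowRetry(1, 0)); // true  -> syncCallWithRetries keeps retrying until reconnect
    }
}
```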
To Reproduce
Steps to reproduce the behavior:
LOG:
06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x37b0a37639d001d, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x47b0a3221890013, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.859 [PendingReplicationTimer] ERROR org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting underreplicated ledger lock
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
06:18:45.867 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
... 4 more
06:18:45.867 [PendingReplicationTimer] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments of ledger 826574
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
... 3 more
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
06:18:45.867 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
06:18:45.874 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
06:18:46.528 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.Auditor - Shutting down auditor
06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker