Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proxy]If zk leader was down, proxy was not worked until restart it #6305

Closed
snowcrumble opened this issue Feb 12, 2020 · 4 comments
Closed
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@snowcrumble
Copy link
Contributor

Describe the bug
My pulsar is running on k8s and our service is using pulsar proxy, I have did some failure testing.
When I force crash the node which serve the zk leader, all my pulsar broker crash several times because zk connection problem, they were started final, but I must restart pulsar proxy after brokers were availabled, mentioned by #6158 .

To Reproduce
Steps to reproduce the behavior:

  1. Deploy pulsar on k8s
  2. use pulsar-perf tool to connect pulsar proxy and continuous pub/sub on a topic
  3. "force" crash a k8s node which has zk leader
  4. The reconnect mechanism doesn't work until restart pulsar proxys.

Expected behavior
Proxy can recover without restart.
Or broker keep available without restart.

Screenshots
Broker error log

04:55:22.078 [main] ERROR org.apache.pulsar.zookeeper.GlobalZooKeeperCache - Failed to establish global zookeeper session: null
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
	at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) [org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
04:55:22.082 [main] ERROR org.apache.pulsar.broker.PulsarService - java.io.IOException: java.util.concurrent.TimeoutException
org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:574) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
Caused by: java.io.IOException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:76) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	... 3 more
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
	at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	... 3 more
04:55:22.083 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
org.apache.pulsar.broker.PulsarServerException: org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:472) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
Caused by: org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:574) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	... 2 more
Caused by: java.io.IOException: java.util.concurrent.TimeoutException
	at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:76) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	... 2 more
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
	at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2]
	... 2 more
04:55:22.084 [pulsar-ordered-OrderedExecutor-3-0-SendThread(zookeeper:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 30005ms for sessionid 0x0

zk-2.log (new leader)

04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.QuorumPeer - LEADING
04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir data/zookeeper/version-2 snapdir data/zookeeper/version-2
04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.Leader - LEADING - LEADER ELECTION TOOK - 10421
04:51:47.734 [LearnerHandler-/10.30.2.141:43982] INFO  org.apache.zookeeper.server.quorum.LearnerHandler - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@6fe50a68

perf-tool output

06:14:00.049 [pulsar-timer-5-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Reconnecting after connection was closed
06:14:00.111 [pulsar-client-io-2-2] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xfa71e541, L:/172.17.0.2:47758 - R:/172.16.11.200:6650]] Connected to server
06:14:00.114 [main] INFO  org.apache.pulsar.testclient.PerformanceProducer - Throughput produced:      0.0  msg/s ---      0.0 Mbit/s --- failure      0.0 msg/s --- Latency: mean:   0.000 ms - med:   0.000 - 95pct:   0.000 - 99pct:   0.000 - 99.9pct:   0.000 - 99.99pct:   0.000 - Max:   0.000
06:14:00.345 [pulsar-client-io-2-2] ERROR org.apache.pulsar.client.impl.ClientCnx - [id: 0xfa71e541, L:/172.17.0.2:47758 - R:/172.16.11.200:6650] Close connection because received internal-server error No active broker is available
06:14:00.346 [pulsar-client-io-2-2] WARN  org.apache.pulsar.client.impl.BinaryProtoLookupService - [persistent://public/default/topic-perf] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available
06:14:00.347 [pulsar-client-io-2-2] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available
06:14:00.348 [pulsar-client-io-2-2] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available -- Will try again in 57.763 s
06:14:00.348 [pulsar-client-io-2-2] INFO  org.apache.pulsar.client.impl.ClientCnx - [id: 0xfa71e541, L:/172.17.0.2:47758 ! R:/172.16.11.200:6650] Disconnected
@sijie
Copy link
Member

sijie commented Feb 14, 2020

Try to configure the proxies to connect to the brokers using service urls instead of using zookeeper. It will reduce the impact from zookeeper lead was done.

brokerServiceURL=
brokerServiceURLTLS=

Regarding the broker restarts when zookeeper sessions are expired, @codelipenghui is working an enhancement to improve it.

@snowcrumble
Copy link
Contributor Author

Try to configure the proxies to connect to the brokers using service urls instead of using zookeeper. It will reduce the impact from zookeeper lead was done.

brokerServiceURL=
brokerServiceURLTLS=

Regarding the broker restarts when zookeeper sessions are expired, @codelipenghui is working an enhancement to improve it.

I tried to set a long enough zooKeeperSessionTimeoutMillis=1200000 that broker would not exit for session reason, but I don't know if it has some side-effect.

@codelipenghui
Copy link
Contributor

@snowcrumble Looks related to #6259

@snowcrumble snowcrumble changed the title [Proxy]If zk leader was down, proxy was not work until restart it [Proxy]If zk leader was down, proxy was not worked until restart it Feb 14, 2020
@jiazhai
Copy link
Member

jiazhai commented Feb 17, 2020

Would like to close this since this is a dup and tracked in 6259

@jiazhai jiazhai closed this as completed Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

4 participants