Retry when zookeeper session expired at broker side. #6259
Comments
For the Pulsar broker the issue is slightly more complicated, since there's the concept of topic ownership. We also need to make sure to sync back all the metadata versions to account for requests that failed during the session-expired period.
Yes, thanks for reminding me. I will try to implement it. If I encounter problems, I will ask here.
When a session timeout happens on the local ZooKeeper, all of the EPHEMERAL znodes maintained by this broker are deleted automatically. We need some mechanism to avoid unnecessary ownership transfer of the bundles. Since the broker caches its owned bundles in memory, it can use that cache to re-own them. First, the broker should check whether the znode for the bundle exists and whether the owner is this broker. If the znode exists and the owner is this broker, the znode may simply not have been deleted yet; in that case the broker should check whether the ephemeral owner is the current session ID and, if not, wait for the znode to be deleted. Then the broker tries to own the bundle. If owning the bundle succeeds, it means the bundle is not owned by another broker, and the broker should check whether to preload the topics under the bundle. If owning the bundle fails, it means the bundle is owned by another broker, and the broker should unload it (see the sketch below). Please help check whether this idea for handling ownership is correct. I checked the other places using the local ZooKeeper, and they should be easy to handle. If the idea is right, I will draft a PIP.
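To make the flow concrete, here is a rough sketch in raw ZooKeeper API terms, not actual Pulsar code; the class name, `ownedBundlesCache`, the `/namespace/...` path, `preloadTopics()` and `unloadBundle()` are placeholders made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/**
 * Hypothetical sketch of re-owning cached bundles after the local ZooKeeper
 * session is re-established. Not the real Pulsar ownership-cache code.
 */
public class BundleReOwner {
    private final ZooKeeper zk;                   // handle with the new session
    private final String brokerUrl;               // this broker's lookup address
    private final Set<String> ownedBundlesCache;  // bundles owned before the expiry

    public BundleReOwner(ZooKeeper zk, String brokerUrl, Set<String> ownedBundlesCache) {
        this.zk = zk;
        this.brokerUrl = brokerUrl;
        this.ownedBundlesCache = ownedBundlesCache;
    }

    public void reOwnCachedBundles() throws Exception {
        for (String bundle : ownedBundlesCache) {
            String path = "/namespace/" + bundle;  // assumed ownership znode path

            Stat stat = zk.exists(path, false);
            if (stat != null && isOwnedByThisBroker(path)) {
                // The stale ephemeral znode from the old session may not have
                // been deleted yet; wait until ZooKeeper removes it.
                if (stat.getEphemeralOwner() != zk.getSessionId()) {
                    waitForDeletion(path);
                }
            }

            try {
                // Try to re-acquire ownership with the new session.
                zk.create(path, brokerUrl.getBytes(StandardCharsets.UTF_8),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                preloadTopics(bundle);   // success: decide whether to preload topics
            } catch (KeeperException.NodeExistsException e) {
                unloadBundle(bundle);    // another broker owns the bundle now
            }
        }
    }

    private boolean isOwnedByThisBroker(String path) throws Exception {
        try {
            byte[] data = zk.getData(path, false, null);
            return data != null && brokerUrl.equals(new String(data, StandardCharsets.UTF_8));
        } catch (KeeperException.NoNodeException e) {
            return false;  // znode was deleted between exists() and getData()
        }
    }

    private void waitForDeletion(String path) throws Exception {
        // Simple polling; a real implementation would set a watch instead.
        while (zk.exists(path, false) != null) {
            TimeUnit.MILLISECONDS.sleep(100);
        }
    }

    private void preloadTopics(String bundle) { /* placeholder */ }

    private void unloadBundle(String bundle) { /* placeholder */ }
}
```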
I was running into some issues where there were timeouts in the Pulsar client application logs and the Pulsar brokers were restarting. This was happening in a load test, and the broker restarts seemed to make the problem worse. To mitigate the issue, the brokers were recently configured with the zookeeperSessionExpiredPolicy=reconnect setting. The broker's ZooKeeper timeout-related settings are at their default values (zooKeeperSessionTimeoutMillis=30000, zooKeeperOperationTimeoutSeconds=30), since it seems odd that the ZooKeeper interactions would take longer when CPU consumption on ZooKeeper looks very low in the load test; the settings are summarized in the snippet after this comment. Now we reran the load test and the brokers became unavailable. The logs were filled with this type of error:
In the thread dump, there are quite a few hanging threads. Some stack traces are similar to what #8406 fixes.
Full thread dump of the broker:
The reason why I'm bringing this up in this issue is that a ZooKeeper timeout exception in the broker logs can actually be caused by threads hanging in the broker. Hopefully #8406 also gets fixed in the upcoming Pulsar 2.6.2 version. UPDATE: I now noticed that there's an incoming PR #8304 (a fix for #4635, which is also a deadlock).
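For reference, a minimal broker.conf excerpt with the settings mentioned in the comment above might look like this (values taken directly from the comment; the rest of the configuration is omitted):

```properties
# Reconnect instead of shutting the broker down when the ZooKeeper session expires
zookeeperSessionExpiredPolicy=reconnect

# ZooKeeper timeout settings left at their defaults
zooKeeperSessionTimeoutMillis=30000
zooKeeperOperationTimeoutSeconds=30
```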
Is your feature request related to a problem? Please describe.
Currently, when a broker receives a ZooKeeper session-expired event, the broker shuts itself down. When the broker or the ZooKeeper server is under high load and the session timeout is short, this can easily cause the broker to go down. There is a related issue: #6251.
In Apache BookKeeper, when a ZooKeeper session-expired event occurs, the bookie re-registers its metadata. The broker could follow similar processing (see the sketch below).
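A minimal sketch of what "reconnect and re-register instead of shutting down" could look like with the raw ZooKeeper client; the class name, the registration znode path and the recovery thread are placeholders, not the actual Pulsar or BookKeeper implementation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Hypothetical sketch: on session expiry, open a new ZooKeeper session and
 * re-create the broker's ephemeral registration znode, similar in spirit to
 * how a bookie re-registers its metadata.
 */
public class ReRegisterOnExpiry {
    private final String connectString;
    private final int sessionTimeoutMs;
    private final String brokerZnodePath;  // placeholder for the broker's registration znode
    private final byte[] brokerData;
    private volatile ZooKeeper zk;

    public ReRegisterOnExpiry(String connectString, int sessionTimeoutMs,
                              String brokerZnodePath, byte[] brokerData) {
        this.connectString = connectString;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.brokerZnodePath = brokerZnodePath;
        this.brokerData = brokerData;
    }

    public void start() throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        zk = new ZooKeeper(connectString, sessionTimeoutMs, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            } else if (event.getState() == Watcher.Event.KeeperState.Expired) {
                // Recover on a separate thread; never block the ZK event thread.
                new Thread(this::reconnectAndReRegister, "zk-session-recovery").start();
            }
        });
        connected.await(sessionTimeoutMs, TimeUnit.MILLISECONDS);
        if (zk.exists(brokerZnodePath, false) == null) {
            zk.create(brokerZnodePath, brokerData, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);
        }
    }

    private void reconnectAndReRegister() {
        try {
            zk.close();
            start();  // open a new session, then re-register the ephemeral znode
            // A full implementation would also re-own bundles and sync back
            // metadata versions, as discussed in the comments above.
        } catch (Exception e) {
            // If recovery fails, falling back to a shutdown is still possible.
            throw new RuntimeException("Failed to recover from ZK session expiry", e);
        }
    }
}
```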