[Proxy] Race condition in Pulsar Proxy that causes UnsupportedOperationExceptions in Proxy logs #13923
Comments
I'm working on a fix for this issue.
One explanation of why the current solution doesn't work as expected can be found in the Javadoc of FlowControlHandler: https://netty.io/4.1/api/io/netty/handler/flow/FlowControlHandler.html . The Pulsar Proxy doesn't currently use FlowControlHandler, so multiple messages can be in the pipeline even though the implementation calls read one at a time.
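The effect described above can be modelled without Netty. The following is a minimal sketch, assuming a hypothetical decoder that splits one socket read into several messages; `FlowControlSketch`, `decode`, and `OneMessagePerRead` are illustrative names, not Pulsar or Netty classes. It shows that a single read can hand several messages downstream, and how buffering them releases only one message per read request.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class FlowControlSketch {
    /** Hypothetical decoder: one socket read may contain several comma-separated messages. */
    public static List<String> decode(String chunk) {
        return Arrays.asList(chunk.split(","));
    }

    /** Buffers decoded messages and releases at most one per read() request. */
    public static final class OneMessagePerRead {
        private final Queue<String> pending = new ArrayDeque<>();

        /** Called when the socket delivers a chunk; every decoded message is queued. */
        public void onSocketRead(String chunk) {
            pending.addAll(decode(chunk));
        }

        /** Returns the next buffered message, or null if nothing is buffered. */
        public String read() {
            return pending.poll();
        }
    }

    public static void main(String[] args) {
        String chunk = "CONNECT,PING,SEND";

        // Without buffering, a single read hands all three messages downstream at once.
        System.out.println("unbuffered: " + decode(chunk));

        // With buffering, each read() request yields exactly one message.
        OneMessagePerRead gate = new OneMessagePerRead();
        gate.onSocketRead(chunk);
        System.out.println("first read: " + gate.read());
        System.out.println("second read: " + gate.read());
    }
}
```

In a real Netty pipeline the buffering role is played by FlowControlHandler; this model only illustrates why calling read one at a time is not enough on its own.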
Fixes apache#14075
Fixes apache#13923
- Optimize the proxy connection to fail fast if the target broker isn't active
  - This reduces the number of hanging connections because unavailable brokers are no longer contacted unnecessarily.
  - The Pulsar client will retry connecting after a back-off timeout.
- Fix the race condition in the Pulsar Proxy when opening a connection, since it could lead to invalid states and hanging connections
- Add connect timeout handling to the proxy connection
  - Defaults to 10000 ms, which is also the default of the client's connect timeout.
- Add read timeout handling to the incoming connection and the proxied connection
  - The ping/pong keepalive messages should prevent the timeout from happening; however, the connection can be in a state where keepalives aren't happening.
  - It is therefore better to have a connection-level read timeout that prevents broken connections from being left hanging in the proxy.
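The two timeouts in the commit message can be illustrated with plain `java.net` sockets. This is a sketch under stated assumptions, not the proxy's Netty code: `TimeoutSketch`, `open`, and `readTimesOut` are hypothetical names, and the 200 ms read timeout in `main` is an arbitrary demo value. Only the 10000 ms connect timeout comes from the commit message.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutSketch {
    // 10000 ms mirrors the connect-timeout default named in the commit message.
    public static final int CONNECT_TIMEOUT_MS = 10_000;

    /** Connects with a bounded connect time and arms a read timeout on the socket. */
    public static Socket open(InetSocketAddress target, int readTimeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(target, CONNECT_TIMEOUT_MS); // fail fast if the target is unreachable
            socket.setSoTimeout(readTimeoutMs);         // blocking reads give up after this long
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    /** Returns true if the peer sent nothing for the whole read timeout. */
    public static boolean readTimesOut(Socket socket) throws IOException {
        try {
            socket.getInputStream().read();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // A listening socket that never writes stands in for a connection without keepalives.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket socket = open(new InetSocketAddress(loopback, silentPeer.getLocalPort()), 200)) {
            System.out.println("read timed out: " + readTimesOut(socket));
        }
    }
}
```

A caller that sees the read time out would close the connection instead of leaving it hanging, which is the behaviour the last bullet argues for.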
This problem remains. This log message appears in production environments and also in integration tests. A recent example: https://github.com/apache/pulsar/actions/runs/3927573395/jobs/6714826271#step:12:7081
One interesting detail is that there's
Is the issue how we handle the upstream broker closing the channel? (pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/DirectProxyHandler.java, lines 485 to 487 in d6fcdb8)
I don't yet know the lifecycle for the
@lhotari great observation about the
Additional observations based on this test log: https://github.com/apache/pulsar/actions/runs/3927573395/jobs/6714826271#step:12:7081. We see this log line once:
We see this one 39 times (where the broker's number changes):
The connection to the proxy that fails is associated with this log line:
Intriguingly, port 36040 is never referenced as having connected to the broker. That makes sense from the proxy side, but it doesn't make sense from the client side because the client sent a
I am pretty confident I found the root cause. If you update the
The one detail I haven't found is why the callback is sometimes run late. Note that this explanation aligns closely with the earlier observation that we didn't see the client log about connecting to the broker. The log would have come from here: pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java, lines 264 to 270 in f360807
One explanation could be that it is dependent on which thread we're in, as this code suggests:
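As a generic illustration of thread-dependent callback timing (not the code the comment points to), the sketch below assumes a single-threaded executor standing in for an event loop. `CallbackTimingSketch`, `notifyListener`, and `ranBeforeReturn` are hypothetical names. A callback registered from the loop thread runs inline, while one registered from any other thread is queued and runs only after the work already in the queue.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class CallbackTimingSketch {
    public final ExecutorService loop = Executors.newSingleThreadExecutor();
    private final Thread loopThread;

    public CallbackTimingSketch() throws Exception {
        // Ask the executor's only worker thread to identify itself.
        Callable<Thread> whoAmI = Thread::currentThread;
        loopThread = loop.submit(whoAmI).get();
    }

    /** Runs the callback inline on the loop thread, otherwise queues it on the loop. */
    public void notifyListener(Runnable callback) {
        if (Thread.currentThread() == loopThread) {
            callback.run();         // same thread: finished before this method returns
        } else {
            loop.execute(callback); // other thread: runs later, after already queued tasks
        }
    }

    /** Reports whether the callback had already run when notifyListener returned. */
    public boolean ranBeforeReturn() {
        AtomicBoolean ran = new AtomicBoolean();
        notifyListener(() -> ran.set(true));
        return ran.get();
    }

    public static void main(String[] args) throws Exception {
        CallbackTimingSketch sketch = new CallbackTimingSketch();
        try {
            CountDownLatch busy = new CountDownLatch(1);
            // Occupy the loop thread so a queued callback cannot run yet.
            sketch.loop.execute(() -> {
                try {
                    busy.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            System.out.println("from another thread: " + sketch.ranBeforeReturn());
            busy.countDown();

            Callable<Boolean> onLoop = sketch::ranBeforeReturn;
            System.out.println("from the loop thread: " + sketch.loop.submit(onLoop).get());
        } finally {
            sketch.loop.shutdown();
        }
    }
}
```

If the loop thread is busy when the callback is queued, the callback is delayed by however long the earlier tasks take, which is one way a callback can appear to run late.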
Awesome job on the investigation, @michaeljmarshall!
Describe the bug
UnsupportedOperationExceptions commonly appear in the Proxy logs. This particular issue was reproduced very often when geo-replication was configured between two clusters.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The race conditions should be handled in Pulsar Proxy