
blockingUnaryCall hangs forever if retries are enabled #10838

Closed
turchenkoalex opened this issue Jan 19, 2024 · 1 comment
turchenkoalex commented Jan 19, 2024

What version of gRPC-Java are you using?

1.61.0 with grpc-netty-shaded

What is your environment?

jdk 17, linux

What did you expect to see?

The blockingUnaryCall invocation should not block forever.

What did you see instead?

The thread hangs even though a deadline is set.

jstack output showing the hang:

  "main" #1 [8707] prio=5 os_prio=31 cpu=207.44ms elapsed=16.32s tid=0x0000000132808200 nid=8707 waiting on condition  [0x000000016fcda000]
     java.lang.Thread.State: WAITING (parking)
  	at jdk.internal.misc.Unsafe.park(java.base@21.0.1/Native Method)
  	- parking to wait for  <0x000000061e400010> (a io.grpc.stub.ClientCalls$ThreadlessExecutor)
  	at java.util.concurrent.locks.LockSupport.park(java.base@21.0.1/LockSupport.java:221)
  	at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:717)
  	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:159)
  	at com.example.ExampleServiceGrpc$ExampleServiceBlockingStub.unaryCall(ExampleServiceGrpc.java:160)
  	at com.example.TestClient.call(TestClient.java:70)
  	at com.example.TestClient.callWithError(TestClient.java:53)
  	at com.example.Main.main(Main.java:52)

Steps to reproduce the bug

I'm using a custom service config with retries enabled, and the blocking gRPC call is made with a deadline.

If the deadline expires while the call is waiting for the next retry attempt (i.e. during the retry backoff), the call hangs.

    Duration deadline = Duration.ofSeconds(1);
    Map<String, ?> serviceConfig = Map.of("methodConfig",
            List.of(
                    Map.of(
                            "name", List.of(Map.of()),
                            "retryPolicy", Map.of(
                                    "maxAttempts", 4D,
                                    "initialBackoff", "10s",
                                    "maxBackoff", "10s",
                                    "backoffMultiplier", 1D,
                                    "retryableStatusCodes", List.of("UNAVAILABLE", "UNKNOWN")
                            )
                    )
            )
    );

    var channel = ManagedChannelBuilder.forAddress("localhost", port)
                .usePlaintext()
                .enableRetry()
                .disableServiceConfigLookUp()
                .defaultServiceConfig(serviceConfig)
                .build();
    
    var blockingStub = ExampleServiceGrpc.newBlockingStub(channel);

    // If the first call to grpc returns UNKNOWN or the server is unavailable, a hang will occur.
    blockingStub
        .withDeadlineAfter(deadline.toMillis(), TimeUnit.MILLISECONDS)
        .unaryCall(req);
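
A minimal sketch of a server that should be enough to trigger the hang with the client above: it always fails with a retryable status, so the client schedules a retry and is still sitting in the 10s backoff when the 1s deadline fires. ExampleServiceGrpc, ExampleRequest and ExampleResponse are placeholder names for the generated stubs here, not necessarily what the example repository uses.

    import io.grpc.Server;
    import io.grpc.ServerBuilder;
    import io.grpc.Status;
    import io.grpc.stub.StreamObserver;

    // Always fails with UNAVAILABLE so the client enters its retry backoff.
    public class FailingServer extends ExampleServiceGrpc.ExampleServiceImplBase {

      @Override
      public void unaryCall(ExampleRequest request,
                            StreamObserver<ExampleResponse> responseObserver) {
        responseObserver.onError(Status.UNAVAILABLE
            .withDescription("always failing to force a retry")
            .asRuntimeException());
      }

      public static void main(String[] args) throws Exception {
        Server server = ServerBuilder.forPort(50051)   // port must match the client
            .addService(new FailingServer())
            .build()
            .start();
        server.awaitTermination();
      }
    }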

Because the bug is difficult to reproduce, I've prepared an example in a separate repository that reliably triggers it.
Example to reproduce here - https://github.com/turchenkoalex/grpc-hang

I tested this behavior with several versions of the gRPC library and found that the issue first appears in version 1.52.0.

As far as I understand, the problem is related to the substream counter inFlightSubStreams in io.grpc.internal.RetriableStream.

When the deadline timer calls the cancel method on RetriableStream, the pending retryFuture is also cancelled, but the inFlightSubStreams counter is not decremented. Because of that, the safeCloseMasterListener method never closes the master listener, and the blocked caller is never released.
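
A self-contained toy model of that accounting, assuming the counter behaves the way I described (this is not the actual RetriableStream code): a slot is reserved when the retry is scheduled, but cancelling the pending future never releases it, so the count can never reach zero.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Toy model of the suspected accounting bug, not the real RetriableStream code.
    public class RetryAccountingToy {
      static final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();
      static final AtomicInteger inFlightSubStreams = new AtomicInteger();
      static volatile ScheduledFuture<?> retryFuture;

      static void scheduleRetry() {
        inFlightSubStreams.incrementAndGet();               // reserve a slot for the future attempt
        retryFuture = scheduler.schedule(
            () -> { inFlightSubStreams.decrementAndGet(); }, // released only if the attempt runs
            10, TimeUnit.SECONDS);                          // 10s backoff, as in the service config
      }

      static void cancel() {
        if (retryFuture != null) {
          retryFuture.cancel(false);                        // the attempt never runs...
          // ...and nothing releases the reservation made in scheduleRetry()
        }
        // The "close when the count reaches zero" condition is never satisfied, so this prints 1.
        System.out.println("in flight after cancel: " + inFlightSubStreams.get());
      }

      public static void main(String[] args) {
        scheduleRetry();   // retry scheduled during the backoff
        cancel();          // the 1s deadline fires and cancels the stream
        scheduler.shutdownNow();
      }
    }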

I tried decrementing the counter explicitly when retryFuture is cancelled and tested the new behavior; the hang is gone.

Here's the diff:

diff --git a/core/src/main/java/io/grpc/internal/RetriableStream.java b/core/src/main/java/io/grpc/internal/RetriableStream.java
index f301eee1f..07fe9e764 100644
--- a/core/src/main/java/io/grpc/internal/RetriableStream.java
+++ b/core/src/main/java/io/grpc/internal/RetriableStream.java
@@ -195,7 +195,10 @@ abstract class RetriableStream<ReqT> implements ClientStream {
             }
           }
           if (retryFuture != null) {
-            retryFuture.cancel(false);
+            boolean cancelled = retryFuture.cancel(false);
+            if (cancelled) {
+              inFlightSubStreams.decrementAndGet();
+            }
           }
           if (hedgingFuture != null) {
             hedgingFuture.cancel(false);

But maybe I'm missing something.

ejona86 (Member) commented Jan 19, 2024

Duplicate of #10336

That looks like some good investigation. That other issue was stalled on the investigation.
