Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netty connector hang up after repeated buffer overflow errors when writing data #5753

Closed
scottoaks17 opened this issue Oct 1, 2024 · 5 comments

Comments

@scottoaks17
Copy link

scottoaks17 commented Oct 1, 2024

We have some code that repeatedly makes async JAX-RS calls requests using the Netty Connector; the basic call is quite simple:

CompletableFuture<Response> cf = invokeResponse().toCompletableFuture()
                .whenComplete((rsp, t) -> {
                    if (t != null) {
                        System.out.println(Thread.currentThread() + " async complete. Caught exception " + t);
                    }
                })
                .handle((rsp, t) -> {
                    if (rsp != null) {
                        rsp.readEntity(String.class);
                    } else {
                        System.out.println(Thread.currentThread().getName() + " response is null");
                    }
                    return rsp;
                }).exceptionally(t -> {
                    System.out.println("async complete. completed exceptionally " + t);
                    throw new RuntimeException(t);
        });

The attached example program executes these calls within a loop, for a given number of threads. Under normal circumstances, this works fine: all the requests go through and get processed, and the test program exits. Sometimes, however, the test program will hang: it will simply cease processing the remaining results.

We have tracked this down to a problem when the writes to the remote system are buffered within JerseyChunkedInput.write(), which puts the data on a queue of size 8 (by default). If that queue is full, the the write() method throws an IOException. For a single request, this IOException is propagated up correctly; the CompletableFuture reports the exception and processing can continue. After some number of these exceptions, however, the program hangs: the netty stack has somehow lost track of its callbacks/promises.

The easiest way to reproduce this is to modify the JerseyChunkedInput.write() method to periodically throw an IOException, something like this:

private static java.util.concurrent.atomic.AtomicInteger ai = new java.util.concurrent.atomic.AtomicInteger(0);
    private void write(Provider<ByteBuffer> bufferSupplier) throws IOException {

        checkClosed();
    
        if ((ai.getAndIncrement() % 100) == 0) throw new IOException("BOGUS BUFFER OVERFLOW");
        try {
            boolean queued = queue.offer(bufferSupplier.get(), WRITE_TIMEOUT, TimeUnit.MILLISECONDS);
            if (!queued) {
                throw new IOException("Buffer overflow.");
            }

        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }

WIth that in place, the attached test program will run for awhile. After about 28 calls, it will get quite sluggish, and after about 34 calls it will hang altogether. (Those numbers may likely be different on other systems.

To run the attached program, the pom dependency is

        <dependency>
            <groupId>org.glassfish.jersey.connectors</groupId>
            <artifactId>jersey-netty-connector</artifactId>
            <version>2.45</version>
        </dependency>

Then it needs three arguments: the URL to call, the number of threads, and the number of time each thread should call, so something like:
java <cp> com.oracle.psr.nettybug.NettyBug http://100.105.9.29:7001/console/login/LoginForm.jsp 5 10

nettybug.tar.gz

@jbescos
Copy link
Member

jbescos commented Oct 2, 2024

I am running and the program finishes normally. Could you provide a thread dump when it hangs?

Oct 02, 2024 7:41:44 AM org.glassfish.jersey.client.innate.inject.NonInjectionManager <init>
WARNING: Jersey-HK2 module is missing. Falling back to injection-less client. Injection may not be supported on the client.
Oct 02, 2024 7:41:44 AM org.glassfish.jersey.message.internal.MessagingBinders$EnabledProvidersBinder bindToBinder
WARNING: A class javax.activation.DataSource for a default provider MessageBodyWriter<javax.activation.DataSource> was not found. The provider is not available.
Done call 3
Done call 2
Done call 1
Done call 4
Done call 5
Done call 6
Done call 7
Done call 8
Done call 9
Done call 10
Done call 11
Done call 12
Done call 13
Done call 14
Done call 15
Done call 16
Done call 17
Done call 18
Done call 19
Done call 20
Done call 21
Done call 22
Done call 23
Done call 24
Done call 25
Done call 26
Done call 27
Done call 28
Done call 29
Done call 30
Done call 31
Done call 32
Done call 33
Done call 34
Done call 35
Done call 36
Done call 37
Done call 38
Done call 39
Done call 40
Done call 41
Done call 42
Done call 43
Done call 44
Done call 45
Done call 46
Done call 47
Done call 48
Done call 49
Done call 50
Processed calls: 50

@scottoaks17
Copy link
Author

Did you inject the IOExceptions in the JerseyChunkedInput class as I suggested? Because I don't see that output in your test above; with the exceptions injected, I see this:

WARNING: A class javax.activation.DataSource for a default provider MessageBodyWriter<javax.activation.DataSource> was not found. The provider is not available.
Thread[pool-2-thread-6,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-6 response is null
Done call 1
Thread[pool-2-thread-6,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-6 response is null
Done call 2
Thread[pool-2-thread-1,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-1 response is null
Done call 3
Thread[pool-2-thread-3,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-3 response is null
Done call 6
Done call 7
Done call 8
Thread[pool-2-thread-3,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-3 response is null
Done call 9
Done call 10
Done call 11
Done call 12
Thread[pool-2-thread-1,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-1 response is null
Done call 13
Done call 14
Done call 15
Done call 16
Thread[pool-2-thread-1,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-1 response is null
Done call 17
Done call 18
Done call 19
Done call 20
Thread[pool-2-thread-3,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-3 response is null
Done call 21
Done call 22
Done call 23
Thread[pool-2-thread-1,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-1 response is null
Done call 24
Done call 25
Thread[pool-2-thread-4,5,main] async complete. Caught exception javax.ws.rs.ProcessingException: java.io.IOException: BOGUS BUFFER OVERFLOW
pool-2-thread-4 response is null
Done call 26
Done call 27

At that point the test hung for me.

Here also is a thread dump of when it occurs. At least part of the problem occurs because thread nioEventLoopGroup-2-4 is in an infinite loop withint he doFlush() method. The method will, as long as the channel is writable, attempt to read from the JerseyChunkedInput queue that had the IOException. Though pulling from the queue times out every 2 seconds, the logic in doFlush() is such that the loop doesn't exit. It would guess (but I'm quite unfamiliar with the code) that the channel ought to have been closed (and hence unwritable) as part of the IOException handling.

netty-jstack.txt

@sxsubram
Copy link

sxsubram commented Oct 2, 2024

exceptions seen in production logs -
javax.ws.rs.ProcessingException: java.io.IOException: Stream closed
javax.ws.rs.ProcessingException: java.util.concurrent.TimeoutException: Stream closed: read timeout
javax.ws.rs.ProcessingException: io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@1c3449d0(incomplete)
javax.ws.rs.ProcessingException: java.net.SocketException: Connection reset
javax.ws.rs.ProcessingException: java.io.IOException: Buffer overflow

@jbescos
Copy link
Member

jbescos commented Oct 3, 2024

I am reproducing it now.

It works if JerseyChunkedInput#write invokes #close when the exception happens.

jbescos added a commit to jbescos/jersey that referenced this issue Oct 3, 2024
…iting data eclipse-ee4j#5753

Signed-off-by: Jorge Bescos Gascon <jorge.bescos.gascon@oracle.com>
jbescos added a commit to jbescos/jersey that referenced this issue Oct 3, 2024
…iting data eclipse-ee4j#5753

Signed-off-by: Jorge Bescos Gascon <jorge.bescos.gascon@oracle.com>
jbescos added a commit to jbescos/jersey that referenced this issue Oct 4, 2024
…iting data eclipse-ee4j#5753

Signed-off-by: Jorge Bescos Gascon <jorge.bescos.gascon@oracle.com>
jbescos added a commit that referenced this issue Oct 4, 2024
…iting data #5753 (#5755)

Signed-off-by: Jorge Bescos Gascon <jorge.bescos.gascon@oracle.com>
@jbescos
Copy link
Member

jbescos commented Oct 7, 2024

Closing, as the fix was merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants