
HttpClient fails to recover after multiple ReadTimeoutException #8727

Closed
rbhambhra opened this issue Feb 7, 2023 · 8 comments
Labels: closed: notabug (The issue is not a bug)

@rbhambhra

Expected Behavior

If HttpClient encounters a high number of ReadTimeoutExceptions, then once the underlying service has recovered or the call volume has dropped, HttpClient should resume invoking the underlying service.

Actual Behaviour

Once HttpClient experiences a high volume of ReadTimeoutExceptions, it fails to establish connections and serve further requests. To recover, the application needs to be restarted.

Steps To Reproduce

The sample application has a single endpoint on a controller that accepts the number of concurrent calls to be made to a simple endpoint (a GET on https://httpstat.us/200).
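
For context, here is a rough sketch of the kind of declarative client and controller the reproducer appears to use. The names com.example.HttpStatusClient, LoadController and the Mono get() signature are taken from the log output further down; the actual code in the linked repo may differ, and the concurrency value is only illustrative.

package com.example;

import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import io.micronaut.http.annotation.QueryValue;
import io.micronaut.http.client.annotation.Client;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Declarative client for the backend used by the reproducer.
// (In the real project the client and controller would live in separate files.)
@Client("https://httpstat.us")
interface HttpStatusClient {

    @Get("/200")
    Mono<String> get();
}

// Controller that fans out `attempts` concurrent calls through the client.
@Controller("/load")
class LoadController {

    private final HttpStatusClient client;

    LoadController(HttpStatusClient client) {
        this.client = client;
    }

    @Get
    Mono<String> load(@QueryValue(defaultValue = "1") int attempts) {
        return Flux.range(0, attempts)
                // High concurrency floods the client's connection pool.
                .flatMap(i -> client.get(), 256)
                .then(Mono.just("done"));
    }
}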

  1. Clone the repo

  2. Start the application
    ./gradlew run

  3. Check backend service directly
    curl https://httpstat.us/200 -k

  4. Check a single call to backend via HttpClient
    curl http://localhost:8080/load

  5. Stress HttpClient (this typically times out in under a minute, but up the attempts if it doesn't)
    curl http://localhost:8080/load?attempts=500000

  6. Retry a single attempt as per step 4. Observe that it results in a ReadTimeoutException

  7. Retry step 3 to confirm that direct calls to the underlying service are working

Please note that once HttpClient starts failing, no TCP activity to httpstat.us is observed via Wireshark.

Logs once it starts failing

13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.h.server.netty.NettyHttpServer - Server localhost:8080 Received Request: GET /load?attempts=1
13:37:41.658 [default-nioEventLoopGroup-1-16] DEBUG i.m.h.s.netty.RoutingInBoundHandler - Request GET /load?attempts=1
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.h.s.netty.RoutingInBoundHandler - Matched route GET - /load to controller class com.example.LoadController
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.http.server.cors.CorsFilter - Http Header Origin not present. Proceeding with the request.
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Intercepted method [Mono get()] invocation on target: com.example.HttpStatusClient$Intercepted@5c8eee0f
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Proceeded to next interceptor [io.micronaut.retry.intercept.RecoveryInterceptor@478ee483] in chain for method invocation: Mono get()
13:37:41.659 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Proceeded to next interceptor [io.micronaut.http.client.interceptor.HttpClientIntroductionAdvice@4189d70b] in chain for method invocation: Mono get()
13:37:41.659 [default-nioEventLoopGroup-1-16] DEBUG i.m.c.e.ApplicationEventPublisher - Publishing event: io.micronaut.http.context.event.HttpRequestReceivedEvent[source=GET /load?attempts=1]
13:37:44.167 [parallel-6] ERROR i.m.http.server.RouteExecutor - Unexpected error occurred: Read Timeout
io.micronaut.http.client.exceptions.ReadTimeoutException: Read Timeout
    at io.micronaut.http.client.exceptions.ReadTimeoutException.<clinit>(ReadTimeoutException.java:26)
    at io.micronaut.http.client.netty.DefaultHttpClient$BaseHttpResponseHandler.exceptionCaught(DefaultHttpClient.java:2040)
    at io.micronaut.http.client.netty.DefaultHttpClient$FullHttpResponseHandler.exceptionCaught(DefaultHttpClient.java:2292)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
    at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
    at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
    at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
    at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:98)
    at io.netty.handler.timeout.ReadTimeoutHandler.channelIdle(ReadTimeoutHandler.java:90)
    at io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:503)
    at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:475)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)
13:37:44.174 [parallel-6] TRACE i.m.h.s.netty.RoutingInBoundHandler - Encoding emitted response object [Internal Server Error] using codec: io.micronaut.json.codec.JsonMediaTypeCodec@4167d97b
13:37:44.175 [parallel-6] DEBUG i.m.h.s.netty.RoutingInBoundHandler - Response 500 - GET /load?attempts=1
13:37:44.177 [default-nioEventLoopGroup-1-16] DEBUG i.m.c.e.ApplicationEventPublisher - Publishing event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]
13:37:44.177 [default-nioEventLoopGroup-1-16] TRACE i.m.c.e.ApplicationEventPublisher - Established event listeners [io.micronaut.runtime.http.scope.RequestCustomScope@17593fa] for event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]
13:37:44.177 [default-nioEventLoopGroup-1-16] TRACE i.m.c.e.ApplicationEventPublisher - Invoking event listener [io.micronaut.runtime.http.scope.RequestCustomScope@17593fa] for event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]

Environment Information

Windows 11
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

Example Application

https://github.com/rbhambhra/httpclient

Version

3.8.3

@lordbuddha

FYI. Same behavior on Mac OS and Linux.

@lordbuddha

I did a variant of this project using the Jetty 11.x HttpClient in async mode, under a /load2 controller.

It got the same performance, but no breakages.

@graemerocher
Contributor

@yawkat can you investigate?

@yawkat
Member

yawkat commented Feb 8, 2023

Okay from what I can tell this is because the netty channel pool tries to acquire all these connections, and they just get stuck in the queue in SimpleChannelPool. The connection requests are cancelled because of the timeout, but SimpleChannelPool does not handle this. So it takes a while to work through all the timeouts.

Unfortunately I don't see a way to hook SimpleChannelPool to fix this issue. There would need to be a listener on the acquire promise that cancels the connectChannel future if the acquire promise is cancelled. But that is all on the netty side.
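
For illustration, a minimal sketch of the missing wiring described above. This is hypothetical: SimpleChannelPool creates both the acquire promise and the connect future internally and exposes no hook like this, which is why it could only be fixed on the Netty side.

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.util.concurrent.FutureListener;
import io.netty.util.concurrent.Promise;

final class AcquireCancellationSketch {

    // Roughly what acquiring a fresh connection for a pending acquire looks like,
    // plus the cancellation propagation that SimpleChannelPool is missing.
    static void acquireNewChannel(Bootstrap bootstrap, Promise<Channel> acquirePromise) {
        // The pool starts a connection attempt for the pending acquire
        // (the bootstrap is assumed to have a remote address configured).
        ChannelFuture connectFuture = bootstrap.connect();

        // Missing piece: if the acquire promise is cancelled (e.g. because the
        // caller's timeout fired), cancel the connection attempt as well so
        // stale acquires do not keep the pool busy.
        acquirePromise.addListener((FutureListener<Channel>) acquired -> {
            if (acquired.isCancelled()) {
                connectFuture.cancel(false);
            }
        });

        // Complete the acquire once the connection attempt finishes.
        connectFuture.addListener((ChannelFutureListener) connected -> {
            if (connected.isSuccess()) {
                if (!acquirePromise.trySuccess(connected.channel())) {
                    // The acquire was already cancelled; close the now-unwanted channel.
                    connected.channel().close();
                }
            } else {
                acquirePromise.tryFailure(connected.cause());
            }
        });
    }
}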

With Micronaut 4 and the changes in #8100 this is fixed. It needs to be reevaluated with #8179 since that PR affects request cancellation, but I think it should still work fine because for this issue the requests are cancelled before filters ever run.

@lordbuddha

Thanks. Ouch. I assume there is no plan to back-port fixes from 4.x to 3.x.
Is there a rough time-frame for the 4.0 release? (We would actually wait for at least 4.0.1.)
In the meantime, because of the terminal consequence of this, we are working on switching to the alternative Jetty HttpClient.

@yawkat
Member

yawkat commented Feb 10, 2023

Unfortunately the changes in 4.0 that fix this issue are pretty wide-reaching and not compatible enough for a minor release.

You can try adding a low acquire timeout. It seems that this makes recovery a bit faster.
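
For reference, something along these lines in application.yml lowers the relevant timeouts for the default client configuration. The property names are taken from Micronaut 3's HttpClientConfiguration and its pool sub-configuration; the values are placeholders to experiment with, not recommendations.

micronaut:
  http:
    client:
      read-timeout: 3s          # fail reads quickly instead of queueing work behind them
      connect-timeout: 3s       # a long connect timeout keeps the pool blocked (see the next comment)
      pool:
        enabled: true
        max-connections: 50
        acquire-timeout: 1500ms # give up quickly when waiting for a pooled connection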

@graemerocher
Contributor

Using a connect-timeout of 10s seems to be the problem to me. You are flooding the connection pool with attempts to create a connection that all have to time out. It will eventually recover, but it has to wait for all of these connection acquisition attempts to time out first.

@rbhambhra
Author

@yawkat and @graemerocher

With a high connect timeout, calls to our application fill up the heap because calls through the HTTP client take so long. This actually (inadvertently) works to our advantage, as within a few minutes we run out of memory and the instance is replaced by our cloud provider.

I have tried a low timeout (1.5s) as well as a circuit breaker. The application no longer crashes, BUT the HTTP client does not recover (in a test environment the HTTP client did not recover despite waiting 4 days). So a lack of resilience actually protects us.

Going forward, I am going to work with @lordbuddha and use the Jetty HTTP client with resilience4j (retry, circuit breaker and time limiter).
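
For reference, a minimal sketch of that kind of decoration using resilience4j's Decorators helper (from resilience4j-all). The names are hypothetical, and the time limiter and the actual Jetty client call are omitted.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

final class ResilientBackendCall {

    // callBackend stands in for the (blocking) Jetty HttpClient call.
    static Supplier<String> decorate(Supplier<String> callBackend) {
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("httpstat");
        Retry retry = Retry.ofDefaults("httpstat");

        // Retry is applied last, so it wraps the circuit breaker and every
        // retry attempt is recorded by the breaker.
        return Decorators.ofSupplier(callBackend)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .decorate();
    }
}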

FYI - While I can replicate this with an insane number of calls in a test environment, we have only seen this occur in our production environment twice in the last year. Load on our production system is not at extreme levels (about 60k requests/min distributed across 3 quad-core servers).

thanks

@graemerocher added the closed: notabug (The issue is not a bug) label on Oct 20, 2023
@yawkat closed this as not planned on Nov 6, 2023