
HttpClient fails to recover after multiple ReadTimeoutException #8727

Closed
rbhambhra opened this issue Feb 7, 2023 · 8 comments
Labels: closed: notabug (The issue is not a bug)

@rbhambhra

Expected Behavior

If HttpClient encounters a high number of ReadTimeoutExceptions, then once the underlying service has recovered or the call volume has dropped, HttpClient should resume invoking the underlying service.

Actual Behaviour

Once HttpClient experiences a high volume of ReadTimeoutExceptions, it fails to establish connections and serve further requests. To recover, the application needs to be restarted.

Steps To Reproduce

The sample application has a single endpoint on a controller that accepts the number of concurrent calls to be made to a simple endpoint (a GET on https://httpstat.us/200).
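
For context, here is a rough sketch of the kind of declarative client and controller the reproducer appears to use. The names com.example.HttpStatusClient, LoadController and the Mono get() signature are taken from the log output further down; the actual code in the linked repo may differ, and the concurrency value is only illustrative.

package com.example;

import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import io.micronaut.http.annotation.QueryValue;
import io.micronaut.http.client.annotation.Client;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Declarative client for the backend used by the reproducer.
// (In the real project the client and controller would live in separate files.)
@Client("https://httpstat.us")
interface HttpStatusClient {

    @Get("/200")
    Mono<String> get();
}

// Controller that fans out `attempts` concurrent calls through the client.
@Controller("/load")
class LoadController {

    private final HttpStatusClient client;

    LoadController(HttpStatusClient client) {
        this.client = client;
    }

    @Get
    Mono<String> load(@QueryValue(defaultValue = "1") int attempts) {
        return Flux.range(0, attempts)
                // High concurrency floods the client's connection pool.
                .flatMap(i -> client.get(), 256)
                .then(Mono.just("done"));
    }
}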

  1. Clone the repo

  2. Start the application
    ./gradlew run

  3. Check backend service directly
    curl https://httpstat.us/200 -k

  4. Check a single call to backend via HttpClient
    curl http://localhost:8080/load

  5. Stress HttpClient (this typically times out in under a minute, but up the attempts if it doesn't)
    curl http://localhost:8080/load?attempts=500000

  6. Retry a single attempt as per step 4. Observe that it results in a ReadTimeoutException

  7. Retry step 3 to confirm that direct calls to the underlying service are working

Please note that once HttpClient starts failing, no TCP activity to httpstat.us is observed via Wireshark.

Logs once it starts failing

13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.h.server.netty.NettyHttpServer - Server localhost:8080 Received Request: GET /load?attempts=1
13:37:41.658 [default-nioEventLoopGroup-1-16] DEBUG i.m.h.s.netty.RoutingInBoundHandler - Request GET /load?attempts=1
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.h.s.netty.RoutingInBoundHandler - Matched route GET - /load to controller class com.example.LoadController
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.http.server.cors.CorsFilter - Http Header Origin not present. Proceeding with the request.
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Intercepted method [Mono get()] invocation on target: com.example.HttpStatusClient$Intercepted@5c8eee0f
13:37:41.658 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Proceeded to next interceptor [io.micronaut.retry.intercept.RecoveryInterceptor@478ee483] in chain for method invocation: Mono get()
13:37:41.659 [default-nioEventLoopGroup-1-16] TRACE i.m.aop.chain.InterceptorChain - Proceeded to next interceptor [io.micronaut.http.client.interceptor.HttpClientIntroductionAdvice@4189d70b] in chain for method invocation: Mono get()
13:37:41.659 [default-nioEventLoopGroup-1-16] DEBUG i.m.c.e.ApplicationEventPublisher - Publishing event: io.micronaut.http.context.event.HttpRequestReceivedEvent[source=GET /load?attempts=1]
13:37:44.167 [parallel-6] ERROR i.m.http.server.RouteExecutor - Unexpected error occurred: Read Timeout
io.micronaut.http.client.exceptions.ReadTimeoutException: Read Timeout
    at io.micronaut.http.client.exceptions.ReadTimeoutException.<clinit>(ReadTimeoutException.java:26)
    at io.micronaut.http.client.netty.DefaultHttpClient$BaseHttpResponseHandler.exceptionCaught(DefaultHttpClient.java:2040)
    at io.micronaut.http.client.netty.DefaultHttpClient$FullHttpResponseHandler.exceptionCaught(DefaultHttpClient.java:2292)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
    at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
    at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
    at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
    at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
    at io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:98)
    at io.netty.handler.timeout.ReadTimeoutHandler.channelIdle(ReadTimeoutHandler.java:90)
    at io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:503)
    at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:475)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)
13:37:44.174 [parallel-6] TRACE i.m.h.s.netty.RoutingInBoundHandler - Encoding emitted response object [Internal Server Error] using codec: io.micronaut.json.codec.JsonMediaTypeCodec@4167d97b
13:37:44.175 [parallel-6] DEBUG i.m.h.s.netty.RoutingInBoundHandler - Response 500 - GET /load?attempts=1
13:37:44.177 [default-nioEventLoopGroup-1-16] DEBUG i.m.c.e.ApplicationEventPublisher - Publishing event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]
13:37:44.177 [default-nioEventLoopGroup-1-16] TRACE i.m.c.e.ApplicationEventPublisher - Established event listeners [io.micronaut.runtime.http.scope.RequestCustomScope@17593fa] for event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]
13:37:44.177 [default-nioEventLoopGroup-1-16] TRACE i.m.c.e.ApplicationEventPublisher - Invoking event listener [io.micronaut.runtime.http.scope.RequestCustomScope@17593fa] for event: io.micronaut.http.context.event.HttpRequestTerminatedEvent[source=GET /load?attempts=1]

Environment Information

Windows 11
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

Example Application

https://github.com/rbhambhra/httpclient

Version

3.8.3

@lordbuddha

FYI. Same behavior on Mac OS and Linux.

@lordbuddha

I did a variant of this project using the Jetty 11.x HttpClient in async mode, under a /load2 controller.

It got the same performance, but no breakages.

@graemerocher
Contributor

@yawkat can you investigate?

@yawkat
Member

yawkat commented Feb 8, 2023

Okay from what I can tell this is because the netty channel pool tries to acquire all these connections, and they just get stuck in the queue in SimpleChannelPool. The connection requests are cancelled because of the timeout, but SimpleChannelPool does not handle this. So it takes a while to work through all the timeouts.

Unfortunately I don't see a way to hook SimpleChannelPool to fix this issue. There would need to be a listener on the acquire promise that cancels the connectChannel future if the acquire promise is cancelled. But that is all on the netty side.
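
For illustration, a minimal sketch of the missing wiring described above. This is hypothetical: SimpleChannelPool creates both the acquire promise and the connect future internally and exposes no hook like this, which is why it could only be fixed on the Netty side.

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.util.concurrent.FutureListener;
import io.netty.util.concurrent.Promise;

final class AcquireCancellationSketch {

    // Roughly what acquiring a fresh connection for a pending acquire looks like,
    // plus the cancellation propagation that SimpleChannelPool is missing.
    static void acquireNewChannel(Bootstrap bootstrap, Promise<Channel> acquirePromise) {
        // The pool starts a connection attempt for the pending acquire
        // (the bootstrap is assumed to have a remote address configured).
        ChannelFuture connectFuture = bootstrap.connect();

        // Missing piece: if the acquire promise is cancelled (e.g. because the
        // caller's timeout fired), cancel the connection attempt as well so
        // stale acquires do not keep the pool busy.
        acquirePromise.addListener((FutureListener<Channel>) acquired -> {
            if (acquired.isCancelled()) {
                connectFuture.cancel(false);
            }
        });

        // Complete the acquire once the connection attempt finishes.
        connectFuture.addListener((ChannelFutureListener) connected -> {
            if (connected.isSuccess()) {
                if (!acquirePromise.trySuccess(connected.channel())) {
                    // The acquire was already cancelled; close the now-unwanted channel.
                    connected.channel().close();
                }
            } else {
                acquirePromise.tryFailure(connected.cause());
            }
        });
    }
}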

With Micronaut 4 and the changes in #8100 this is fixed. It needs to be reevaluated with #8179 since that PR affects request cancellation, but I think it should still work fine because for this issue the requests are cancelled before filters ever run.

@lordbuddha

Thanks. Ouch. I assume there is no plan to back-port fixes from 4.x to 3.x.
Is there a rough time-frame for the 4.0 release? (We would actually wait for at least 4.0.1.)
In the meantime, because of the terminal consequence of this, we are working on switching to the alternative Jetty HttpClient.

@yawkat
Member

yawkat commented Feb 10, 2023

Unfortunately the changes in 4.0 that fix this issue are pretty wide-reaching and not compatible enough for a minor release.

You can try adding a low acquire timeout. It seems that this makes recovery a bit faster.
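
For reference, something along these lines in application.yml lowers the relevant timeouts for the default client configuration. The property names are taken from Micronaut 3's HttpClientConfiguration and its pool sub-configuration; the values are placeholders to experiment with, not recommendations.

micronaut:
  http:
    client:
      read-timeout: 3s          # fail reads quickly instead of queueing work behind them
      connect-timeout: 3s       # a long connect timeout keeps the pool blocked (see the next comment)
      pool:
        enabled: true
        max-connections: 50
        acquire-timeout: 1500ms # give up quickly when waiting for a pooled connection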

@graemerocher
Contributor

Using a connect-timeout of 10s seems to be the problem to me. You are flooding the connection pool with attempts to create a connection that all have to time out. It will eventually recover, but it has to wait for all of these connection acquisition attempts to time out first.

@rbhambhra
Author

@yawkat and @graemerocher

With a high connect timeout, calls to our application fill up the heap because calls through the HTTP client take so long. This actually (inadvertently) works to our advantage, as within a few minutes we run out of memory and the instance is replaced by our cloud provider.

I have tried a low timeout (1.5s) as well as a circuit breaker. The application no longer crashes, BUT the HTTP client does not recover (in a test environment the HTTP client did not recover despite waiting 4 days). So a lack of resilience actually protects us.

Going forward, I am going to work with @lordbuddha and use the Jetty HTTP client with resilience4j (retry, circuit breaker and time limiter).
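
For reference, a minimal sketch of that kind of decoration using resilience4j's Decorators helper (from resilience4j-all). The names are hypothetical, and the time limiter and the actual Jetty client call are omitted.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

final class ResilientBackendCall {

    // callBackend stands in for the (blocking) Jetty HttpClient call.
    static Supplier<String> decorate(Supplier<String> callBackend) {
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("httpstat");
        Retry retry = Retry.ofDefaults("httpstat");

        // Retry is applied last, so it wraps the circuit breaker and every
        // retry attempt is recorded by the breaker.
        return Decorators.ofSupplier(callBackend)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .decorate();
    }
}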

FYI - While I can replicate this with an insane number of calls in a test environment, we have only seen this occur in our production environment twice in the last year. Load on our production system is not at extreme levels (about 60k requests/min distributed across 3 quad-core servers).

thanks

@graemerocher added the closed: notabug (The issue is not a bug) label on Oct 20, 2023
@yawkat closed this as not planned on Nov 6, 2023