Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] opensearch crashes on closed client connection before search reply when total ops higher compared to expected #3962

Closed
dbonf opened this issue Jul 20, 2022 · 5 comments · Fixed by #4143
Labels
backport 1.3 Backport to 1.3 branch backport 2.0 Backport to 2.0 branch backport 2.1 backport 2.2 Backport to 2.2 branch bug Something isn't working Indexing & Search v2.2.0 v2.3.0 'Issues and PRs related to version v2.3.0' v3.0.0 Issues and PRs related to version 3.0.0

Comments

@dbonf
Copy link

dbonf commented Jul 20, 2022

Describe the bug
Issue described in #3557 is still there also with opensearch 1.3.4.

In some circumstances, when running a search operation, if the client TCP connection is closed before the search operation is completed, opensearch crashes with:

[2022-07-20T17:51:36,145][ERROR][org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-3][search][T#8]], exiting
java.lang.AssertionError: unexpected higher total ops [15] compared to expected [14]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:465) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303) ~[opensearch-1.3.4.jar:1.3.4]

The problem was supposedly solved with #3626 but we are still affected by it. The fix addresses the problem when totalOps == expectedTotalOps but not when totalOps > expectedTotalOps, when the exception is still unhandled and lead to a crash.

To Reproduce
Steps to reproduce the behaviour:

We can consistently reproduce this behaviour in several environments.

  1. run the long running search query, e.g. curl -s "http://localhost:9200/myindex*/_search (this is the same with any client/library)
  2. kill the client connection after a couple of seconds, e.g stop the curl command with ctrl+c (same as a connection that is closed by the client after a timeout, before the reply from the OS node)
  3. the opensearch node crashes

Expected behavior
No crashes.

Plugins

  • repository-s3
  • repository-gcs

Screenshots
no applicable

Host/Environment (please complete the following information):

  • OS: Red Hat Enterprise Linux 8 (in a container, UBI minimal)
  • Version: Opensearch 1.3.4

Additional context
more logs:

[2022-07-20T17:51:36,145][ERROR][org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-3][search][T#8]], exiting
java.lang.AssertionError: unexpected higher total ops [15] compared to expected [14]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:465) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.3.4.jar:1.3.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.opensearch.action.search.SearchPhaseExecutionException: Shard failures
	... 10 more
Caused by: org.opensearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:731) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:826) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:196) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:124) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:781) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:310) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:249) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:215) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.TransportSearchAction$5.run(TransportSearchAction.java:1131) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:409) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:685) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:458) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:82) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:291) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:72) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:772) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.SearchTransportService.sendCanMatch(SearchTransportService.java:149) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.CanMatchPreFilterSearchPhase.executePhaseOnShard(CanMatchPreFilterSearchPhase.java:128) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:312) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:469) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:82) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:291) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:72) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TransportService$6.handleException(TransportService.java:742) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:308) ~[?:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1357) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:415) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296) ~[opensearch-1.3.4.jar:1.3.4]
TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
	at org.opensearch.action.support.TransportAction.registerChildNode(TransportAction.java:70)
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:89)
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
	at org.opensearch.client.support.AbstractClient.search(AbstractClient.java:569)
	at org.opensearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:175)
	at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:198)
	at org.opensearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:183)
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:112)
	at org.opensearch.action.ActionListener$5.onFailure(ActionListener.java:273)
	at org.opensearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:676)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:650)
	at org.opensearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:126)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:413) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:405) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:153) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:109) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110) ~[opensearch-1.3.4.jar:1.3.4]
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1373) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	... 1 more
TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
	at org.opensearch.action.support.TransportAction.registerChildNode(TransportAction.java:70)
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:89)
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
	at org.opensearch.client.support.AbstractClient.search(AbstractClient.java:569)
	at org.opensearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:175)
	at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:198)
	at org.opensearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:183)
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:112)
	at org.opensearch.action.ActionListener$5.onFailure(ActionListener.java:273)
	at org.opensearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:676)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:650)
	at org.opensearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:126)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
	at org.opensearch.action.support.TransportAction.registerChildNode(TransportAction.java:70)
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:89)
fatal error in thread [opensearch[mycluster-elasticsearch-3][search][T#11]], exiting
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
	at org.opensearch.client.support.AbstractClient.search(AbstractClient.java:569)
	at org.opensearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:175)
	at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:198)
	at org.opensearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:183)
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:112)
	at org.opensearch.action.ActionListener$5.onFailure(ActionListener.java:273)
	at org.opensearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:676)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:650)
	at org.opensearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:126)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
java.lang.AssertionError: unexpected higher total ops [15] compared to expected [14]
fatal error in thread [opensearch[mycluster-elasticsearch-3][search][T#8]], exiting
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:465)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303)
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: Failed to execute phase [query], Shard failures; shardFailures {[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.20][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[PqhBStI8SFy7PQORdpuyYw][myindex-2022.07.19][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.07.18][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.07.17][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.16][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[wSteezs3T3qBVWaXUJQwTg][myindex-2022.07.15][0]: NodeNotConnectedException[[mycluster-elasticsearch-2][172.16.19.33:9300] Node not connected]}{[wSteezs3T3qBVWaXUJQwTg][myindex-2022.07.14][0]: NodeNotConnectedException[[mycluster-elasticsearch-2][172.16.19.33:9300] Node not connected]}{[wSteezs3T3qBVWaXUJQwTg][myindex-2022.07.13][0]: NodeNotConnectedException[[mycluster-elasticsearch-2][172.16.19.33:9300] Node not connected]}{[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.12][0]: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]}
	... 10 more
Caused by: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:731)
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838)
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:826)
	at org.opensearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:196)
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:124)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278)
	at org.opensearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:781)
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:310)
	at org.opensearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:249)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415)
	at org.opensearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:215)
	at org.opensearch.action.search.TransportSearchAction$5.run(TransportSearchAction.java:1131)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:409)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:685)
	at org.opensearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:567)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResultConsumed(AbstractSearchAsyncAction.java:554)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$onShardResult$8(AbstractSearchAsyncAction.java:537)
	at org.opensearch.action.search.CanMatchPreFilterSearchPhase$CanMatchSearchPhaseResults.consumeResult(CanMatchPreFilterSearchPhase.java:221)
	at org.opensearch.action.search.CanMatchPreFilterSearchPhase$CanMatchSearchPhaseResults.consumeResult(CanMatchPreFilterSearchPhase.java:205)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:537)
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:282)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:57)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
	at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:736)
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:302)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1347)
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:388)
	at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:380)
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:155)
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:109)
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759)
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170)
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145)
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1373)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	... 1 more
java.lang.AssertionError: unexpected higher total ops [15] compared to expected [14]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:465)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303)
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: Failed to execute phase [query], Shard failures; shardFailures {[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.20][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[PqhBStI8SFy7PQORdpuyYw][myindex-2022.07.19][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.07.18][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.07.17][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.15][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[wSteezs3T3qBVWaXUJQwTg][myindex-2022.07.14][0]: NodeNotConnectedException[[mycluster-elasticsearch-2][172.16.19.33:9300] Node not connected]}{[PqhBStI8SFy7PQORdpuyYw][myindex-2022.07.13][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.12][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RbCmgI_7QdeNxfoPnd-lZQ][myindex-2022.07.16][0]: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]}
	... 10 more
Caused by: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:731)
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838)
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:826)
	at org.opensearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:196)
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:124)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278)
	at org.opensearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:781)
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:310)
	at org.opensearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:249)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415)
	at org.opensearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:215)
	at org.opensearch.action.search.TransportSearchAction$5.run(TransportSearchAction.java:1131)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:415)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:409)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:685)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:458)
	at org.opensearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:82)
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:291)
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:72)
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:772)
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838)
	at org.opensearch.action.search.SearchTransportService.sendCanMatch(SearchTransportService.java:149)
	at org.opensearch.action.search.CanMatchPreFilterSearchPhase.executePhaseOnShard(CanMatchPreFilterSearchPhase.java:128)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278)
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:312)
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:469)
	at org.opensearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:82)
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:291)
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:72)
	at org.opensearch.transport.TransportService$6.handleException(TransportService.java:742)
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:308)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1357)
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:415)
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296)
	at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:413)
	at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:405)
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:153)
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:109)
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759)
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170)
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145)
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1373)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	... 1 more
@dbonf dbonf added bug Something isn't working untriaged labels Jul 20, 2022
@dbonf dbonf changed the title [BUG] opensearch crashes on closed client connection before search reply when total ops is higher compared to expected [BUG] opensearch crashes on closed client connection before search reply when total ops higher compared to expected Jul 20, 2022
@andrross
Copy link
Member

@reta Just FYI, since you worked on the original issue.

@dblock
Copy link
Member

dblock commented Jul 21, 2022

@dbonf Want to try to turn this into a unit test and artificially cause totalOps > expectedTotalOps (and a failure)?

@reta
Copy link
Collaborator

reta commented Jul 24, 2022

@dbonf thanks for details, I will take a look shortly

The fix addresses the problem when totalOps == expectedTotalOps but not when totalOps > expectedTotalOps, when the exception is still unhandled and lead to a crash.

The totalOps > expectedTotalOps should never happen, but I believe there is several flows which cause "over counting" (nested calls?) which I will try to reconstruct.

@dbonf Want to try to turn this into a unit test and artificially cause totalOps > expectedTotalOps (and a failure)?

@dblock I added a test case as part of [1] which I thought is the same problem (turned out there is more than one flow). I will try to reproduce the exact failure (as per stack trace) although it is not easy.

[1] #3626

@dbonf
Copy link
Author

dbonf commented Jul 25, 2022

@reta for example with few indices with one replica, two shards each and several thousand documents, few nodes, this query (designed to be slow to trigger the bug):

curl -s 'https://mycluster/myindex*/_search?pretty'  -H 'Content-Type: application/json' -d'
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "field_1": "xxx-xxxx-[a-z].*"
          }
        },
        {
          "regexp": {
            "field_2": ".*xyz.*"
          }
        },
        {
          "regexp": {
              "field_3": ".*re.*abcde.*"
            }
        }
      ],
      "filter": [
        {
          "range": {
            "field_4": {
              "gte": 0
            }
          }
        },
        {
          "range": {
            "field_4": {
              "lte": 1654793591000000
            }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "field_1",
    "inner_hits": {
      "name": "most_recent",
      "size": 1,
      "sort": [
        {
          "field_1": "desc"
        }
      ]
    }
  }
}
'

killed before the response triggers the crash, this is especially problematic when a client has retries in its logic so will kill many nodes in few seconds.

mbovo added a commit to draios/OpenSearch that referenced this issue Jul 25, 2022
@reta
Copy link
Collaborator

reta commented Aug 4, 2022

@dbonf I think I deciphered the flow, what is happening:

  • when all shards are handled (fail or success, doesn't really matter), onPhaseDone() is called
  • at some point, onPhaseDone() calls the SearchTransportService.sendCanMatch(), which fails before the request is send (task registration)
  • this exception is handled in different thread (forked, invisible in a sense to onPhaseDone()) which calls onShardFailure again, overflowing the operations counter

Working on the test and thinking about the fix (not obvious at the moment).

@reta reta added v2.3.0 'Issues and PRs related to version v2.3.0' backport 2.0 Backport to 2.0 branch backport 1.3 Backport to 1.3 branch backport 2.1 backport 2.2 Backport to 2.2 branch v3.0.0 Issues and PRs related to version 3.0.0 v2.2.0 labels Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.3 Backport to 1.3 branch backport 2.0 Backport to 2.0 branch backport 2.1 backport 2.2 Backport to 2.2 branch bug Something isn't working Indexing & Search v2.2.0 v2.3.0 'Issues and PRs related to version v2.3.0' v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
4 participants