curl -X GET 'http://192.168.0.208:9200/_cat/recovery?active_only=true'
# empty response
The node log shows that the shard-started action is rejected by the master node because its circuit breaker trips:
[2023-01-10T07:45:10,145][WARN ][o.e.c.a.s.ShardStateAction] [prod-es-ess-esn-2-1] unexpected failure while sending request [internal:cluster/shard/started] to [{prod-es-ess-esn-3-1}{J3pDbQD9TRayTunq585UDg}{WSW5HgtPSiGCzC2_PONNJg}{192.168.0.164}{192.168.0.164:9300}{dimr}] for shard entry [StartedShardEntry{shardId [[2059-bpmbussprojectdmg_209-bpmm209_bpdmgmodel_219_log_replica_sdm_archive_es-20221216093814][2]], allocationId [HzHB6jpSSKajaJK5De3kNw], primary term [3], message [after peer recovery]}]
org.elasticsearch.transport.RemoteTransportException: [prod-es-ess-esn-3-1][192.168.0.164:9300][internal:cluster/shard/started]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/shard/started] would be [8172567858/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8172567536/7.6gb], new bytes reserved: [322/322b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=322/322b, accounting=25496308/24.3mb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:364) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:109) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:211) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:120) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:77) ~[?:?]
at org.elasticsearch.transport.netty4.Netty4HeartBeatChannelHandler.channelRead(Netty4HeartBeatChannelHandler.java:40) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371) ~[?:?]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) ~[?:?]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
I have to run a reroute with retry_failed to get the shard initialization moving again, but after a while it gets stuck again...
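For readers who want to script the workaround described above, here is a minimal sketch that builds the request for the cluster reroute API with `retry_failed=true`. The helper name `build_retry_failed_request` is hypothetical, and the host is simply the one from the curl command in this report; the sketch only constructs the request and does not send it.

```python
import urllib.request


def build_retry_failed_request(base_url):
    """Build (but do not send) a POST to the cluster reroute API that
    asks Elasticsearch to retry shard allocations that previously failed."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


# Host taken from the curl command in this report; adjust for your cluster.
req = build_retry_failed_request("http://192.168.0.208:9200")
print(req.get_method(), req.full_url)
# prints: POST http://192.168.0.208:9200/_cluster/reroute?retry_failed=true

# To actually send it (assumes the HTTP port is reachable without auth):
# urllib.request.urlopen(req, timeout=10)
```

Note that this only retries the allocation. If the master's heap is still above the breaker limit, the next shard-started message may well be rejected again, which matches the behaviour reported here.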
Circuit-breaking on these messages is the correct behaviour: if the master is overloaded, we want to push back on the rest of the cluster. If we didn't, the master would just go OOM.
Moreover, quoting the bug report form:
Please also check your OS is supported, and that the version of Elasticsearch has not passed end-of-life. If you are using an unsupported OS or an unsupported version then the issue is likely to be closed.
You are using 7.10.2 which is long past EOL, and newer versions are much less likely to circuit-break on the master (see e.g. #77466). Therefore I am closing this.
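To make the numbers in the `CircuitBreakingException` above concrete, the following sketch redoes the parent-breaker arithmetic with the values copied from the log. It is a simplified model of the check, not the Elasticsearch implementation: the function name is hypothetical, and the 8 GiB heap with a 95% limit is an inference from the reported limit rather than something stated in the report.

```python
def parent_breaker_would_trip(real_usage_bytes, new_bytes, limit_bytes):
    """Simplified model: the request is rejected when current heap usage
    plus the newly reserved bytes exceeds the parent limit."""
    return real_usage_bytes + new_bytes > limit_bytes


# Values copied from the exception message in the log.
REAL_USAGE = 8_172_567_536   # "real usage"
NEW_BYTES = 322              # "new bytes reserved"
LIMIT = 8_160_437_862        # "limit of"

projected = REAL_USAGE + NEW_BYTES
print(projected)                      # 8172567858, the "would be" value in the log
print(REAL_USAGE - LIMIT)             # 12129674 bytes over the limit before the message
print(parent_breaker_would_trip(REAL_USAGE, NEW_BYTES, LIMIT))  # True

# Assumption: an 8 GiB heap with a 95% parent limit reproduces the logged limit.
ASSUMED_HEAP = 8 * 1024**3
print(ASSUMED_HEAP * 95 // 100 == LIMIT)  # True
```

The master's heap was already about 12 MB above the limit before the 322-byte message arrived, so the message size is not what tripped the breaker. Exempting this action from the breaker would not relieve the heap pressure on the master.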
Elasticsearch Version
7.x
Installed Plugins
No response
Java Version
bundled
OS Version
Linux HOSTNAME 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:30:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Problem Description
My cluster is stuck with shards in the initializing state,
but the cat recovery API returns nothing:
The node log shows that the shard-started action is rejected by the master node because its circuit breaker trips:
I have to run a reroute with retry_failed to get the shard initialization moving again, but after a while it gets stuck again...
I think internal actions such as shard started/failed should not trigger the circuit breaker. We should change this code https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java#L100 to
Steps to Reproduce
The issue can be reproduced on a cluster with high JVM heap usage.
Logs (if relevant)
No response