Internal action like shard started/failure should not trigger circuit breaker #92783

Closed
xiaoyuan0821 opened this issue Jan 10, 2023 · 1 comment
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@xiaoyuan0821

Elasticsearch Version

7.x

Installed Plugins

No response

Java Version

bundled

OS Version

Linux HOSTNAME 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:30:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I have a cluster stuck with shards in the initializing state:

{
  "cluster_name" : "prod-es",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 33612,
  "active_shards" : 61211,
  "relocating_shards" : 0,
  "initializing_shards" : 41,
  "unassigned_shards" : 5961,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 91.07017987591686
}
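
The reported percentage can be checked against the shard counts above. This is a minimal sketch, assuming the percentage is active shards divided by the sum of active, initializing, and unassigned shards; the variable names are illustrative only.

```python
# Counts copied from the cluster health output above.
health = {
    "number_of_data_nodes": 3,
    "active_shards": 61211,
    "initializing_shards": 41,
    "unassigned_shards": 5961,
}

# Assumption: total = active + initializing + unassigned.
total_shards = (
    health["active_shards"]
    + health["initializing_shards"]
    + health["unassigned_shards"]
)
active_percent = 100.0 * health["active_shards"] / total_shards

# Average number of active shard copies held by each data node.
shards_per_data_node = health["active_shards"] / health["number_of_data_nodes"]

print(f"total shards: {total_shards}")
print(f"active percent: {active_percent:.4f}")
print(f"active shards per data node: {shards_per_data_node:.0f}")
```

Under that assumption the result agrees with `active_shards_percent_as_number`, and it also shows that each of the three data nodes holds roughly twenty thousand active shard copies.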

But the cat recovery API returns nothing:

curl -X GET 'http://192.168.0.208:9200/_cat/recovery?active_only=true'
# empty response

The node log shows that the shard started action is rejected by the master node because of the circuit breaker:

[2023-01-10T07:45:10,145][WARN ][o.e.c.a.s.ShardStateAction] [prod-es-ess-esn-2-1] unexpected failure while sending request [internal:cluster/shard/started] to [{prod-es-ess-esn-3-1}{J3pDbQD9TRayTunq585UDg}{WSW5HgtPSiGCzC2_PONNJg}{192.168.0.164}{192.168.0.164:9300}{dimr}] for shard entry [StartedShardEntry{shardId [[2059-bpmbussprojectdmg_209-bpmm209_bpdmgmodel_219_log_replica_sdm_archive_es-20221216093814][2]], allocationId [HzHB6jpSSKajaJK5De3kNw], primary term [3], message [after peer recovery]}]
org.elasticsearch.transport.RemoteTransportException: [prod-es-ess-esn-3-1][192.168.0.164:9300][internal:cluster/shard/started]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/shard/started] would be [8172567858/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8172567536/7.6gb], new bytes reserved: [322/322b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=322/322b, accounting=25496308/24.3mb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:364) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:109) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:211) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:120) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:77) ~[?:?]
	at org.elasticsearch.transport.netty4.Netty4HeartBeatChannelHandler.channelRead(Netty4HeartBeatChannelHandler.java:40) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
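
The numbers in the exception message can be checked by hand. The sketch below only redoes the arithmetic from the log line; it is not the Elasticsearch implementation. The 95% figure used to back out the heap size is an assumption about how the parent limit is configured on this cluster.

```python
# Values copied from the CircuitBreakingException message above.
real_usage = 8_172_567_536      # "real usage"
new_bytes_reserved = 322        # "new bytes reserved"
parent_limit = 8_160_437_862    # "limit of"

# The parent breaker compares current usage plus the incoming request with the limit.
would_be = real_usage + new_bytes_reserved
trips = would_be > parent_limit
overshoot_mib = (would_be - parent_limit) / 2**20

# Assumption: the parent limit is 95% of the heap, which would imply an 8 GiB heap.
implied_heap_gib = parent_limit / 0.95 / 2**30

print(would_be, trips)
print(f"over the limit by {overshoot_mib:.1f} MiB")
print(f"implied heap: {implied_heap_gib:.2f} GiB")
```

The heap usage on the master is already above the limit before the 322-byte message arrives, so the size of the shard started request itself is not what trips the breaker.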

I have to run a reroute with retry failed to get shard initialization going again, but after a while it gets stuck again.
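
For reference, the workaround can be scripted. This is a minimal sketch using only the Python standard library, assuming the cluster reroute API with `retry_failed=true` is the call being used and that the HTTP endpoint from the curl command above is reachable without authentication. The function names are hypothetical.

```python
import urllib.request

# Host taken from the curl command earlier in this report.
ES_URL = "http://192.168.0.208:9200"


def build_retry_failed_request(base_url: str) -> urllib.request.Request:
    """Build a POST to the cluster reroute API that retries failed allocations."""
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true",
        method="POST",
    )


def retry_failed_allocations(base_url: str = ES_URL) -> str:
    """Send the request and return the raw response body (needs a reachable cluster)."""
    request = build_retry_failed_request(base_url)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling `retry_failed_allocations()` only retries the allocations. It does not lower heap usage on the master, which is consistent with the shards getting stuck again later.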

I think internal actions such as shard started/failed should not trigger the circuit breaker. We should change this code https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java#L100 to:

        // The two boolean arguments are forceExecution and canTripCircuitBreaker;
        // passing false for the second exempts these actions from the circuit breaker.
        transportService.registerRequestHandler(SHARD_STARTED_ACTION_NAME, ThreadPool.Names.SAME, false, false, StartedShardEntry::new,
            new ShardStartedTransportHandler(clusterService,
                new ShardStartedClusterStateTaskExecutor(allocationService, rerouteService, () -> followUpRerouteTaskPriority, logger),
                logger));
        transportService.registerRequestHandler(SHARD_FAILED_ACTION_NAME, ThreadPool.Names.SAME, false, false, FailedShardEntry::new,
            new ShardFailedTransportHandler(clusterService,
                new ShardFailedClusterStateTaskExecutor(allocationService, rerouteService, () -> followUpRerouteTaskPriority, logger),
                logger));

Steps to Reproduce

A cluster with high JVM heap usage can reproduce this issue.

Logs (if relevant)

No response

@xiaoyuan0821 xiaoyuan0821 added >bug needs:triage Requires assignment of a team area label labels Jan 10, 2023
@DaveCTurner
Contributor

Circuit-breaking on these messages is the correct behaviour: if the master is overloaded, we want to push back on the rest of the cluster. If we didn't, the master would just go OOM.

Moreover, quoting the bug report form:

Please also check your OS is supported, and that the version of Elasticsearch has not passed end-of-life. If you are using an unsupported OS or an unsupported version then the issue is likely to be closed.

You are using 7.10.2 which is long past EOL, and newer versions are much less likely to circuit-break on the master (see e.g. #77466). Therefore I am closing this.
