Nested RemoteTransportExceptions flood the logs and fill the disk #19187
Comments
I think that this is a manifestation of #12573. It can happen when the target node of a primary relocation takes a long time to apply the cluster state that marks it as holding the new primary. It is fixed in the upcoming v5.0.0 (#16274). The question is why the node took so long to apply a cluster state update. Is there anything else in the logs that might indicate this? What is the timestamp of the last log entry that has one of those huge exception traces?
@ywelsch this is still unclear; the first trace was at 08:34:20,553 and the last one was a StackOverflow at 09:48:19,128:
Note that it's happening again right now with two other nodes, elastic1021 and elastic1036 (still the master). Unfortunately, keeping the logs is difficult (the disk is full).
It is tricky to verify that this is indeed #12573 (if so, we could think about backporting #16274). Once the exceptions start bubbling up, the nodes have up-to-date cluster states (i.e. the node holding the primary relocation target now has the cluster state in which the primary relocation is completed). The exceptions only pile up on top of each other while the deep call stack of the recursive calls between the nodes is being unwound. Is the rolling restart of the cluster completed by now? If so, are there any shard relocations in progress? @bleskes thoughts?
Note that I'm pretty sure that the second time it happened the cluster was green, but certainly with shards relocating. I've restarted the master to schedule a new election; we'll monitor the cluster state and comment on this ticket with any new relevant info. I agree with you: I'm not sure the exception timestamps in the logs are relevant, because it seems to be a recursion problem and most of the logs were generated by the circuit breaker in ExceptionsHelper#unwrapCause (I wonder if the same kind of circuit breaker should be added to the logger itself to avoid writing bazillions of "Caused by" lines).
…lper#unwrapCause This code seems to be a circuit breaker to prevent digging too deeply into the exception causes. If we trip this circuit breaker, it's likely that something very bad is happening. Unfortunately, we then ask the logger to log the full stack of all the causes. This can generate gigabytes of logs in a few minutes and fill up all disk space. It happened on a cluster affected by what seems to be a recursion bug (c.f. elastic#19187), where it generated 27 GB of logs in less than 5 minutes. While this code is useful for debugging problematic exceptions, it may generate so many lines that it causes "no space left on device" errors, making debugging the root cause even harder.
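For reference, a minimal sketch of the two guards being discussed, assuming a hypothetical helper class; this is not the actual ExceptionsHelper code, and the names MAX_UNWRAP_DEPTH, MAX_LOGGED_CAUSES, and formatCauses are made up. The unwrap loop stops after a fixed depth, and the logger-side guard caps how many "Caused by" lines are ever written:

```java
/**
 * Minimal sketch (NOT the actual Elasticsearch implementation) of the two
 * guards discussed above: unwrapping stops after a fixed depth, and the
 * logger prints only a bounded number of "Caused by" lines.
 */
public final class CauseChainGuards {

    private static final int MAX_UNWRAP_DEPTH = 10;   // unwrap circuit breaker
    private static final int MAX_LOGGED_CAUSES = 100; // hypothetical logging cap

    /** Follow getCause() at most MAX_UNWRAP_DEPTH times instead of walking a huge chain. */
    public static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        for (int depth = 0; depth < MAX_UNWRAP_DEPTH && result.getCause() != null; depth++) {
            result = result.getCause();
        }
        return result;
    }

    /** Render at most MAX_LOGGED_CAUSES causes so a pathological chain cannot fill the disk. */
    public static String formatCauses(Throwable t) {
        StringBuilder sb = new StringBuilder(t.toString());
        Throwable cause = t.getCause();
        int logged = 0;
        while (cause != null && logged++ < MAX_LOGGED_CAUSES) {
            sb.append("\nCaused by: ").append(cause);
            cause = cause.getCause();
        }
        if (cause != null) {
            sb.append("\n... remaining causes truncated ...");
        }
        return sb.toString();
    }
}
```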
Can you share a pair of subsequent log messages, one with
It happened again today.
It's unclear to me how node_concurrent_recoveries and cluster_concurrent_rebalance interact. What happens if the cluster decides to rebalance more than 3 shards to the same node? Will node_concurrent_recoveries prevent this from happening? I think that what saves us from an OOM is a StackOverflow when the huge exception is serialized. @jasontedor here are the first 4 log entries: (https://gist.github.com/nomoa/2ee1f8bb44a4c6c01c400787d66bc383). The pattern seems to be 2 with 13 causes, 2 with 15 causes, and so on...
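For illustration, here is a simplified sketch of how the two limits could interact; this is not the actual ThrottlingAllocationDecider or balancer code, and all class, method, and parameter names below are hypothetical. The idea is that cluster_concurrent_rebalance caps rebalancing relocations cluster-wide, while node_concurrent_recoveries caps how many recoveries a single node participates in, so extra relocations targeting the same node should get throttled even if the balancer would like to move them:

```java
import java.util.Map;

/**
 * Simplified sketch of how the two throttles could combine; not Elasticsearch code.
 */
public final class AllocationThrottleSketch {

    // Hypothetical stand-ins for the two cluster settings discussed above.
    private final int nodeConcurrentRecoveries;   // cluster.routing.allocation.node_concurrent_recoveries
    private final int clusterConcurrentRebalance; // cluster.routing.allocation.cluster_concurrent_rebalance

    public AllocationThrottleSketch(int nodeConcurrentRecoveries, int clusterConcurrentRebalance) {
        this.nodeConcurrentRecoveries = nodeConcurrentRecoveries;
        this.clusterConcurrentRebalance = clusterConcurrentRebalance;
    }

    /**
     * A rebalancing relocation to targetNode is allowed only if both the
     * cluster-wide rebalance limit and the per-node recovery limit still
     * have headroom, so extra relocations to the same node get throttled.
     */
    public boolean canRebalanceTo(String targetNode,
                                  int relocationsInProgressClusterWide,
                                  Map<String, Integer> recoveriesPerNode) {
        if (relocationsInProgressClusterWide >= clusterConcurrentRebalance) {
            return false; // cluster_concurrent_rebalance exhausted
        }
        int recoveriesOnTarget = recoveriesPerNode.getOrDefault(targetNode, 0);
        return recoveriesOnTarget < nodeConcurrentRecoveries; // node_concurrent_recoveries check
    }
}
```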
@nomoa I've back-ported the fix for #12573 to 2.4 (#19296). All information so far indicates that this is the issue you're experiencing. Unfortunately, my back-port was too late to make it into 2.3.4, so you will have to wait for 2.4.0 to test it out. In the meantime, I wonder if dedicated master nodes would help here. If I understand correctly, this issue appeared only when a primary shard on the master node was involved. As cluster states are applied on all the other nodes before they are applied on the master node, and if cluster state application is slow (e.g. due to a large number of indices/shards), having dedicated master nodes might decrease the window in which cluster states are out of sync on the nodes holding the primary relocation source and target. It might also be interesting to raise the logging level of "org.elasticsearch.cluster.service" to DEBUG to see how long nodes take to apply the cluster state (messages of the form "processing [{}]: took {} done applying updated cluster_state").
@ywelsch awesome, thanks for the backport. Yes, it always happened on shards where the master was involved, though if I understood correctly this specific issue could also happen between two data nodes. Note that it's not the first time we've suspected the master of being too busy to act properly. Moving to dedicated master nodes is on our todo list; thanks for the suggestions.
Happening for me as well. I have disabled logging for the time being. Waiting for ES 2.4.0 :) |
Closed by #19296. Once 2.4.0 is out, please ping on this ticket if you're still seeing the same issue.
It happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).
A RemoteTransportException seemed to "loop?" between the 2 nodes, causing Elasticsearch to log bigger and bigger exception traces, as each new RemoteTransportException seemed to be created with the previous one, and all of its causes, as its cause chain.
The first trace (on elastic1045) was:
The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth ones had 14 causes, the fifth and sixth 16 causes, and so on...
The last one I've seen had 1982 chained causes.
The logs were nearly the same on elastic1036 (the master), generating 27 GB of logs in a few minutes on both nodes.
Surprisingly, the cluster was still performing relatively well, though with higher GC activity on these nodes.
Then (maybe 1 hour after the first trace) elastic1045 was dropped from the cluster:
It was immediately re-added and the log flood stopped.
I'll comment on this ticket if it happens again.
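To make the growth pattern described above concrete, here is a toy simulation; this is not Elasticsearch code, and the node names and messages are made up. If each node wraps the exception it received from the other in a new exception and sends it back, the chain of causes grows by two per round trip, matching the 12, 14, 16, ... progression seen in the logs:

```java
/**
 * Toy simulation of the ping-pong wrapping between two nodes: the cause
 * chain grows by two on every round trip.
 */
public final class CauseGrowthSimulation {

    public static void main(String[] args) {
        Exception failure = new Exception("initial shard failure");
        for (int roundTrip = 1; roundTrip <= 5; roundTrip++) {
            // Node A wraps what it received from node B and sends it back ...
            failure = new Exception("remote failure on elastic1045", failure);
            // ... and node B wraps node A's response again.
            failure = new Exception("remote failure on elastic1036", failure);
            System.out.println("round trip " + roundTrip + ": chain length = " + chainLength(failure));
        }
    }

    /** Count the exception plus all of its chained causes. */
    private static int chainLength(Throwable t) {
        int length = 0;
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            length++;
        }
        return length;
    }
}
```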