Nested RemoteTransportExceptions flood the logs and fill the disk #19187

Closed
nomoa opened this issue Jun 30, 2016 · 10 comments
Labels
>bug, :Distributed Indexing/Recovery, feedback_needed, v2.4.0

Comments

@nomoa
Contributor

nomoa commented Jun 30, 2016

It happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).

A RemoteTransportException seemed to bounce back and forth between two nodes, causing Elasticsearch to log bigger and bigger exception traces: each new RemoteTransportException appeared to wrap the previous one along with all of its causes.

The first trace (on elastic1045) was:

[2016-06-30 08:34:20,553][WARN ][org.elasticsearch        ] Exception cause unwrapping ran for 10 levels...
RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1045][10.64.48.143:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic1036][10.64.16.45:9300][indices:data/write/bulk[s][p]]]; nested: IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]];
[[11 lines of Caused by: RemoteTransportException]]
Caused by: [itwiki_general_1415230945][[itwiki_general_1415230945][2]] IllegalIndexShardStateException[CurrentState[POST_RECOVERY] operation only allowed when started/recovering, origin [PRIMARY]]
        at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:1062)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:593)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
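
For context, the "Exception cause unwrapping ran for 10 levels..." warning above comes from the depth limit in ExceptionsHelper#unwrapCause (mentioned further down in this thread). A simplified sketch of that kind of depth-limited unwrapping, illustrative only and not the actual 2.3 source (the WrapperException marker, RemoteWrapper class and logger setup are stand-ins), shows why the log grows so fast: once the limit is hit, the whole original throwable, with every nested cause, is handed to the logger.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Simplified sketch of a depth-limited cause unwrapper (not the actual
    // org.elasticsearch.ExceptionsHelper#unwrapCause implementation).
    public final class UnwrapSketch {
        private static final Logger LOGGER = Logger.getLogger("org.elasticsearch");

        // Marker playing the role of ElasticsearchWrapperException.
        interface WrapperException {}

        // Stand-in for RemoteTransportException: it only wraps another throwable.
        static class RemoteWrapper extends RuntimeException implements WrapperException {
            RemoteWrapper(String node, Throwable cause) {
                super("RemoteTransportException[" + node + "]", cause);
            }
        }

        public static Throwable unwrapCause(Throwable t) {
            int counter = 0;
            Throwable result = t;
            while (result instanceof WrapperException) {
                if (result.getCause() == null || result.getCause() == result) {
                    return result;
                }
                if (counter++ > 10) {
                    // The warning seen in the log above; the full throwable (and
                    // therefore every nested cause) is passed to the logger here.
                    LOGGER.log(Level.WARNING, "Exception cause unwrapping ran for 10 levels...", t);
                    return result;
                }
                result = result.getCause();
            }
            return result;
        }

        public static void main(String[] args) {
            Throwable deep = new IllegalStateException("operation only allowed when started/recovering");
            for (int i = 0; i < 50; i++) {
                deep = new RemoteWrapper(i % 2 == 0 ? "elastic1036" : "elastic1045", deep);
            }
            System.out.println("unwrapped to: " + unwrapCause(deep));
        }
    }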

The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth had 14 causes, the fifth and sixth 16 causes, and so on.
The last one I saw had 1982 chained causes.
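
A minimal, self-contained illustration of that growth in plain Java (a toy sketch, not Elasticsearch code; RemoteWrapper is a hypothetical stand-in for RemoteTransportException): if each of the two nodes wraps the failure it receives from the other before handing it back, the cause chain grows by two levels per round trip, which matches the two-at-a-time progression above.

    // Toy demonstration: two "nodes" bounce a failed bulk operation back and
    // forth, each hop wrapping the throwable it received, so the cause chain
    // grows by two per round trip.
    public class NestedWrappingDemo {
        // Hypothetical stand-in for RemoteTransportException: it only wraps a cause.
        static class RemoteWrapper extends RuntimeException {
            RemoteWrapper(String node, Throwable cause) {
                super("RemoteTransportException[" + node + "]", cause);
            }
        }

        static int causeDepth(Throwable t) {
            int depth = 0;
            for (Throwable c = t.getCause(); c != null && c != t; c = c.getCause()) {
                depth++;
            }
            return depth;
        }

        public static void main(String[] args) {
            Throwable failure = new IllegalStateException(
                    "CurrentState[POST_RECOVERY] operation only allowed when started/recovering");
            for (int roundTrip = 1; roundTrip <= 5; roundTrip++) {
                failure = new RemoteWrapper("elastic1036", failure); // one node wraps the failure
                failure = new RemoteWrapper("elastic1045", failure); // the other wraps it again
                System.out.println("round trip " + roundTrip + ": "
                        + causeDepth(failure) + " chained causes");
            }
        }
    }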

The logs were nearly the same on elastic1036 (the master), generating 27 GB of logs in a few minutes on both nodes.

Surprisingly, the cluster was still performing relatively well, with higher GC activity on these nodes.

Then (maybe 1 hour after the first trace) elastic1045 was dropped from the cluster:

[2016-06-30 09:48:25,953][INFO ][discovery.zen            ] [elastic1045] master_left [{elastic1036}{DUOG0aGqQ3Gajr_wcFTOyw}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]

It was immediately re-added and the log flood stopped.

I'll comment on this ticket if it happens again.

@clintongormley added the discuss and :Distributed Indexing/Recovery labels Jun 30, 2016
@ywelsch
Contributor

ywelsch commented Jun 30, 2016

I think that this is a manifestation of #12573. It can happen when the target node of a primary relocation takes a long time to apply the cluster state that contains the information that it has the new primary. It is fixed in the upcoming v5.0.0 (#16274). The question is why the node took so long to apply a cluster state update. Is there anything else in the logs that might indicate this? What is the time stamp of the last log entry that has one of those huge exception traces?

@ywelsch ywelsch removed the discuss label Jun 30, 2016
@nomoa
Contributor Author

nomoa commented Jun 30, 2016

@ywelsch this is still unclear; the first trace was at 08:34:20,553 and the last one was a StackOverflowError at 09:48:19,128:

[2016-06-30 09:48:19,128][WARN ][action.bulk              ] [elastic1036] Failed to send response for indices:data/write/bulk[s]
java.lang.StackOverflowError
        at java.lang.Exception.<init>(Exception.java:84)
        at java.lang.RuntimeException.<init>(RuntimeException.java:80)
        at org.elasticsearch.ElasticsearchException.<init>(ElasticsearchException.java:84)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.<init>(NotSerializableExceptionWrapper.java:41)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:560)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.writeTo(NotSerializableExceptionWrapper.java:65)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:564)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.writeTo(NotSerializableExceptionWrapper.java:65)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:564)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.writeTo(NotSerializableExceptionWrapper.java:65)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:564)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.writeTo(NotSerializableExceptionWrapper.java:65)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:564)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
        at org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper.writeTo(NotSerializableExceptionWrapper.java:65)
        at org.elasticsearch.common.io.stream.StreamOutput.writeThrowable(StreamOutput.java:564)
        at org.elasticsearch.ElasticsearchException.writeTo(ElasticsearchException.java:226)
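
The trace above shows StreamOutput.writeThrowable and ElasticsearchException.writeTo calling each other once per nested cause, so the recursion depth equals the length of the cause chain. A minimal sketch of the same failure mode in plain Java (illustrative only, not the Elasticsearch serialization code; the chain length of 100,000 is just chosen to be deep enough to overflow a default-sized thread stack):

    // Toy reproduction of the failure mode: serializing a throwable by recursing
    // into its cause uses one stack frame per nested cause, so a pathologically
    // deep chain ends in StackOverflowError.
    public class RecursiveSerializationDemo {
        static void writeThrowable(Throwable t, StringBuilder out) {
            out.append(t.getClass().getSimpleName())
               .append(": ").append(t.getMessage()).append('\n');
            if (t.getCause() != null && t.getCause() != t) {
                writeThrowable(t.getCause(), out); // one extra frame per nested cause
            }
        }

        public static void main(String[] args) {
            Throwable chain = new RuntimeException("root cause");
            for (int i = 0; i < 100_000; i++) {
                chain = new RuntimeException("wrapper " + i, chain);
            }
            try {
                writeThrowable(chain, new StringBuilder());
                System.out.println("serialized without error");
            } catch (StackOverflowError e) {
                System.out.println("StackOverflowError while writing the cause chain, as in the trace above");
            }
        }
    }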

Note that it's happening again right now with two other nodes, elastic1021 and elastic1036 (still the master).

Unfortunately, keeping the logs is difficult (the disk fills up).

@ywelsch
Contributor

ywelsch commented Jun 30, 2016

It is tricky to verify that this is indeed #12573 (if so, we could think about backporting #16274). Once the exceptions start bubbling up, the nodes have up-to-date cluster states (i.e. the node with the primary relocation target now has the cluster state where primary relocation is completed). It's just that, as the deep call stack of the recursive calls between the nodes unwinds, the exceptions get stacked on top of each other.

Is the rolling restart of the cluster completed by now? If so, are there any shard relocations in progress?

@bleskes thoughts?

@nomoa
Contributor Author

nomoa commented Jun 30, 2016

Note that I'm pretty sure that the second time it happened the cluster was green, but certainly with shards relocating.

I've restarted the master to trigger a new election; we'll monitor the cluster state and comment on this ticket with any new relevant info.

I agree with you: I'm not sure the exception timestamps in the logs are relevant, because it seems to be a recursion problem, and most of the logs were generated by the circuit breaker in ExceptionsHelper#unwrapCause (I wonder if the same kind of circuit breaker should be added to the logger itself, to avoid writing bazillions of Caused by lines).
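
To illustrate the idea, a rough sketch of such a logger-side guard (plain Java, not Elasticsearch code; the cap of 20 levels is arbitrary and the class name is made up): render at most a fixed number of Caused by levels, so a single pathological exception cannot fill the disk.

    import java.io.PrintWriter;
    import java.io.StringWriter;

    // Rough sketch of a cause-depth-capped exception formatter: stop rendering
    // "Caused by:" levels past a fixed limit instead of printing them all.
    public final class BoundedCauseFormatter {
        private static final int MAX_CAUSES = 20; // arbitrary cap, for illustration only

        public static String format(Throwable t) {
            StringWriter out = new StringWriter();
            PrintWriter pw = new PrintWriter(out);
            int level = 0;
            for (Throwable cur = t; cur != null; cur = cur.getCause()) {
                if (level++ >= MAX_CAUSES) {
                    pw.println("... (remaining causes truncated)");
                    break;
                }
                pw.println((level == 1 ? "" : "Caused by: ") + cur);
                for (StackTraceElement frame : cur.getStackTrace()) {
                    pw.println("\tat " + frame);
                }
            }
            pw.flush();
            return out.toString();
        }
    }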

nomoa added a commit to nomoa/elasticsearch that referenced this issue Jul 6, 2016
…lper#unwrapCause

This code seems to be a circuit breaker to prevent digging too deeply into the
exception causes.
If we enter this circuit breaker, it's likely that something very bad is
happening. Unfortunately, we then ask the logger to log the full stack of all the causes,
which can generate gigabytes of logs in a few minutes and fill up all disk space.
It happened on a cluster affected by what seems to be a recursion bug (c.f. elastic#19187),
where it generated 27 GB of logs in less than 5 minutes.
While this code is useful for debugging problematic exceptions, it may generate too
many lines, causing "no space left on device" errors and thus making the
root cause even harder to debug.
@jasontedor
Member

jasontedor commented Jul 6, 2016

Can you share a pair of consecutive log messages, one with n causes and the next with n + 2 causes, showing the full stack trace for each log entry?

@nomoa
Contributor Author

nomoa commented Jul 6, 2016

It happened again today.
What seems clear is that it happens when the cluster goes back to green after a node restart, i.e. when rebalancing starts.
Cluster settings:

  • node_concurrent_recoveries: 3
  • node_initial_primaries_recoveries: 3
  • cluster_concurrent_rebalance: 16
  • max_bytes_per_sec is pretty low at 20mb, with 3 concurrent_streams (we encounter latency issues if we set it higher)

It's unclear to me how node_concurrent_recoveries and cluster_concurrent_rebalance interact. What happens if the cluster decides to rebalance more than 3 shards to the same node? Will node_concurrent_recoveries prevent this from happening?

I think what saves us from an OOM is a StackOverflowError when the huge exception is serialized.

@jasontedor here are the first 4 log entries: (https://gist.github.com/nomoa/2ee1f8bb44a4c6c01c400787d66bc383)

Here the pattern seems to be 2 entries with 13 causes, 2 with 15 causes, and so on.
The last one I saw in the log before the disk filled up seemed to have 1085 causes. That single log entry was 54 MB of text.

@ywelsch
Contributor

ywelsch commented Jul 7, 2016

@nomoa I've back-ported the fix for #12573 to 2.4 (#19296). All information so far indicates that this is the issue you're experiencing. Unfortunately my back-port was too late to make it into 2.3.4, so you will have to wait for 2.4.0 to test it out. In the meantime, I wonder if dedicated master nodes would help here. If I understand correctly, this issue appeared only when a primary shard on the master node was involved. As cluster states are first applied on all the other nodes before being applied on the master node, and if cluster state application is slow (due to a large number of indices/shards, etc.), having dedicated master nodes might decrease the time during which cluster states are out of sync on the nodes holding the primary relocation source and target.

It might also be interesting to increase the logging level of "org.elasticsearch.cluster.service" to DEBUG to see how long nodes take to apply the cluster state (messages of the form "processing [{}]: took {} done applying updated cluster_state").

@nomoa
Contributor Author

nomoa commented Jul 7, 2016

@ywelsch awesome, thanks for the backport.

Yes, it always happened on shards where the master was involved, although if I understood correctly this specific issue could also happen between two data nodes. Note that it's not the first time we've suspected the master of being too busy to act properly.

Moving to dedicated master nodes is on our to-do list; thanks for the suggestions.

@rohit01

rohit01 commented Aug 22, 2016

Happening for me as well. I have disabled logging for the time being. Waiting for ES 2.4.0 :)

@clintongormley
Contributor

Closed by #19296

Once 2.4.0 is out, please ping on this ticket if you're still seeing the same issue.
